Training Job Failed Due to OOM

Symptom

If a training job fails due to out of memory (OOM), the possible symptoms are as follows:

  1. Error code 137 is returned.

  2. The log file contains error information with the keyword killed.

    **Figure 1** Error log

  3. The error message "RuntimeError: CUDA out of memory." is printed in the logs.

    **Figure 2** Error log

  4. The error message "Dst tensor is not initialized" is printed in TensorFlow logs.
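
If the training code can be modified, the point of failure can also be captured explicitly so that the allocator state appears in the logs next to the symptoms above. The following is a minimal sketch assuming PyTorch; run_step and its arguments are illustrative and not part of any ModelArts API.

    import torch

    def run_step(model, batch):
        """Illustrative wrapper: run one step and dump allocator state on CUDA OOM."""
        try:
            return model(batch)
        except RuntimeError as err:
            # PyTorch reports GPU OOM as a RuntimeError whose message contains "out of memory".
            if "out of memory" in str(err):
                # Prints allocated/reserved memory per device for later analysis.
                print(torch.cuda.memory_summary())
            raise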

Possible Causes

The possible causes are as follows:

  • GPU memory is insufficient.

  • OOM occurs on certain nodes. This issue is typically caused by a fault on those nodes.
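
To help distinguish the two causes, per-GPU memory usage can be logged on every node before the failure. The snippet below is a sketch that uses standard PyTorch CUDA queries; it is not a ModelArts-specific API.

    import torch

    def report_gpu_memory():
        # Print allocated, reserved, and total memory for each visible GPU.
        # If only one node of a multi-node job reports abnormal values,
        # a node fault is more likely than an undersized GPU.
        gib = 1024 ** 3
        for i in range(torch.cuda.device_count()):
            total = torch.cuda.get_device_properties(i).total_memory / gib
            allocated = torch.cuda.memory_allocated(i) / gib
            reserved = torch.cuda.memory_reserved(i) / gib
            print(f"GPU {i}: {allocated:.2f} GiB allocated, "
                  f"{reserved:.2f} GiB reserved, {total:.2f} GiB total")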

Solution

  1. Reduce memory usage by modifying hyperparameter settings and releasing unnecessary tensors (a combined sketch follows this list).

    1. Modify network parameters, such as batch_size, hide_layer, and cell_nums.

    2. Release unnecessary tensors.

      # Drop the reference to a tensor that is no longer needed, then return
      # cached GPU memory to the driver.
      del tmp_tensor
      torch.cuda.empty_cache()
      
  2. Use a local PyCharm instance to remotely connect to the notebook for debugging.

  3. If the fault persists, submit a service ticket to locate the fault or, if necessary, isolate the affected node.
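
The sketch below combines steps 1.1 and 1.2 in a single PyTorch training loop. The model, data, and sizes are placeholders chosen only to keep the example self-contained; the relevant parts are the reduced batch_size and the del/empty_cache() pattern at the end of each iteration.

    import torch
    from torch import nn

    batch_size = 16                            # step 1.1: halve this first when OOM occurs
    model = nn.Linear(1024, 10).cuda()         # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(100):
        inputs = torch.randn(batch_size, 1024, device="cuda")        # placeholder batch
        labels = torch.randint(0, 10, (batch_size,), device="cuda")  # placeholder labels
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()

        # Step 1.2: drop references to large temporaries and return cached
        # blocks to the CUDA driver so later iterations can reuse the memory.
        del inputs, labels, loss
        torch.cuda.empty_cache()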

Summary and Suggestions

Before creating a training job, debug the training code in the ModelArts development environment to eliminate as many code migration errors as possible.