Training Job Failed Due to OOM

Symptom

If a training job fails due to out of memory (OOM), the possible symptoms are as follows:

  1. Error code 137 is returned.

  2. The log file contains error information with the keyword killed.

    **Figure 1** Error log

  3. The error message "RuntimeError: CUDA out of memory." is printed in the logs.

    **Figure 2** Error log

  4. The error message "Dst tensor is not initialized" is printed in TensorFlow logs.
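
If the training code can be modified, the point of failure can also be captured explicitly so that the allocator state appears in the logs next to the symptoms above. The following is a minimal sketch assuming PyTorch; run_step and its arguments are illustrative and not part of any ModelArts API.

    import torch

    def run_step(model, batch):
        """Illustrative wrapper: run one step and dump allocator state on CUDA OOM."""
        try:
            return model(batch)
        except RuntimeError as err:
            # PyTorch reports GPU OOM as a RuntimeError whose message contains "out of memory".
            if "out of memory" in str(err):
                # Prints allocated/reserved memory per device for later analysis.
                print(torch.cuda.memory_summary())
            raise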

Possible Causes

The possible causes are as follows:

  • GPU memory is insufficient.

  • OOM occurs on certain nodes. This issue is typically caused by a fault on those nodes.
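
To help distinguish the two causes, per-GPU memory usage can be logged on every node before the failure. The snippet below is a sketch that uses standard PyTorch CUDA queries; it is not a ModelArts-specific API.

    import torch

    def report_gpu_memory():
        # Print allocated, reserved, and total memory for each visible GPU.
        # If only one node of a multi-node job reports abnormal values,
        # a node fault is more likely than an undersized GPU.
        gib = 1024 ** 3
        for i in range(torch.cuda.device_count()):
            total = torch.cuda.get_device_properties(i).total_memory / gib
            allocated = torch.cuda.memory_allocated(i) / gib
            reserved = torch.cuda.memory_reserved(i) / gib
            print(f"GPU {i}: {allocated:.2f} GiB allocated, "
                  f"{reserved:.2f} GiB reserved, {total:.2f} GiB total")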

Solution

  1. Reduce memory usage by modifying hyperparameter settings and releasing unnecessary tensors (a combined sketch follows this list).

    1. Modify network parameters, such as batch_size, hide_layer, and cell_nums.

    2. Release unnecessary tensors.

      # Drop the reference to a tensor that is no longer needed, then return
      # cached GPU memory to the driver.
      del tmp_tensor
      torch.cuda.empty_cache()
      
  2. Use a local PyCharm instance to remotely connect to the notebook for debugging.

  3. If the fault persists, submit a service ticket to locate the fault or, if necessary, isolate the affected node.
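
The sketch below combines steps 1.1 and 1.2 in a single PyTorch training loop. The model, data, and sizes are placeholders chosen only to keep the example self-contained; the relevant parts are the reduced batch_size and the del/empty_cache() pattern at the end of each iteration.

    import torch
    from torch import nn

    batch_size = 16                            # step 1.1: halve this first when OOM occurs
    model = nn.Linear(1024, 10).cuda()         # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(100):
        inputs = torch.randn(batch_size, 1024, device="cuda")        # placeholder batch
        labels = torch.randint(0, 10, (batch_size,), device="cuda")  # placeholder labels
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()

        # Step 1.2: drop references to large temporaries and return cached
        # blocks to the CUDA driver so later iterations can reuse the memory.
        del inputs, labels, loss
        torch.cuda.empty_cache()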

Summary and Suggestions

Before creating a training job, debug the training code in the ModelArts development environment to eliminate as many code migration errors as possible.