Suspension in the Last Training Epoch
Symptom
The logs show that an error occurred when the data was split. As a result, the processes end up in different epochs, and the processes that have not finished are suspended because they never receive a response from the other processes. As shown in the following figure, some processes are in epoch 48 while others are in epoch 49 at the same time.
Solution
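Ensure that all processes are in the same epoch. Because the symptom comes from the data split, check that the data is divided so that every process handles the same amount of data in each epoch and therefore finishes each epoch together with the others.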
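One possible way to keep the processes aligned is to give every process the same number of samples, either by padding the split with repeated samples or by dropping the remainder. The following is a minimal sketch of that idea, not the training framework's own API: the function `shard_indices` and its parameters `num_samples`, `world_size`, `rank`, and `drop_last` are hypothetical names chosen for illustration, and the sketch assumes the dataset can be addressed by integer index.

```python
def shard_indices(num_samples, world_size, rank, drop_last=False):
    """Return the sample indices for one process.

    Hypothetical helper: every rank receives a list of the same length,
    so all processes run the same number of steps per epoch.
    """
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")

    indices = list(range(num_samples))
    if drop_last:
        # Truncate so the total divides evenly by the number of processes.
        total = (num_samples // world_size) * world_size
        indices = indices[:total]
    else:
        # Pad by repeating samples from the start until the total
        # divides evenly by the number of processes.
        per_rank = -(-num_samples // world_size)  # ceiling division
        total = per_rank * world_size
        while len(indices) < total:
            indices += indices[:total - len(indices)]

    # Strided split: rank r takes positions r, r + world_size, ...
    return indices[rank:total:world_size]


if __name__ == "__main__":
    # 10 samples over 4 processes: each process gets 3 indices.
    for r in range(4):
        print(r, shard_indices(10, 4, r))
```

With padding, a few samples are seen twice in an epoch; with `drop_last=True`, a few samples are skipped. Either choice keeps the per-process step count identical, which is the property the solution above relies on.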