Creating a Training Job¶
ModelArts training management lets you create training jobs, view their statuses, and manage job versions. Model training is an iterative optimization process. Through training management, you can flexibly select algorithms, data, and hyperparameters to find the best input configuration and model. By comparing metrics between job versions, you can identify the training job that performs best.
Prerequisites¶
Training data is available, either as a dataset created in ModelArts or as data uploaded to an OBS directory.
The algorithm used for training has been created using a custom script on the Algorithm Management page. For details, see Using a Custom Script.
At least one empty folder has been created in OBS for storing the training output.
The OBS directory you use and ModelArts are in the same region.
Creating a Training Job¶
Log in to the ModelArts management console. In the navigation pane, choose Settings to check whether an authorization has been configured. For details, see Configuring Access Authorization (Global Configuration).
In the left navigation pane, choose Training Management > Training Jobs (New). The training job list is displayed by default.
Click Create Training Job, switch to the deep learning tab page, and set parameters.
Set basic parameters, including Name, Description, and Experiment. The options for Experiment are Create new, Use existing, and Not required. If you set Experiment to Create new, enter the experiment name and description.
Select an algorithm source based on the algorithm type.
Parameter
Sub-Parameter
Description
Algorithms
My Algorithms
Click My Algorithms and select the target algorithm.
If no algorithm is available, click Create or Create Quickly.
Custom image
Set the image path, code directory, and boot command.
Training Input
N/A
Select an OBS path as the training input.
The training input must match the data input configuration set in your selected algorithm. For details, see Table 2. For example, if you use argparse in the training code to parse data_url into the data input, set the data input parameter to data_url when creating the algorithm. You can select a dataset or OBS path for data input. When the training job is started, ModelArts automatically downloads the data in the input path to the local container directory for training.
Dataset
Select an available dataset and its version from ModelArts Data Management.
Click Dataset and select the target dataset and its version in the dialog box displayed.
Note
If Dataset is unavailable, the selected algorithm does not support a dataset as its training data source.
Data path
Select the training data from your OBS bucket.
Click Data path and select the OBS bucket and folder in the dialog box displayed.
Note
If Data path is unavailable, the selected algorithm does not support an OBS path as its training data source.
Training Output
N/A
Select an OBS path for storing the training result. To minimize errors, select an empty directory.
The training output must match the data output configuration set in your selected algorithm. For details, see Table 3. For example, if you use argparse in the algorithm code to parse train_url into the data output, set the data output parameter to train_url when creating the algorithm. You can select an OBS path for data output. During training, ModelArts automatically uploads the training output to the OBS path.
Hyperparameters
N/A
Hyperparameters vary according to the selected algorithm.
If you defined hyperparameters when creating the algorithm, all of them are displayed here. Whether a hyperparameter can be modified or deleted depends on the constraints you configured when creating the algorithm. For details, see Defining Hyperparameters.
If you enabled custom hyperparameters when creating the algorithm, you can click Add Hyperparameter to add hyperparameters for tuning.
Environment Variable
N/A
Add user-defined environment variables.
Auto Restart
N/A
After this parameter is enabled, a failed training job is automatically resubmitted and rerun. Set the parameter value to the maximum number of retries.
Note
The training input, training output, and hyperparameters vary according to the selected algorithm.
If the system displays a message for Training Input, indicating there is no input channel for the selected algorithm, you do not need to set data input on this page.
If the system displays a message for Training Output, indicating there is no output channel for the selected algorithm, you do not need to set data output on this page.
If the system displays a message for Hyperparameters, indicating the selected algorithm does not support custom hyperparameters, you do not need to set hyperparameters on this page.
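The parameter names configured for Training Input, Training Output, and Hyperparameters must match the names that the training code parses. The following is a minimal sketch of that parsing, not the code of any particular algorithm. It assumes the data input parameter is named data_url and the data output parameter is named train_url, as in the examples above, and that hyperparameters reach the script as command-line arguments. The learning_rate hyperparameter and the EXAMPLE_FLAG environment variable are hypothetical names used only for illustration.

```python
import argparse
import os


def parse_job_args(argv=None):
    """Parse the arguments a training job is assumed to pass to the script."""
    parser = argparse.ArgumentParser(description="Training script arguments")
    # Value passed for the data input parameter (training input location).
    parser.add_argument("--data_url", type=str, required=True)
    # Value passed for the data output parameter (training output location).
    parser.add_argument("--train_url", type=str, required=True)
    # Hypothetical hyperparameter; assumed to arrive as a command-line argument.
    parser.add_argument("--learning_rate", type=float, default=0.01)
    # parse_known_args ignores arguments that this script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Example call with explicit arguments instead of sys.argv.
example = parse_job_args(["--data_url", "/tmp/input", "--train_url", "/tmp/output"])

# Hypothetical user-defined variable, as set under Environment Variable.
example_flag = os.environ.get("EXAMPLE_FLAG", "unset")

print(example.data_url, example.train_url, example.learning_rate, example_flag)
```

If the names parsed by the script differ from those set when the algorithm was created, the job has no way to hand the selected paths to the code, so it is worth checking the two side by side before submitting.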
Select an instance flavor. The value range of the training parameters must comply with the constraints of the selected algorithm.
Parameter
Description
Resource Pool
Select a resource pool for the job. For training jobs, Public resource pools and Dedicated resource pools are available.
Resource Type
Select CPU or GPU as needed.
Instance Flavor
Select a resource flavor based on the resource type. If your algorithm has been defined to use CPUs or GPUs, only the options that comply with the constraints of the selected algorithm are available for you to choose.
The data disk capacity varies depending on the resource type. Before training, check that the data disk has enough free space for your training data. For details, see What Are Sizes of the /cache Directories for Different Resource Specifications in the Training Environment?
Compute Nodes
Set the number of compute nodes. The default value is 1.
Job Log Path
If you select a CPU or GPU flavor, you can enable Persistent Log Saving.
After you enable this function, select a path for storing the log files generated during job running.
If you select Ascend flavors, select a path for storing the log files generated during job running.
Select an empty OBS directory for storing log files.
Note
Ensure that you have read and write permissions to the selected OBS directory.
Auto Stop
After this parameter is enabled and the auto stop time is set, a training job automatically stops at the specified time.
If this function is disabled, the training job keeps running until it finishes or is stopped.
The options are 1 hour later, 2 hours later, 4 hours later, 6 hours later, and Custom.
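The resource settings in the table above can be written down and sanity-checked before they are entered on the console. The sketch below is a hypothetical helper for that purpose, not a ModelArts API: the ResourceSettings class, its field names, and the lowercase pool labels are assumptions made for illustration. Only the allowed values (public or dedicated pool, CPU or GPU, a default of one compute node, and the preset Auto Stop durations) come from the table.

```python
from dataclasses import dataclass
from typing import List, Optional

# Preset Auto Stop choices listed in the table, in hours.
AUTO_STOP_PRESETS_HOURS = (1, 2, 4, 6)


@dataclass
class ResourceSettings:
    """Hypothetical record of the resource choices made for a training job."""

    resource_pool: str = "public"            # "public" or "dedicated"
    resource_type: str = "CPU"               # "CPU" or "GPU"
    compute_nodes: int = 1                   # the documented default is 1
    log_path: Optional[str] = None           # OBS directory for job logs, if set
    auto_stop_hours: Optional[float] = None  # None means Auto Stop is disabled

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means none were found."""
        found = []
        if self.resource_pool not in ("public", "dedicated"):
            found.append("resource_pool must be 'public' or 'dedicated'")
        if self.resource_type not in ("CPU", "GPU"):
            found.append("resource_type must be 'CPU' or 'GPU'")
        if self.compute_nodes < 1:
            found.append("compute_nodes must be at least 1")
        if self.auto_stop_hours is not None and self.auto_stop_hours <= 0:
            found.append("auto_stop_hours must be positive when set")
        return found

    def uses_auto_stop_preset(self) -> bool:
        """True if the duration is one of the preset options rather than Custom."""
        return self.auto_stop_hours in AUTO_STOP_PRESETS_HOURS


settings = ResourceSettings(resource_type="GPU", compute_nodes=2, auto_stop_hours=4)
print(settings.problems(), settings.uses_auto_stop_preset())
```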
Click Submit to create the training job.
A training job generally runs for a period of time. To view its real-time status and basic information, switch to the training job list, where the newly created training job is in the Initializing state. When the status changes to Completed, the training job has finished and the generated model is stored in the path specified in Training Output. If the status is Failed, click the job name to go to the job details page and view the logs for troubleshooting. For details, see Viewing Job Details.
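The follow-up actions described above can be summarized as a simple lookup. This is an illustrative sketch only: next_step is a hypothetical helper, and it covers just the three statuses named in this section. Any other status falls through to a generic message.

```python
def next_step(status: str) -> str:
    """Map a status shown in the training job list to a suggested action."""
    normalized = status.strip().lower()
    if normalized == "initializing":
        return "Wait; the job was just created."
    if normalized == "completed":
        return "Collect the model from the Training Output path."
    if normalized == "failed":
        return "Open the job details page and check the logs."
    # Other statuses are not described in this section.
    return "Keep monitoring the training job list."


for status in ("Initializing", "Completed", "Failed"):
    print(status, "->", next_step(status))
```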