Creating a Training Job
ModelArts training management enables you to create training jobs, view training statuses, and manage job versions. Model training is an iterative optimization process. Through unified training management, you can flexibly select algorithms, data, and hyperparameters to obtain the optimal input configuration and model. After comparing metrics between training versions, you can determine the most satisfactory training job.
Prerequisites
Training data is available. You can create a dataset in ModelArts or upload training data to the OBS directory.
You have created an algorithm either by using a preset image (Using a Preset Image (Custom Script)) or using a custom image (Using a Custom Image).
At least one empty folder has been created in OBS for storing the training output. The OBS bucket must not be encrypted, because ModelArts does not support encrypted OBS buckets. When creating an OBS bucket, do not enable bucket encryption.
Access authorization has been configured. For details, see Configuring Access Authorization (Global Configuration).
Creating a Training Job
Log in to the ModelArts management console.
In the navigation pane, choose Training Management > Training Jobs. The training job list is displayed.
Click Create Training Job. Then, configure parameters.
Parameter
Description
Name
Name of a training job.
The system automatically generates a name. You can rename it based on the following naming rules:
The name contains 1 to 64 characters.
Letters, digits, hyphens (-), and underscores (_) are allowed.
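The naming rules above can be checked before a job is submitted. The following is a minimal sketch, not part of ModelArts: the function name is_valid_job_name is hypothetical, and "letters" is assumed to mean ASCII letters only.

```python
import re

# Rules from the Name parameter: 1 to 64 characters; letters, digits,
# hyphens (-), and underscores (_) only.
# Assumption: "letters" means ASCII letters A-Z and a-z.
_JOB_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_job_name(name: str) -> bool:
    """Return True if name satisfies the documented naming rules."""
    return _JOB_NAME_PATTERN.fullmatch(name) is not None
```

For example, is_valid_job_name("resnet50-run_01") passes, while a name containing a space or more than 64 characters is rejected.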
Description
Description of a training job.
Parameter
Sub-Parameter
Description
Algorithm Type > Custom algorithm > Boot Mode
Preset image
If Boot Mode is set to Preset image, select a preset engine and configure the code directory and boot file.
Code Directory: Select the code directory required for this training job. Upload code to the OBS bucket in advance. The total size of files in the directory cannot exceed 5 GB, the number of files cannot exceed 1,000, and the folder depth cannot exceed 32.
Boot File: Select the Python boot script in the code directory. The boot file must be a .py file because ModelArts supports only boot files written in Python.
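The code directory limits above can be checked against a local copy of the code before it is uploaded to OBS. This is a rough sketch under stated assumptions: check_code_directory is a hypothetical helper, 5 GB is interpreted as 5 GiB, and depth is counted with the code directory itself as level 0. The way ModelArts itself counts size and depth may differ.

```python
import os
from typing import List

MAX_TOTAL_BYTES = 5 * 1024 ** 3  # 5 GB, interpreted here as 5 GiB (assumption)
MAX_FILE_COUNT = 1000
MAX_DEPTH = 32


def check_code_directory(
    root: str,
    max_bytes: int = MAX_TOTAL_BYTES,
    max_files: int = MAX_FILE_COUNT,
    max_depth: int = MAX_DEPTH,
) -> List[str]:
    """Return a list of limit violations for a local code directory."""
    root = os.path.abspath(root)
    total_bytes = 0
    file_count = 0
    deepest = 0
    for current, _dirs, files in os.walk(root):
        rel = os.path.relpath(current, root)
        # Assumption: the root is depth 0, each nested folder adds 1.
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        deepest = max(deepest, depth)
        for name in files:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(current, name))

    problems = []
    if total_bytes > max_bytes:
        problems.append(f"total size {total_bytes} bytes exceeds {max_bytes}")
    if file_count > max_files:
        problems.append(f"{file_count} files exceeds {max_files}")
    if deepest > max_depth:
        problems.append(f"folder depth {deepest} exceeds {max_depth}")
    return problems
```

An empty list means that none of the three limits was exceeded under these assumptions.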
Algorithm Type > Custom algorithm > Boot Mode
Custom image
If Boot Mode is set to Custom image, specify the image, code directory, and boot command.
Code Directory: Select the code directory required for this training job. This parameter is optional.
Take OBS path obs://obs-bucket/training-test/demo-code as an example. The content in the OBS path will be automatically downloaded to ${MA_JOB_DIR}/demo-code in the training container, and demo-code (customizable) is the last-level directory of the OBS path.
User ID: User ID for running the container. The default value 1000 is recommended. This parameter is optional.
If the UID needs to be specified, its value must be within the specified range. The UID ranges of different resource pools are as follows:
Public resource pool: 1000 to 65535
Dedicated resource pool: 0 to 65535
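The UID ranges above can be expressed as a small lookup. This is an illustrative sketch only; the pool type keys "public" and "dedicated" and the function name is_valid_uid are hypothetical.

```python
# UID ranges from the list above (inclusive bounds).
UID_RANGES = {
    "public": (1000, 65535),     # public resource pool
    "dedicated": (0, 65535),     # dedicated resource pool
}
DEFAULT_UID = 1000  # recommended default


def is_valid_uid(uid: int, pool_type: str) -> bool:
    """Return True if uid is within the documented range for pool_type."""
    low, high = UID_RANGES[pool_type]
    return low <= uid <= high
```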
Boot Command: Enter the image boot command. This parameter is mandatory. The boot command will be automatically executed after the code directory is downloaded.
If the training boot script is a .py file, train.py for example, the boot command can be python ${MA_JOB_DIR}/demo-code/train.py.
If the training boot script is an .sh file, main.sh for example, the boot command can be bash ${MA_JOB_DIR}/demo-code/main.sh.
Semicolons (;) and double ampersands (&&) can be used to combine multiple boot commands, but line breaks are not supported. demo-code (customizable) in the boot command is the last-level directory of the OBS path.
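The relationship between the OBS code directory and the boot command can be sketched as follows. The helper names container_code_dir, boot_command, and combine_commands are hypothetical; the code only builds strings following the pattern described above and does not call any ModelArts API.

```python
import posixpath


def container_code_dir(obs_code_dir: str) -> str:
    """Map an OBS code directory to its documented location in the container."""
    # The last-level directory of the OBS path is kept under ${MA_JOB_DIR}.
    last_level = posixpath.basename(obs_code_dir.rstrip("/"))
    return "${MA_JOB_DIR}/" + last_level


def boot_command(obs_code_dir: str, script: str) -> str:
    """Build a single-line boot command for a .py or .sh boot script."""
    if script.endswith(".py"):
        runner = "python"
    elif script.endswith(".sh"):
        runner = "bash"
    else:
        raise ValueError(f"unsupported boot script type: {script}")
    return f"{runner} {container_code_dir(obs_code_dir)}/{script}"


def combine_commands(*commands: str) -> str:
    """Join commands with && and reject line breaks, which are not supported."""
    for command in commands:
        if "\n" in command or "\r" in command:
            raise ValueError("boot commands must not contain line breaks")
    return " && ".join(commands)
```

With the example path obs://obs-bucket/training-test/demo-code and the script train.py, boot_command produces the same command shown above for a .py boot script.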
Algorithm Type > Custom algorithm
Local Code Directory
You can specify a local directory in the training container. When a training job starts, the system automatically downloads the code directory to this directory.
The default local code directory is /home/ma-user/modelarts/user-job-dir. This parameter is optional.
Algorithm Type > Custom algorithm
Work Directory
Set the directory where the boot file in the training container is located. When a training job starts, the system automatically runs the cd command to change the work directory to the specified directory.
Created By
My algorithms
Select an algorithm or create an algorithm. For details, see Creating an Algorithm.
Parameter
Sub-Parameter
Description
Input
Name
The recommended value is data_url. The training input must match the data input configuration set in your selected algorithm. For details, see Table 2.
For example, if you use argparse in the training code to parse data_url into the data input, set the data input parameter to data_url when creating the algorithm.
You can select a dataset or data path for data input. When the training job is started, ModelArts automatically downloads the data in the input path to the container directory for training.
Dataset
Select an available dataset and its version from the ModelArts Data Management module.
Click Dataset and select the target dataset and its version in the dialog box displayed.
Note
If Dataset is unavailable, the training data of the selected algorithm cannot be from a dataset.
Data path
Select the training data from your OBS bucket.
Click Data path and select the OBS bucket and folder in the dialog box displayed.
Note
If Data path is unavailable, the training data of the selected algorithm cannot be from a data path.
Obtained from
The following uses training input data_path as an example.
If you select Hyperparameters, do as follows to obtain the training input:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--data_path')
args, unknown = parser.parse_known_args()
data_path = args.data_path
If you select Environment variables, do as follows to obtain the training input:
import os
data_path = os.getenv("data_path", "")
Output
Name
The algorithm code reads the local path to the training output based on this parameter.
The recommended value is train_url. The training output must match the data output configuration set in your selected algorithm. For details, see Table 3.
For example, if you use argparse in the algorithm code to parse train_url into the data output, set the data output parameter to train_url when creating the algorithm.
You can select an OBS path for data output. During training, ModelArts automatically uploads the training output to the OBS path.
Data path
This data path stores the training output. During and after the training, the system automatically synchronizes files from the local directory to the data path. Currently, only OBS paths can be set as the data path.
Select the storage path of the training result (OBS path). To minimize errors, select an empty directory.
Obtained from
The following uses the training output train_url as an example.
Obtain the training output from hyperparameters by using the following code:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--train_url')
args, unknown = parser.parse_known_args()
train_url = args.train_url
Obtain the training output from environment variables by using the following code:
import os
train_url = os.getenv("train_url", "")
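If the same training script should work regardless of whether Obtained from is set to Hyperparameters or Environment variables, the two approaches can be combined. The following is a sketch under that assumption; resolve_path is a hypothetical helper, not a ModelArts function.

```python
import argparse
import os
from typing import List, Optional


def resolve_path(name: str, argv: Optional[List[str]] = None) -> str:
    """Read a path such as data_url or train_url.

    Tries the command-line option --<name> first, then falls back to the
    environment variable <name>. Returns "" if neither is set.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(f"--{name}", default=None)
    # parse_known_args ignores other hyperparameters passed to the script.
    args, _unknown = parser.parse_known_args(argv)
    value = getattr(args, name)
    if value is None:
        value = os.getenv(name, "")
    return value
```

In a training script, data_url = resolve_path("data_url") and train_url = resolve_path("train_url") would then read the input and output paths from whichever source is populated.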
Predownload
If you set Predownload to Yes, the system automatically downloads the files in the training output data path to a local directory of the training container before the training job is started. Select Yes for resumable training and incremental training.
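With Predownload enabled, a training script can look for an earlier checkpoint in the local output directory and resume from it. The sketch below is illustrative only: the checkpoint file naming (checkpoint_<step>.pt) and the function find_latest_checkpoint are hypothetical, and loading the checkpoint depends on the framework you use.

```python
import os
import re
from typing import Optional

# Hypothetical naming scheme for checkpoints: checkpoint_<step>.pt
_CKPT_PATTERN = re.compile(r"^checkpoint_(\d+)\.pt$")


def find_latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint with the highest step in output_dir, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step = -1
    best_path = None
    for name in os.listdir(output_dir):
        match = _CKPT_PATTERN.match(name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(output_dir, name)
    return best_path
```

If the function returns None, the script starts training from scratch; otherwise it loads the returned file before continuing.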
Hyperparameters
None
The value of this parameter varies according to the selected algorithm.
If you have defined hyperparameters when creating an algorithm, all hyperparameters of the algorithm are displayed. Whether hyperparameters can be modified or deleted depends on how you configure the constraints when creating the algorithm. For details, see Defining Hyperparameters.
Environment Variable
None
Environment variables, which you can add as required. For details about the environment variables preset in the training container, see Viewing Environment Variables of a Training Container.
Auto Restart
None
Number of retries for a failed training job. If this parameter is enabled, a failed training job will be automatically re-delivered and run. On the training job details page, you can view the number of retries for a failed training job.
This function is disabled by default.
If you enable this function, set the number of retries. The value ranges from 1 to 3 and cannot be changed.
Note
The training input, training output, and hyperparameters vary according to the selected algorithm.
If the system displays a message for Training Input, indicating there is no input channel for the selected algorithm, you do not need to set data input on this page.
If the system displays a message for Training Output, indicating there is no output channel for the selected algorithm, you do not need to set data output on this page.
If the system displays a message for Hyperparameters, indicating the selected algorithm does not support custom hyperparameters, you do not need to set hyperparameters on this page.
Select an instance flavor. The value range of the training parameters is consistent with the constraints of existing algorithms.
Parameter
Description
Resource Pool
Select resource pools for the job. Public and dedicated resource pools are available for you to select.
If you select a dedicated resource pool, you can view details about the pool. If the number of available cards of this pool is insufficient, jobs may need to be queued. In this case, use another resource pool or reduce the number of cards required.
Note
Dedicated resource pools can be interconnected with your VPCs and subnets. For details, see (Optional) Interconnecting a VPC with a ModelArts Network.
If you want to change the VPC accessible to your dedicated resource pool, see (Optional) Interconnecting a VPC with a ModelArts Network.
Resource Type
Select CPU or GPU as needed. Set this parameter based on the resource type specified in your training code.
Specifications
Select a resource flavor based on the resource type. If the type of resources to be used has been specified in your training code, only the options that comply with the constraints of the selected algorithm are available for you to choose. For example, if GPU is selected in the training code but you select CPU here, the training may fail.
During training, ModelArts mounts NVMe SSDs to the /cache directory. You can use this directory to store temporary files. The data disk size varies depending on the resource type. To prevent insufficient disk space during training, click Check Input Size to check whether the disk size of the selected instance flavor is sufficient for the input size.
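Inside the training code, temporary files can be placed under /cache, and free space can be checked before large files are written. This is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the fallback to the system temporary directory is an assumption that keeps the script usable outside a training container.

```python
import os
import shutil
import tempfile


def has_enough_space(path: str, required_bytes: int) -> bool:
    """Return True if the file system holding path has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


def make_scratch_dir(base: str = "/cache") -> str:
    """Create a temporary working directory under base.

    Assumption: if base does not exist (for example, when running outside
    a training container), fall back to the system temporary directory.
    """
    if not os.path.isdir(base):
        base = tempfile.gettempdir()
    return tempfile.mkdtemp(prefix="train_tmp_", dir=base)
```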
Compute Nodes
Set the number of compute nodes. The default value is 1.
Job Priority
When using a new-version dedicated resource pool, you can set the priority of a training job. The value ranges from 1 to 3. The default priority is 1, and the highest priority is 3. By default, the job priority can be set to 1 or 2. After the permission to set the highest job priority is configured, the priority can be set to 1 to 3.
You can change the priority of a pending job.
SFS Turbo
When a dedicated resource pool is used for training, multiple SFS Turbo file systems can be mounted for one training job.
Name: SFS Turbo name
Server Path: SFS Turbo directory
Local Path: mounting path of the SFS Turbo directory in the training job
A file system can be mounted only once and to only one path. Each mount path must be unique. A maximum of 8 disks can be mounted to a training job.
Note
Before mounting an SFS Turbo file system to a training job, configure the VPC and subnet where SFS Turbo is deployed to be accessible in the dedicated resource pool. For details, see .
The mounting path cannot be a / directory or a default mounting path, such as /cache and /home/ma-user/modelarts.
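The SFS Turbo mounting rules above can be checked before the job is submitted. The following sketch assumes each mount is given as a (file system name, local path) pair; validate_mounts is a hypothetical helper, and RESERVED_PATHS lists only the two default paths named above, so it may not be exhaustive.

```python
import posixpath
from typing import List, Tuple

MAX_MOUNTS = 8
# Only the default mounting paths named in this section (may be incomplete).
RESERVED_PATHS = ("/cache", "/home/ma-user/modelarts")


def validate_mounts(mounts: List[Tuple[str, str]]) -> List[str]:
    """Return a list of rule violations for (file_system_name, local_path) pairs."""
    problems = []
    if len(mounts) > MAX_MOUNTS:
        problems.append(f"{len(mounts)} mounts exceeds the maximum of {MAX_MOUNTS}")
    seen_file_systems = set()
    seen_paths = set()
    for name, path in mounts:
        norm = posixpath.normpath(path)
        if norm == "/" or norm in RESERVED_PATHS:
            problems.append(f"{path} is / or a default mounting path")
        if name in seen_file_systems:
            problems.append(f"file system {name} is mounted more than once")
        if norm in seen_paths:
            problems.append(f"mount path {path} is used more than once")
        seen_file_systems.add(name)
        seen_paths.add(norm)
    return problems
```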
Parallel File System
An OBS parallel file system can be mounted to a training job to store training data. Click Add Mount Configuration and set the following parameters:
Storage Configuration: Select a parallel file system.
Mount Path: Enter the cloud mounting path in the training container.
Persistent Log Saving
If you select CPU or GPU flavors, Persistent Log Saving is available for you to set.
This function is disabled by default. ModelArts automatically stores the logs for 30 days. You can download all logs on the job details page.
After this function is enabled, select an empty OBS path for storing training logs. Ensure that you have read and write permissions to the selected OBS directory.
Auto Stop
After this parameter is enabled and the auto stop time is set, a training job automatically stops at the specified time.
If this function is disabled, a training job will continue to run.
The options are 1 hour, 2 hours, 4 hours, 6 hours, and Customization (1 hour to 72 hours).
Click Submit to create the training job.
A training job generally runs for a period of time. To view the real-time status and basic information of a training job, switch to the training job list.
In the training job list, Status of the newly created training job is Pending.
When the status of a training job changes to Completed, the training job is complete, and the generated model is stored in the corresponding training output path.
If the status is Failed or Abnormal, click the job name to go to the job details page and view logs for troubleshooting. For details, see Training Job Details.