Creating a Training Job

ModelArts training management enables you to create training jobs, view training statuses, and manage job versions. Model training is an iterative optimization process. Through unified training management, you can flexibly select algorithms, data, and hyperparameters to obtain the optimal input configuration and model. After comparing metrics between training versions, you can determine the most satisfactory training job.

Prerequisites

  • Training data is available. You can create a dataset in ModelArts or upload training data to an OBS directory.

  • You have created an algorithm either by using a preset image (Using a Preset Image (Custom Script)) or using a custom image (Using a Custom Image).

  • At least one empty folder has been created in OBS for storing the training output. The OBS bucket must not be encrypted, because ModelArts does not support encrypted OBS buckets. When creating an OBS bucket, do not enable bucket encryption.

  • Access authorization has been configured. For details, see Configuring Access Authorization (Global Configuration).

Creating a Training Job

  1. Log in to the ModelArts management console.

  2. In the navigation pane, choose Training Management > Training Jobs. The training job list is displayed.

  3. Click Create Training Job. Then, configure parameters.

    Table 1 Parameters of a training job

    Parameter

    Description

    Name

    Name of a training job.

    The system automatically generates a name. You can rename it based on the following naming rules:

    • The name contains 1 to 64 characters.

    • Letters, digits, hyphens (-), and underscores (_) are allowed.

    Description

    Description of a training job.
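The naming rules in Table 1 can be checked locally before you submit a job. The following is a minimal sketch, not part of ModelArts: the function name is_valid_job_name is hypothetical, and it assumes that "letters" means ASCII letters only.

```python
import re

# Naming rules from Table 1: 1 to 64 characters; letters, digits,
# hyphens (-), and underscores (_) only.
# Assumption: "letters" is read as ASCII letters A-Z and a-z.
_JOB_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_job_name(name: str) -> bool:
    """Return True if the name satisfies the Table 1 naming rules."""
    return _JOB_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    # Hypothetical example names, used only to exercise the check.
    for candidate in ("job-resnet50_v2", "", "bad name!", "x" * 65):
        print(len(candidate), is_valid_job_name(candidate))
```

A check like this is only a convenience. The console remains the authority on which names are accepted.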

    Table 2 Algorithm parameters of a training job

    Parameter

    Sub-Parameter

    Description

    Algorithm Type > Custom algorithm > Boot Mode

    Preset image

    If Boot Mode is set to Preset image, select a preset engine and configure the code directory and boot file.

    • Code Directory: Select the code directory required for this training job. Upload code to the OBS bucket in advance. The total size of files in the directory cannot exceed 5 GB, the number of files cannot exceed 1,000, and the folder depth cannot exceed 32.

    • Boot File: Select the Python boot script in the code directory. The boot file must be a .py file because ModelArts supports only boot files written in Python.

    Algorithm Type > Custom algorithm > Boot Mode

    Custom image

    If Boot Mode is set to Custom image, specify the image, code directory, and boot command.

    • Code Directory: Select the code directory required for this training job. This parameter is optional.

      Take OBS path obs://obs-bucket/training-test/demo-code as an example. The content in the OBS path will be automatically downloaded to ${MA_JOB_DIR}/demo-code in the training container, and demo-code (customizable) is the last-level directory of the OBS path.

    • User ID: User ID for running the container. The default value 1000 is recommended. This parameter is optional.

      If the UID needs to be specified, its value must be within the specified range. The UID ranges of different resource pools are as follows:

      • Public resource pool: 1000 to 65535

      • Dedicated resource pool: 0 to 65535

    • Boot Command: Enter the image boot command. This parameter is mandatory. The boot command will be automatically executed after the code directory is downloaded.

      • If the training boot script is a .py file, train.py for example, the boot command can be python ${MA_JOB_DIR}/demo-code/train.py.

      • If the training boot script is an .sh file, main.sh for example, the boot command can be bash ${MA_JOB_DIR}/demo-code/main.sh.

      Semicolons (;) and double ampersands (&&) can be used to combine multiple boot commands, but line breaks are not supported. demo-code (customizable) in the boot command is the last-level directory of the OBS path.
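The relationship between the OBS code directory and the boot command for a custom image can be sketched as follows. This is an illustration under assumptions, not a ModelArts API: the helper names container_code_dir, boot_command, and join_commands are hypothetical, and the sketch only produces the command string that you would enter in the Boot Command field.

```python
import posixpath


def container_code_dir(obs_code_dir: str, ma_job_dir: str = "${MA_JOB_DIR}") -> str:
    """Map an OBS code directory to its download location in the container.

    The last-level directory of the OBS path is appended to MA_JOB_DIR.
    """
    last_level = posixpath.basename(obs_code_dir.rstrip("/"))
    if not last_level:
        raise ValueError(f"cannot derive a directory name from {obs_code_dir!r}")
    return f"{ma_job_dir}/{last_level}"


def boot_command(obs_code_dir: str, script: str) -> str:
    """Build a single-line boot command for a .py or .sh boot script."""
    if script.endswith(".py"):
        interpreter = "python"
    elif script.endswith(".sh"):
        interpreter = "bash"
    else:
        raise ValueError("this sketch only handles .py and .sh boot scripts")
    return f"{interpreter} {container_code_dir(obs_code_dir)}/{script}"


def join_commands(*commands: str) -> str:
    """Chain commands with && because line breaks are not supported."""
    if any("\n" in command for command in commands):
        raise ValueError("boot commands must not contain line breaks")
    return " && ".join(commands)


if __name__ == "__main__":
    obs_path = "obs://obs-bucket/training-test/demo-code"
    print(boot_command(obs_path, "train.py"))
    print(boot_command(obs_path, "main.sh"))
```

With &&, a later command runs only if the earlier one succeeds. With a semicolon, it runs regardless of the earlier result.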

    Algorithm Type > Custom algorithm

    Local Code Directory

    You can specify the local directory of a training container. When a training job starts, the system automatically downloads the code directory to this directory.

    The default local code directory is /home/ma-user/modelarts/user-job-dir. This parameter is optional.

    Algorithm Type > Custom algorithm

    Work Directory

    Set the directory where the boot file in the training container is located. When a training job starts, the system automatically runs the cd command to change the work directory to the specified directory.

    Created By

    My algorithms

    Select an algorithm or create an algorithm. For details, see Creating an Algorithm.
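Table 2 limits the code directory for the Preset image boot mode to 5 GB in total, 1,000 files, and a folder depth of 32. The following sketch checks a local copy of the code before you upload it to OBS. It is an illustration only: check_code_dir is a hypothetical helper, 5 GB is read as 5 GiB, and the depth is counted from the code directory itself (depth 0), which may differ from how ModelArts counts it.

```python
import os
from typing import List

MAX_TOTAL_BYTES = 5 * 1024 ** 3  # 5 GB, read here as 5 GiB (assumption)
MAX_FILES = 1000
MAX_DEPTH = 32  # assumption: the code directory itself is depth 0


def check_code_dir(root: str) -> List[str]:
    """Return a list of limit violations for a local code directory."""
    root = os.path.abspath(root)
    total_bytes = 0
    file_count = 0
    deepest = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        relative = os.path.relpath(dirpath, root)
        depth = 0 if relative == "." else relative.count(os.sep) + 1
        deepest = max(deepest, depth)
        for name in filenames:
            file_count += 1
            # lstat does not follow symlinks, so broken links do not raise.
            total_bytes += os.lstat(os.path.join(dirpath, name)).st_size

    problems = []
    if total_bytes > MAX_TOTAL_BYTES:
        problems.append(f"total size is {total_bytes} bytes (limit {MAX_TOTAL_BYTES})")
    if file_count > MAX_FILES:
        problems.append(f"{file_count} files found (limit {MAX_FILES})")
    if deepest > MAX_DEPTH:
        problems.append(f"folder depth is {deepest} (limit {MAX_DEPTH})")
    return problems


if __name__ == "__main__":
    for problem in check_code_dir("."):
        print("Limit exceeded:", problem)
```

An empty result means that none of the three limits is exceeded under the assumptions stated above.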

    Table 3 Parameters of training input and output

    Parameter

    Sub-Parameter

    Description

    Input

    Name

    The recommended value is data_url. The training input must match the data input configuration set in your selected algorithm. For details, see Table 2.

    For example, if you use argparse in the training code to parse data_url into the data input, set the data input parameter to data_url when creating the algorithm.

    You can select a dataset or data path for data input. When the training job is started, ModelArts automatically downloads the data in the input path to the container directory for training.

    Dataset

    Select an available dataset and its version from the ModelArts Data Management module.

    Click Dataset and select the target dataset and its version in the dialog box displayed.

    Note

    If Dataset is unavailable, the selected algorithm does not accept a dataset as its training data source.

    Data path

    Select the training data from your OBS bucket.

    Click Data path and select the OBS bucket and folder in the dialog box displayed.

    Note

    If Data path is unavailable, the selected algorithm does not accept a data path as its training data source.

    Obtained from

    The following uses training input data_path as an example.

    If you select Hyperparameters, do as follows to obtain the training input:

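    import argparse

    # With Hyperparameters selected, the input path arrives as the --data_path argument.
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_path')
    # parse_known_args() ignores arguments that this parser does not define.
    args, unknown = parser.parse_known_args()
    data_path = args.data_path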
    

    If you select Environment variables, do as follows to obtain the training input:

    import os
    # Falls back to an empty string if the variable is not set.
    data_path = os.getenv("data_path", "")
    

    Output

    Name

    The algorithm code reads the local path to the training output based on this parameter.

    The recommended value is train_url. The training output must match the data output configuration set in your selected algorithm. For details, see Table 3.

    For example, if you use argparse in the algorithm code to parse train_url into the data output, set the data output parameter to train_url when creating the algorithm.

    You can select an OBS path for data output. During training, ModelArts automatically uploads the training output to the OBS path.

    Data path

    This data path stores the training output. During and after the training, the system automatically synchronizes files from the local directory to the data path. Currently, only OBS paths can be set as the data path.

    Select the storage path of the training result (OBS path). To minimize errors, select an empty directory.

    Obtained from

    The following uses the training output train_url as an example.

    Obtain the training output from hyperparameters by using the following code:

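    import argparse

    # With Hyperparameters selected, the output path arrives as the --train_url argument.
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_url')
    # parse_known_args() ignores arguments that this parser does not define.
    args, unknown = parser.parse_known_args()
    train_url = args.train_url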
    

    Obtain the training output from environment variables by using the following code:

    import os
    # Falls back to an empty string if the variable is not set.
    train_url = os.getenv("train_url", "")
    

    Predownload

    If you set Predownload to Yes, the system automatically downloads the files in the training output data path to a local directory of the training container before the training job is started. Select Yes for resumable training and incremental training.

    Hyperparameters

    None

    The value of this parameter varies according to the selected algorithm.

    If you have defined hyperparameters when creating an algorithm, all hyperparameters of the algorithm are displayed. Whether hyperparameters can be modified or deleted depends on how you configure the constraints when creating the algorithm. For details, see Defining Hyperparameters.

    Environment Variable

    None

    Environment variables, which you can add as required. For details about the environment variables preset in the training container, see Viewing Environment Variables of a Training Container.

    Auto Restart

    None

    Number of retries for a failed training job. If this parameter is enabled, a failed training job will be automatically re-delivered and run. On the training job details page, you can view the number of retries for a failed training job.

    • This function is disabled by default.

    • If you enable this function, set the number of retries. The value ranges from 1 to 3 and cannot be changed.

    Note

    The training input, training output, and hyperparameters vary according to the selected algorithm.

    If the system displays a message for Training Input, indicating there is no input channel for the selected algorithm, you do not need to set data input on this page.

    If the system displays a message for Training Output, indicating there is no output channel for the selected algorithm, you do not need to set data output on this page.

    If the system displays a message for Hyperparameters, indicating the selected algorithm does not support custom hyperparameters, you do not need to set hyperparameters on this page.
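The input, output, and Predownload settings in Table 3 can be combined in a training script roughly as follows. This is a sketch under assumptions rather than the required structure: resolve_channel and latest_checkpoint are hypothetical helpers, the channel names data_url and train_url are the recommended values from Table 3, and the .ckpt suffix and the "most recently modified file" rule are illustrative choices that depend on your framework.

```python
import argparse
import os
from typing import List, Optional


def resolve_channel(name: str, argv: Optional[List[str]] = None) -> str:
    """Read a channel path from --<name>, falling back to the env var <name>."""
    parser = argparse.ArgumentParser()
    parser.add_argument(f"--{name}", default=None)
    # Unknown arguments (other hyperparameters) are ignored.
    args, _unknown = parser.parse_known_args(argv)
    value = getattr(args, name)
    return value if value else os.getenv(name, "")


def latest_checkpoint(output_dir: str, suffix: str = ".ckpt") -> Optional[str]:
    """Return the most recently modified checkpoint in output_dir, if any.

    With Predownload set to Yes, files already in the output data path are
    downloaded to the local output directory before training starts, so an
    earlier checkpoint may be present here.
    """
    if not output_dir or not os.path.isdir(output_dir):
        return None
    candidates = [
        os.path.join(output_dir, entry)
        for entry in os.listdir(output_dir)
        if entry.endswith(suffix)
    ]
    return max(candidates, key=os.path.getmtime) if candidates else None


if __name__ == "__main__":
    data_url = resolve_channel("data_url")
    train_url = resolve_channel("train_url")
    checkpoint = latest_checkpoint(train_url)
    if checkpoint:
        print("Resuming from", checkpoint)
    else:
        print("Starting from scratch; input:", data_url, "output:", train_url)
```

Reading the argument first and the environment variable second lets the same script run whichever Obtained from option is selected for the channel.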

  4. Select an instance flavor. The value range of the training parameters is consistent with the constraints of existing algorithms.

    Table 4 Resource parameters

    Parameter

    Description

    Resource Pool

    Select resource pools for the job. Public and dedicated resource pools are available for you to select.

    If you select a dedicated resource pool, you can view details about the pool. If the number of available cards of this pool is insufficient, jobs may need to be queued. In this case, use another resource pool or reduce the number of cards required.

    Note

    Dedicated resource pools can be connected to your VPCs and subnets. For details, see (Optional) Interconnecting a VPC with a ModelArts Network.

    If you want to change the VPC accessible to your dedicated resource pool, see (Optional) Interconnecting a VPC with a ModelArts Network.

    Resource Type

    Select CPU or GPU as needed. Set this parameter based on the resource type specified in your training code.

    Specifications

    Select a resource flavor based on the resource type. If the type of resources to be used has been specified in your training code, only the options that comply with the constraints of the selected algorithm are available for you to choose. For example, if GPU is selected in the training code but you select CPU here, the training may fail.

    During training, ModelArts will mount NVMe SSDs to the /cache directory. You can use this directory to store temporary files. The data disk size varies depending on the resource type. To avoid running out of disk space during training, click Check Input Size to check whether the disk size of the selected instance flavor is sufficient for the input size.

    Compute Nodes

    Set the number of compute nodes. The default value is 1.

    Job Priority

    When using a new-version dedicated resource pool, you can set the priority of a training job. The value ranges from 1 to 3. The default priority is 1, and the highest priority is 3. By default, the job priority can be set to 1 or 2. After the permission to set the highest job priority is configured, the priority can be set to 1 to 3.

    You can change the priority of a pending job.

    SFS Turbo

    When a dedicated resource pool is used for training, multiple SFS Turbo file systems can be mounted for one training job.

    • Name: SFS Turbo name

    • Server Path: SFS Turbo directory

    • Local Path: mounting path of the SFS Turbo directory in the training job

    A file system can be mounted only once and to only one path. Each mount path must be unique. A maximum of 8 disks can be mounted to a training job.

    Note

    • Before mounting an SFS Turbo file system to a training job, configure the VPC and subnet where SFS Turbo is deployed to be accessible in the dedicated resource pool. For details, see .

    • The mounting path cannot be a / directory or a default mounting path, such as /cache and /home/ma-user/modelarts.
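The SFS Turbo mounting rules above can be checked before you fill in the form. The following is a minimal sketch with hypothetical names (validate_mounts, RESERVED_PATHS). It assumes that only the two default mounting paths named in the note are reserved and that only exact matches are rejected; the actual list of default paths may be longer.

```python
import posixpath
from typing import List, Tuple

MAX_MOUNTS = 8
# Default mounting paths named in the note; the real list may be longer.
RESERVED_PATHS = ("/cache", "/home/ma-user/modelarts")


def validate_mounts(mounts: List[Tuple[str, str]]) -> List[str]:
    """Check (file_system_name, local_path) pairs against the mounting rules."""
    problems = []
    if len(mounts) > MAX_MOUNTS:
        problems.append(f"{len(mounts)} mounts requested (limit {MAX_MOUNTS})")

    seen_names = set()
    seen_paths = set()
    for name, path in mounts:
        normalized = posixpath.normpath(path)  # "/cache/" becomes "/cache"
        if name in seen_names:
            problems.append(f"file system {name!r} is mounted more than once")
        if normalized in seen_paths:
            problems.append(f"mount path {normalized!r} is used more than once")
        if normalized == "/" or normalized in RESERVED_PATHS:
            problems.append(f"mount path {normalized!r} is not allowed")
        seen_names.add(name)
        seen_paths.add(normalized)
    return problems


if __name__ == "__main__":
    # Hypothetical file system names and paths.
    print(validate_mounts([("sfs-a", "/mnt/a"), ("sfs-a", "/cache")]))
```

Whether subdirectories of the default mounting paths are also rejected is not stated here, so the sketch does not check for them.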

    Parallel File System

    An OBS parallel file system can be mounted to a training job to store training data. Click Add Mount Configuration and set the following parameters:

    • Storage Configuration: Select a parallel file system.

    • Mount Path: Enter the cloud mounting path in the training container.

    Persistent Log Saving

    If you select CPU or GPU flavors, Persistent Log Saving is available for you to set.

    This function is disabled by default. ModelArts automatically stores the logs for 30 days. You can download all logs on the job details page.

    After this function is enabled, select an empty OBS path for storing training logs. Ensure that you have read and write permissions to the selected OBS directory.

    Auto Stop

    • After this parameter is enabled and the auto stop time is set, a training job automatically stops at the specified time.

    • If this function is disabled, a training job will continue to run.

    • The options are 1 hour, 2 hours, 4 hours, 6 hours, and Customization (1 hour to 72 hours).

  5. Click Submit to create the training job.

    A training job generally runs for a period of time. To view the real-time status and basic information of a training job, switch to the training job list.

    • In the training job list, Status of the newly created training job is Pending.

    • When the status of a training job changes to Completed, the training job is complete, and the generated model is stored in the corresponding training output path.

    • If the status is Failed or Abnormal, click the job name to go to the job details page and view logs for troubleshooting. For details, see Training Job Details.