Terminating a Training Job
Function
This API is used to terminate a training job. Only jobs in the creating, awaiting, or running state can be terminated.
URI
POST /v2/{project_id}/training-jobs/{training_job_id}/actions
Parameter | Mandatory | Type | Description |
---|---|---|---|
project_id | Yes | String | Project ID. For details, see Obtaining a Project ID and Name. |
training_job_id | Yes | String | Training job ID. For details about how to obtain the value, see Querying the Training Job List. |
Request Parameters
Parameter | Mandatory | Type | Description |
---|---|---|---|
action_type | Yes | String | Operation request for a training job. If this parameter is set to terminate, the training job is terminated. |
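As a minimal sketch of how the URI and the request body fit together, the helper below fills in the two path parameters and builds the JSON body. The function name `build_terminate_request` and the `endpoint` argument are illustrative assumptions, not part of the API.

```python
import json
from urllib.parse import quote


def build_terminate_request(endpoint, project_id, training_job_id):
    """Return (url, body) for the terminate action.

    `endpoint` is a placeholder host such as "https://endpoint".
    This helper is hypothetical and not provided by ModelArts.
    """
    if not project_id or not training_job_id:
        raise ValueError("project_id and training_job_id are mandatory")
    # Percent-encode the path parameters so they are safe inside the URI.
    path = "/v2/{}/training-jobs/{}/actions".format(
        quote(project_id, safe=""), quote(training_job_id, safe="")
    )
    # action_type is the only body parameter; "terminate" terminates the job.
    body = json.dumps({"action_type": "terminate"})
    return endpoint.rstrip("/") + path, body


url, body = build_terminate_request(
    "https://endpoint", "my-project-id", "cf63aba9-63b1-4219-b717-708a2665100b"
)
print(url)
print(body)
```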
Response Parameters
Status code: 202
Parameter | Type | Description |
---|---|---|
kind | String | Training job type, which is job by default. |
metadata | JobMetadata object | Metadata of a training job. |
status | Status object | Status of a training job. You do not need to set this parameter when creating a job. |
algorithm | JobAlgorithmResponse object | Algorithm used by a training job. |
tasks | Array of TaskResponse objects | List of tasks in heterogeneous training jobs. |
spec | SpecResponce object | Specifications of a training job. |
endpoints | JobEndpointsResp object | Configurations required for remotely accessing a training job. |
Parameter | Type | Description |
---|---|---|
id | String | Training job ID, which is generated and returned by ModelArts after the training job is created. |
name | String | Name of a training job. The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-). |
workspace_id | String | Workspace where a job is located. The default value is 0. |
description | String | Training job description. The value must contain 0 to 256 characters. The default value is NULL. |
create_time | Long | Time when a training job was created, in milliseconds. The value is generated and returned by ModelArts after a training job is created. |
user_name | String | Username for creating a training job. The username is generated and returned by ModelArts after a training job is created. |
annotations | Map<String,String> | Advanced configurations of a training job. |
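Two of the metadata fields above lend themselves to a quick client-side check. The sketch below validates a job name against the stated 1 to 64 character rule and converts `create_time` from milliseconds to a UTC datetime. The helper names are hypothetical, and treating "letters" as ASCII letters only is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone

# 1 to 64 characters: digits, letters, underscores, and hyphens.
# Assumption: "letters" means ASCII letters only.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_job_name(name):
    """Check a job name against the documented length and character rule."""
    return NAME_PATTERN.fullmatch(name) is not None


def create_time_to_utc(create_time_ms):
    """Convert create_time (milliseconds since the Unix epoch) to UTC."""
    seconds, millis = divmod(create_time_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(
        milliseconds=millis
    )


print(is_valid_job_name("trainjob--py14_mem06-110"))
print(create_time_to_utc(1636515222282).isoformat())
```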
Parameter | Type | Description |
---|---|---|
phase | String | Level-1 status of a training job. |
secondary_phase | String | Level-2 status of a training job. This is an internal detailed status whose values may be added, modified, or deleted, so do not depend on it. |
duration | Long | Running duration of a training job, in milliseconds. |
node_count_metrics | Array<Array<Integer>> | Node count changes during the training job running period. |
tasks | Array of strings | Tasks of a training job. |
start_time | Long | Start time of a training job. The value is in timestamp format. |
task_statuses | Array of TaskStatuses objects | Status of a training job task. |
running_records | Array of RunningRecord objects | Running and fault recovery records of a training job. |
Parameter | Type | Description |
---|---|---|
task | String | Task of a training job. |
exit_code | Integer | Exit code of a training job task. |
message | String | Error message of a training job task. |
Parameter | Type | Description |
---|---|---|
start_at | Integer | Unix timestamp of the start time in the current running record, in seconds. |
end_at | Integer | Unix timestamp of the end time in the current running record, in seconds. |
start_type | String | Startup mode of the current running record. |
end_reason | String | Reason why the current running record ends. |
end_related_task | String | ID of the task worker that causes the end of the current running record, for example, worker-0. |
end_recover | String | Fault tolerance policy used after the current running record ends. |
end_recover_before_downgrade | String | Tolerance policy used after the current running record ends and before the fault tolerance policy is degraded. The options are the same as those of end_recover. |
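Because `start_at` and `end_at` are Unix timestamps in seconds, the length of a running record is a plain subtraction. The sketch below assumes that a record that has not ended yet may lack `end_at`; that behavior and the sample values are assumptions for illustration.

```python
def running_record_seconds(record):
    """Return the length of one running record in seconds.

    Returns None when either timestamp is missing (assumed to mean
    that the record has not started or has not ended yet).
    """
    start_at = record.get("start_at")
    end_at = record.get("end_at")
    if start_at is None or end_at is None:
        return None
    return end_at - start_at


# Hypothetical record; the values are made up for illustration.
sample_record = {"start_at": 1636515300, "end_at": 1636515390}
print(running_record_seconds(sample_record))
```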
Parameter | Type | Description |
---|---|---|
id | String | Algorithm used by a training job. |
name | String | Algorithm name. |
subscription_id | String | Subscription ID of a subscribed algorithm, which must be used with item_version_id. |
item_version_id | String | Version ID of the subscribed algorithm, which must be used with subscription_id. |
code_dir | String | Code directory of a training job, for example, /usr/app/. This parameter must be set together with boot_file. If id or subscription_id+item_version_id has been set, you do not need to set this parameter. |
boot_file | String | Boot file of a training job, which must be stored in the code directory, for example, /usr/app/boot.py. This parameter must be set together with code_dir. If id or subscription_id+item_version_id has been set, you do not need to set this parameter. |
autosearch_config_path | String | YAML configuration path of an auto search job. An OBS URL is required. For example, obs://bucket/file.yaml. |
autosearch_framework_path | String | Framework code directory of auto search jobs. An OBS URL is required. For example, obs://bucket/files/. |
command | String | Boot command for starting the container of a custom image for a training job. For example, python train.py. |
parameters | Array of Parameter objects | Running parameters of a training job. |
policies | policies object | Policies supported by jobs. |
inputs | Array of Input objects | Input of a training job. |
outputs | Array of Output objects | Output of a training job. |
engine | JobEngine object | Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm in algorithm management, or subscription_id+item_version_id of the subscribed algorithm. |
local_code_dir | String | Local directory of the training container to which the algorithm code directory is downloaded. |
working_dir | String | Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode. |
environments | Array of Map<String,String> objects | Environment variables of a training job. The format is key:value. Leave this parameter blank. |
summary | Summary object | Visualization log summary. |
Parameter | Type | Description |
---|---|---|
name | String | Parameter name. |
value | String | Parameter value. |
description | String | Parameter description. |
constraint | constraint object | Parameter constraint. |
i18n_description | i18n_description object | Internationalization description. |
Parameter | Type | Description |
---|---|---|
type | String | Parameter type. |
editable | Boolean | Whether the parameter is editable. |
required | Boolean | Whether the parameter is mandatory. |
sensitive | Boolean | Whether the parameter is sensitive. This function is not implemented currently. |
valid_type | String | Valid type. |
valid_range | Array of strings | Valid range. |
Parameter | Type | Description |
---|---|---|
language | String | International language. |
description | String | Description of an international language. |
Parameter | Type | Description |
---|---|---|
auto_search | auto_search object | Hyperparameter search configuration. |
Parameter | Type | Description |
---|---|---|
skip_search_params | String | Hyperparameters to skip during the search. |
reward_attrs | Array of reward_attrs objects | List of search metrics. |
search_params | Array of search_params objects | Search parameters. |
algo_configs | Array of algo_configs objects | Search algorithm configurations. |
Parameter | Type | Description |
---|---|---|
name | String | Metric name. |
mode | String | Search mode. |
regex | String | Regular expression of a metric. |
Parameter | Type | Description |
---|---|---|
name | String | Hyperparameter name. |
param_type | String | Parameter type. |
lower_bound | String | Lower bound of the hyperparameter. |
upper_bound | String | Upper bound of the hyperparameter. |
discrete_points_num | String | Number of discrete points of a continuous hyperparameter. |
discrete_values | Array of strings | List of discrete hyperparameter values. |
Parameter | Type | Description |
---|---|---|
name | String | Name of the search algorithm. |
params | Array of AutoSearchAlgoConfigParameter objects | Search algorithm parameters. |
Parameter | Type | Description |
---|---|---|
key | String | Parameter key. |
value | String | Parameter value. |
type | String | Parameter type. |
Parameter | Type | Description |
---|---|---|
name | String | Name of the data input channel. |
description | String | Description of the data input channel. |
local_dir | String | Local directory of the container to which the data input channel is mapped, for example, /home/ma-user/modelarts/inputs/data_url_0. |
remote | InputDataInfo object | Information about the data input. |
remote_constraint | Array of remote_constraint objects | Data input constraint. |
Parameter | Type | Description |
---|---|---|
dataset | dataset object | Dataset as the data input. |
obs | obs object | OBS in which data input and output are stored. |
Parameter | Type | Description |
---|---|---|
id | String | Dataset ID of a training job. |
version_id | String | Dataset version ID of a training job. |
obs_url | String | OBS URL of the dataset for a training job. It is automatically parsed by ModelArts based on the dataset ID and dataset version ID. For example, /usr/data/. |
Parameter | Type | Description |
---|---|---|
obs_url | String | OBS URL of the dataset required by a training job. For example, /usr/data/. |
Parameter | Type | Description |
---|---|---|
data_type | String | Data input type, including the data storage location and dataset. |
attributes | String | Attributes used when a dataset is the data input. |
Parameter | Type | Description |
---|---|---|
name | String | Name of the data output channel. |
description | String | Description of the data output channel. |
local_dir | String | Local directory of the container to which the data output channel is mapped. |
remote | Remote object | Description of the actual data output. |
Parameter | Type | Description |
---|---|---|
engine_id | String | Engine ID selected for a training job. The value can be engine_id, engine_name + engine_version, or image_url. |
engine_name | String | Name of the engine selected for a training job. If engine_id has been set, you do not need to set this parameter. If you use a preset framework and custom image to create a training job, you must set both this parameter and image_url. |
engine_version | String | Version of the engine selected for a training job. If engine_id has been set, you do not need to set this parameter. |
image_url | String | Custom image URL selected for a training job. The URL is obtained from SWR. You can select an image or enter an image in the format of "Organization name/Image name:tag". |
install_sys_packages | Boolean | Whether to install the MoXing version specified by the training platform. The value true installs the specified MoXing version. This parameter is available only when engine_name, engine_version, and image_url are set. |
Parameter | Type | Description |
---|---|---|
log_type | String | Visualization log type of a training job. After this parameter is configured, the training job can be used as the data source of a visualization job. |
log_dir | LogDir object | Visualization log output of a training job. This parameter is mandatory when log_type is not empty. |
data_sources | Array of DataSource objects | Visualization log input of a visualization job or debug training job. This parameter is mandatory when tensorboard/enable or mindstudio-insight/enable is set to true for advanced training functions. |
Parameter | Type | Description |
---|---|---|
pfs | PFSSummary object | Output of an OBS parallel file system. |
Parameter | Type | Description |
---|---|---|
pfs_path | String | URL of an OBS parallel file system. |
Parameter | Type | Description |
---|---|---|
job | JobSummary object | Job data source. |
Parameter | Type | Description |
---|---|---|
job_id | String | Training job ID. |
Parameter | Type | Description |
---|---|---|
role | String | Task role. This function is not supported currently. |
algorithm | TaskResponseAlgorithm object | Algorithm management and configuration. |
task_resource | FlavorResponse object | Flavors of a training job or an algorithm. |
Parameter | Type | Description |
---|---|---|
code_dir | String | Absolute path of the directory where the algorithm boot file is stored. |
boot_file | String | Absolute path of the algorithm boot file. |
inputs | AlgorithmInput object | Algorithm input channel. |
outputs | AlgorithmOutput object | Algorithm output channel. |
engine | AlgorithmEngine object | Engine on which a heterogeneous job depends. |
local_code_dir | String | Local directory of the training container to which the algorithm code directory is downloaded. |
working_dir | String | Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode. |
Parameter | Type | Description |
---|---|---|
name | String | Name of the data input channel. |
local_dir | String | Local path of the container to which the data input and output channels are mapped. |
remote | AlgorithmRemote object | Actual data input, which can only be OBS for heterogeneous jobs. |
Parameter | Type | Description |
---|---|---|
obs | RemoteObs object | OBS in which data input and output are stored. |
Parameter | Type | Description |
---|---|---|
name | String | Name of the data output channel. |
local_dir | String | Local directory of the container to which the data output channel is mapped. |
remote | Remote object | Description of the actual data output. |
mode | String | Data transmission mode. The default value is upload_periodically. |
period | String | Data transmission period. The default value is 30s. |
Parameter | Type | Description |
---|---|---|
obs_url | String | OBS URL to which data is exported. |
Parameter | Type | Description |
---|---|---|
engine_id | String | Engine ID, for example, caffe-1.0.0-python2.7. |
engine_name | String | Engine name, for example, Caffe. |
engine_version | String | Engine version. Engines with the same name have multiple versions, for example, Caffe-1.0.0-python2.7 of Python 2.7. |
v1_compatible | Boolean | Whether the v1 compatibility mode is used. |
run_user | String | UID of the user that the engine runs as by default. |
image_url | String | Custom image URL selected for an algorithm. |
Parameter | Type | Description |
---|---|---|
flavor_id | String | ID of the resource flavor. |
flavor_name | String | Name of the resource flavor. |
max_num | Integer | Maximum number of nodes in a resource flavor. |
flavor_type | String | Resource flavor type. |
billing | BillingInfo object | Billing information of a resource flavor. |
flavor_info | FlavorInfoResponse object | Resource flavor details. |
attributes | Map<String,String> | Other specification attributes. |
Parameter | Type | Description |
---|---|---|
max_num | Integer | Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported. |
cpu | Cpu object | CPU specifications. |
gpu | Gpu object | GPU specifications. |
npu | Npu object | NPU specifications. |
memory | Memory object | Memory information. |
disk | DiskResponse object | Disk information. |
Parameter | Type | Description |
---|---|---|
size | Integer | Disk size. |
unit | String | Unit of the disk size. |
Parameter | Type | Description |
---|---|---|
resource | Resource object | Resource flavors of a training job. Select either flavor_id or pool_id+[flavor_id]. |
volumes | Array of JobVolume objects | Volumes attached for a training job. |
log_export_path | LogExportPath object | Export path of training job logs. |
schedule_policy | SchedulePolicy object | Training job scheduling policy. |
Parameter | Type | Description |
---|---|---|
policy | String | Resource specification mode of a training job. The value can be regular, indicating the standard mode. |
flavor_id | String | ID of the resource flavor selected for a training job. flavor_id cannot be specified for dedicated resource pools with CPU specifications. |
flavor_name | String | Read-only flavor name returned by ModelArts when flavor_id is used. |
node_count | Integer | Number of resource replicas selected for a training job. |
pool_id | String | Resource pool ID selected for a training job. |
flavor_detail | FlavorDetail object | Flavor details of a training job or algorithm. This parameter is available only for public resource pools. |
Parameter | Type | Description |
---|---|---|
flavor_type | String | Resource flavor type. |
billing | BillingInfo object | Billing information of a resource flavor. |
flavor_info | FlavorInfo object | Resource flavor details. |
Parameter | Type | Description |
---|---|---|
code | String | Billing code. |
unit_num | Integer | Number of billing units. |
Parameter | Type | Description |
---|---|---|
max_num | Integer | Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported. |
cpu | Cpu object | CPU specifications. |
gpu | Gpu object | GPU specifications. |
npu | Npu object | NPU specifications. |
memory | Memory object | Memory information. |
disk | Disk object | Disk information. |
Parameter | Type | Description |
---|---|---|
arch | String | CPU architecture. |
core_num | Integer | Number of cores. |
Parameter | Type | Description |
---|---|---|
unit_num | Integer | Number of GPUs. |
product_name | String | Product name. |
memory | String | Memory. |
Parameter | Type | Description |
---|---|---|
unit_num | String | Number of NPUs. |
product_name | String | Product name. |
memory | String | Memory. |
Parameter | Type | Description |
---|---|---|
size | Integer | Memory size. |
unit | String | Unit of the memory size. |
Parameter | Type | Description |
---|---|---|
size | String | Disk size. |
unit | String | Unit of the disk size, which is GB generally. |
Parameter | Type | Description |
---|---|---|
nfs_server_path | String | NFS server path, for example, 10.10.10.10:/example/path. |
local_path | String | Path for attaching volumes to the training container, for example, /example/path. |
read_only | Boolean | Whether the disks attached to the container in NFS mode are read-only. |
Parameter | Type | Description |
---|---|---|
obs_url | String | OBS path for storing training job logs, for example, obs://example/path. |
host_path | String | Path of the host where training job logs are stored, for example, /example/path. |
Parameter | Type | Description |
---|---|---|
required_affinity | RequiredAffinity object | Affinity requirements for training jobs. |
priority | Integer | Priority of the training job. |
preemptible | Boolean | Whether preemption is allowed. |
Parameter | Type | Description |
---|---|---|
affinity_type | String | Affinity scheduling policy. |
affinity_group_size | Integer | Affinity group size. This parameter is mandatory when affinity_type is set to hyperinstance. In this case, the system schedules tasks specified by affinity_group_size to a supernode to form an affinity group. When a user delivers a training job to the supernode resource pool, if the affinity group size is not set, the system sets the value to 1 by default. |
Parameter | Type | Description |
---|---|---|
ssh | SSHResp object | SSH connection information. |
jupyter_lab | JupyterLab object | JupyterLab connection information. |
tensorboard | Tensorboard object | TensorBoard connection information. |
mindstudio_insight | MindStudioInsight object | MindStudio Insight connection information. |
Parameter | Type | Description |
---|---|---|
key_pair_names | Array of strings | SSH key pair names. Key pairs can be created and viewed on the Key Pair page of the ECS console. |
task_urls | Array of TaskUrls objects | SSH connection address information. |
Parameter | Type | Description |
---|---|---|
task | String | ID of a training job. |
url | String | SSH connection address of a training job. |
Parameter | Type | Description |
---|---|---|
url | String | JupyterLab address of a training job. |
token | String | JupyterLab token of a training job. |
Parameter | Type | Description |
---|---|---|
url | String | TensorBoard URL of a training job. |
token | String | TensorBoard token of a training job. |
Parameter | Type | Description |
---|---|---|
url | String | MindStudio Insight URL of a training job. |
token | String | MindStudio Insight token of a training job. |
Example Requests
The following is an example of how to terminate the training job whose UUID is cf63aba9-63b1-4219-b717-708a2665100b.
POST https://endpoint/v2/{project_id}/training-jobs/cf63aba9-63b1-4219-b717-708a2665100b/actions
{
"action_type" : "terminate"
}
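The same request can be sketched in Python with the standard library alone. This is an illustration under assumptions, not an official client: the function name and the injectable `opener` argument are hypothetical, and passing the credential in an `X-Auth-Token` header is an assumption, since authentication is not described in this section.

```python
import json
import urllib.request


def terminate_training_job(endpoint, project_id, training_job_id, token,
                           opener=urllib.request.urlopen):
    """POST the terminate action and return the parsed response body.

    `opener` defaults to urllib.request.urlopen and can be replaced
    with a stub for testing. The X-Auth-Token header is an assumption.
    """
    url = "{}/v2/{}/training-jobs/{}/actions".format(
        endpoint.rstrip("/"), project_id, training_job_id
    )
    request = urllib.request.Request(
        url,
        data=json.dumps({"action_type": "terminate"}).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
        method="POST",
    )
    with opener(request) as response:
        # This API documents 202 as its only success status code.
        if response.status != 202:
            raise RuntimeError("unexpected status: {}".format(response.status))
        return json.loads(response.read().decode("utf-8"))
```

Calling `terminate_training_job("https://endpoint", project_id, "cf63aba9-63b1-4219-b717-708a2665100b", token)` would send the request shown above, with `project_id` and `token` supplied by the caller.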
Example Responses
Status code: 202
ok
{
"kind" : "job",
"metadata" : {
"id" : "cf63aba9-63b1-4219-b717-708a2665100b",
"name" : "trainjob--py14_mem06-110",
"description" : "",
"create_time" : 1636515222282,
"workspace_id" : "0",
"user_name" : "ei_modelarts_z00424192_01"
},
"status" : {
"phase" : "Terminating",
"secondary_phase" : "Terminating",
"duration" : 0,
"start_time" : 0,
"node_count_metrics" : null,
"tasks" : [ "worker-0" ]
},
"algorithm" : {
"code_dir" : "obs://test/economic_test/py_minist/",
"boot_file" : "obs://test/economic_test/py_minist/minist_common.py",
"inputs" : [ {
"name" : "data_url",
"local_dir" : "/home/ma-user/modelarts/inputs/data_url_0",
"remote" : {
"obs" : {
"obs_url" : "/test/data/py_minist/"
}
}
} ],
"outputs" : [ {
"name" : "train_url",
"local_dir" : "/home/ma-user/modelarts/outputs/train_url_0",
"remote" : {
"obs" : {
"obs_url" : "/test/train_output/"
}
}
} ],
"engine" : {
"engine_id" : "pytorch-cp36-1.4.0-v2",
"engine_name" : "PyTorch",
"engine_version" : "PyTorch-1.4.0-python3.6-v2"
}
},
"spec" : {
"resource" : {
"policy" : "economic",
"flavor_id" : "modelarts.vm.pnt1.large.eco",
"flavor_name" : "Computing GPU(Pnt1) instance",
"node_count" : 1,
"flavor_detail" : {
"flavor_type" : "GPU",
"billing" : {
"code" : "modelarts.vm.gpu.pnt1.eco",
"unit_num" : 1
},
"flavor_info" : {
"cpu" : {
"arch" : "x86",
"core_num" : 8
},
"gpu" : {
"unit_num" : 1,
"product_name" : "GP-Pnt1",
"memory" : "8GB"
},
"memory" : {
"size" : 64,
"unit" : "GB"
}
}
}
}
}
}
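A caller usually needs only a few fields from this response. The sketch below pulls the job ID, name, phase, and task list out of a parsed response body. The helper name `summarize_job` is hypothetical, and the sample dictionary is a trimmed copy of the example response above.

```python
def summarize_job(response):
    """Extract the fields most callers need from a terminate response."""
    metadata = response.get("metadata") or {}
    status = response.get("status") or {}
    return {
        "id": metadata.get("id"),
        "name": metadata.get("name"),
        "phase": status.get("phase"),
        # tasks may be absent or null, so fall back to an empty list.
        "tasks": status.get("tasks") or [],
    }


# Trimmed copy of the example response.
sample_response = {
    "kind": "job",
    "metadata": {
        "id": "cf63aba9-63b1-4219-b717-708a2665100b",
        "name": "trainjob--py14_mem06-110",
    },
    "status": {
        "phase": "Terminating",
        "secondary_phase": "Terminating",
        "tasks": ["worker-0"],
    },
}
print(summarize_job(sample_response))
```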
Status Codes
Status Code | Description |
---|---|
202 | ok |
Error Codes
See Error Codes.