Terminating a Training Job
Function
This API is used to terminate a training job. Only jobs in the Creating, Waiting, or Running state can be terminated.
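Because only jobs in the Creating, Waiting, or Running state can be terminated, a client may want to check the job's level-1 phase before sending the request. The helper below is a minimal sketch and is not part of any ModelArts SDK. The function name is hypothetical. Waiting does not appear in the phase list of the Status object, so the sketch assumes it corresponds to the Pending phase and accepts both names.

```python
# Hypothetical client-side helper; not part of any ModelArts SDK.
# ASSUMPTION: "Waiting" in this API's description corresponds to the
# "Pending" level-1 phase listed for the Status object, so both are accepted.
TERMINABLE_PHASES = frozenset({"Creating", "Waiting", "Pending", "Running"})


def can_terminate(phase: str) -> bool:
    """Return True if a job in this level-1 phase can be terminated."""
    return phase in TERMINABLE_PHASES


if __name__ == "__main__":
    for phase in ("Running", "Terminating", "Completed"):
        print(f"{phase}: {can_terminate(phase)}")
```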
Debugging
You can debug this API in API Explorer, which provides automatic authentication, or use the SDK sample code generated by API Explorer.
URI
POST /v2/{project_id}/training-jobs/{training_job_id}/actions
Path parameters
Parameter | Mandatory | Type | Description |
---|---|---|---|
project_id | Yes | String | Project ID. For details, see Obtaining a Project ID and Name. |
training_job_id | Yes | String | ID of a training job. |
Request Parameters
Parameter | Mandatory | Type | Description |
---|---|---|---|
action_type | No | String | Operation to perform on the training job. Set this parameter to terminate to terminate the job. |
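The path parameters and request body above can be combined into a request as in the following minimal sketch, which uses only the Python standard library. The endpoint, project ID, and token values are placeholders, the function names are hypothetical, and authenticating with an IAM token in the X-Auth-Token header is an assumption about your account setup.

```python
import json
import urllib.request


def build_terminate_request(endpoint: str, project_id: str,
                            training_job_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that terminates a training job."""
    url = (f"https://{endpoint}/v2/{project_id}/training-jobs/"
           f"{training_job_id}/actions")
    body = json.dumps({"action_type": "terminate"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # ASSUMPTION: token-based IAM authentication via X-Auth-Token.
            "X-Auth-Token": token,
        },
    )


def terminate_training_job(endpoint: str, project_id: str, training_job_id: str,
                           token: str, timeout: float = 30.0) -> dict:
    """Send the terminate request and return the decoded JSON response body."""
    req = build_terminate_request(endpoint, project_id, training_job_id, token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

On success, terminate_training_job returns the decoded 202 response body. HTTP error statuses surface as urllib.error.HTTPError, which callers can catch to inspect the error code.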
Response Parameters
Status code: 202
Parameter | Type | Description |
---|---|---|
kind | String | Training job type. The default value is job. |
metadata | JobMetadata object | Metadata of a training job. |
status | Status object | Status of a training job. You do not need to set this parameter when creating a job. |
algorithm | JobAlgorithmResponse object | Algorithm used by a training job. |
tasks | Array of TaskResponse objects | List of tasks in heterogeneous training jobs. |
spec | spec object | Specifications of a training job. |
JobMetadata object
Parameter | Type | Description |
---|---|---|
id | String | Training job ID, which is generated and returned by ModelArts after the training job is created. |
name | String | Name of a training job. The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-). |
workspace_id | String | Workspace where a job is located. The default value is 0. |
description | String | Training job description. The value must contain 0 to 256 characters. The default value is NULL. |
create_time | Long | Timestamp when a training job is created, in milliseconds. The value is generated and returned by ModelArts after the job is created. |
user_name | String | Username for creating a training job. The username is generated and returned by ModelArts after the training job is created. |
annotations | Map<String,String> | Declaration template of a training job. For heterogeneous jobs, the default value of job_template is Template RL. For other jobs, the default value is Template DL. |
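create_time is a Unix timestamp in milliseconds. A minimal sketch for turning it into a readable UTC time (the helper name is hypothetical):

```python
from datetime import datetime, timedelta, timezone

# The Unix epoch as a timezone-aware UTC datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_utc(ms: int) -> datetime:
    """Convert a millisecond Unix timestamp such as create_time to UTC.

    Adding a timedelta keeps the milliseconds exact and avoids float rounding.
    """
    return EPOCH + timedelta(milliseconds=ms)


if __name__ == "__main__":
    # create_time from the example response on this page
    print(ms_to_utc(1636515222282).isoformat())
```

For the create_time in the example response (1636515222282), this prints 2021-11-10T03:33:42.282000+00:00.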
Status object
Parameter | Type | Description |
---|---|---|
phase | String | Level-1 status of a training job. The value is stable. Options: Creating, Pending, Running, Failed, Completed, Terminating, Terminated, Abnormal |
secondary_phase | String | Level-2 status of a training job. The value is unstable. Options: Creating, Queuing, Running, Failed, Completed, Terminating, Terminated, CreateFailed, TerminatedFailed, Unknown, Lost |
duration | Long | Running duration of a training job, in milliseconds. |
node_count_metrics | Array<Array<Integer>> | Changes in the node count while the training job is running. |
tasks | Array of strings | Tasks of a training job. |
start_time | String | Start time of a training job. The value is in timestamp format. |
task_statuses | Array of task_statuses objects | Status of a training job task. |
task_statuses object
Parameter | Type | Description |
---|---|---|
task | String | Name of a training job task. |
exit_code | Integer | Exit code of a training job task. |
message | String | Error message of a training job task. |
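The example response on this page shows that a terminate request returns while the job is still in the Terminating phase, so callers that need the final state typically poll the job status. The sketch below is hypothetical: fetch_phase is a caller-supplied function (for example, one that queries the job and returns status.phase), and treating Completed, Failed, Terminated, and Abnormal as final phases is an assumption.

```python
import time
from typing import Callable

# ASSUMPTION: jobs do not leave these level-1 phases once they reach them.
FINAL_PHASES = frozenset({"Completed", "Failed", "Terminated", "Abnormal"})


def wait_for_final_phase(fetch_phase: Callable[[], str],
                         interval_s: float = 10.0,
                         max_polls: int = 60,
                         sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch_phase() until it returns a final phase, then return it.

    fetch_phase is hypothetical and supplied by the caller. sleep is
    injectable so the loop can be tested without waiting.
    Raises TimeoutError if no final phase is seen within max_polls calls.
    """
    for attempt in range(max_polls):
        phase = fetch_phase()
        if phase in FINAL_PHASES:
            return phase
        if attempt < max_polls - 1:  # no need to sleep after the last poll
            sleep(interval_s)
    raise TimeoutError(f"no final phase after {max_polls} polls")
```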
JobAlgorithmResponse object
Parameter | Type | Description |
---|---|---|
id | String | ID of the algorithm used by the training job. |
name | String | Algorithm name. |
subscription_id | String | Subscription ID of a subscribed algorithm. It must be used together with item_version_id. |
item_version_id | String | Version ID of the subscribed algorithm. It must be used together with subscription_id. |
code_dir | String | Code directory of a training job, for example, /usr/app/. This parameter must be used together with boot_file. Leave this parameter blank if id or subscription_id+item_version_id is specified. |
boot_file | String | Boot file of a training job, which must be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used together with code_dir. Leave this parameter blank if id or subscription_id+item_version_id is specified. |
autosearch_config_path | String | YAML configuration path of auto search jobs. An OBS URL is required. |
autosearch_framework_path | String | Framework code directory of auto search jobs. An OBS URL is required. |
command | String | Boot command used to start the container of the custom image used by a training job. You can set this parameter to code_dir. |
parameters | Array of Parameter objects | Running parameters of a training job. |
policies | policies object | Policies supported by jobs. |
inputs | Array of Input objects | Input of a training job. |
outputs | Array of Output objects | Output of a training job. |
engine | engine object | Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm in algorithm management, or subscription_id+item_version_id of the subscribed algorithm. |
local_code_dir | String | Local directory in the training container to which the algorithm code directory is downloaded. The following rules apply: the directory must be under /home; this field does not take effect in v1 compatibility mode; this field does not take effect when code_dir is prefixed with file://. |
working_dir | String | Working directory where the algorithm is executed. This parameter does not take effect in v1 compatibility mode. |
environments | Array of Map<String,String> objects | Environment variables of a training job. The format is key: value. Leave this parameter blank. |
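The table above describes three ways to identify the algorithm: id, subscription_id+item_version_id, or code_dir+boot_file, where each pair must be set together. The following sketch classifies an algorithm dictionary accordingly. The function name is hypothetical. It assumes the three options are mutually exclusive, and it returns "none" when no source field is set (possibly, for example, a custom-image job started with command); both behaviors are assumptions.

```python
def algorithm_source(algorithm: dict) -> str:
    """Classify how an algorithm dict identifies its algorithm.

    Returns "id", "subscription", "code", or "none".
    Raises ValueError if a required pair is incomplete or if more than one
    source is set.
    """
    has_sub = bool(algorithm.get("subscription_id"))
    has_ver = bool(algorithm.get("item_version_id"))
    has_code = bool(algorithm.get("code_dir"))
    has_boot = bool(algorithm.get("boot_file"))

    # Each pair from the table must be set together.
    if has_sub != has_ver:
        raise ValueError("subscription_id and item_version_id must be set together")
    if has_code != has_boot:
        raise ValueError("code_dir and boot_file must be set together")

    # ASSUMPTION: the three sources are mutually exclusive.
    sources = [name for name, present in (
        ("id", bool(algorithm.get("id"))),
        ("subscription", has_sub),
        ("code", has_code),
    ) if present]
    if len(sources) > 1:
        raise ValueError(f"more than one algorithm source set: {sources}")
    # ASSUMPTION: "none" may indicate, e.g., a custom-image job driven by command.
    return sources[0] if sources else "none"
```

The algorithm object in the example response sets only code_dir and boot_file, so it classifies as "code".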
Parameter object
Parameter | Type | Description |
---|---|---|
name | String | Parameter name. |
value | String | Parameter value. |
description | String | Parameter description. |
constraint | constraint object | Parameter constraint. |
i18n_description | i18n_description object | Internationalization description. |
constraint object
Parameter | Type | Description |
---|---|---|
type | String | Parameter type. |
editable | Boolean | Whether the parameter is editable. |
required | Boolean | Whether the parameter is mandatory. |
sensitive | Boolean | Whether the parameter is sensitive. |
valid_type | String | Valid type. |
valid_range | Array of strings | Valid range. |
i18n_description object
Parameter | Type | Description |
---|---|---|
language | String | Internationalization language. |
description | String | Description. |
policies object
Parameter | Type | Description |
---|---|---|
auto_search | auto_search object | Hyperparameter search configuration. |
auto_search object
Parameter | Type | Description |
---|---|---|
skip_search_params | String | Hyperparameters to skip during the search. |
reward_attrs | Array of reward_attrs objects | List of search metrics. |
search_params | Array of search_params objects | Search parameters. |
algo_configs | Array of algo_configs objects | Search algorithm configurations. |
reward_attrs object
Parameter | Type | Description |
---|---|---|
name | String | Metric name. |
mode | String | Search direction. |
regex | String | Regular expression of a metric. |
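How ModelArts applies the metric regex internally is not described here. As one plausible client-side reading, the sketch below extracts metric values from training log lines with the regex and picks the best one according to the search direction. The function name is hypothetical, and it assumes that the regex's first capture group holds the number and that mode takes the values "max" or "min" (the actual option values are not listed above).

```python
import re
from typing import Iterable, Optional


def best_metric(log_lines: Iterable[str], regex: str, mode: str) -> Optional[float]:
    """Return the best value matched by regex across log lines, or None.

    ASSUMPTIONS: the regex's first capture group holds the numeric value,
    and mode is "max" (larger is better) or "min" (smaller is better).
    """
    if mode not in ("max", "min"):
        raise ValueError(f"unsupported mode: {mode!r}")
    pattern = re.compile(regex)
    values = []
    for line in log_lines:
        match = pattern.search(line)
        if match:
            values.append(float(match.group(1)))
    if not values:
        return None
    return max(values) if mode == "max" else min(values)
```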
search_params object
Parameter | Type | Description |
---|---|---|
name | String | Hyperparameter name. |
param_type | String | Parameter type. |
lower_bound | String | Lower bound of the hyperparameter. |
upper_bound | String | Upper bound of the hyperparameter. |
discrete_points_num | String | Number of discrete points of a continuous hyperparameter. |
discrete_values | Array of strings | List of discrete hyperparameter values. |
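How ModelArts turns these fields into trial values is not described here. As one plausible client-side reading, a hyperparameter with discrete_values uses that list directly, while a continuous one is sampled at discrete_points_num evenly spaced points between lower_bound and upper_bound. The bounds and point count are strings in this API, so the sketch parses them. The function name and the even-spacing interpretation are assumptions.

```python
def search_space(param: dict) -> list:
    """Return candidate values for one search_params entry.

    ASSUMPTION: a non-empty discrete_values list is used as-is; otherwise
    discrete_points_num evenly spaced values are taken between the bounds,
    both ends included.
    """
    if param.get("discrete_values"):
        return list(param["discrete_values"])
    lower = float(param["lower_bound"])
    upper = float(param["upper_bound"])
    count = int(param["discrete_points_num"])
    if count < 1:
        raise ValueError("discrete_points_num must be at least 1")
    if count == 1:
        return [lower]
    step = (upper - lower) / (count - 1)
    return [lower + i * step for i in range(count)]


if __name__ == "__main__":
    print(search_space({"name": "lr", "lower_bound": "0", "upper_bound": "1",
                        "discrete_points_num": "5"}))
```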
algo_configs object
Parameter | Type | Description |
---|---|---|
name | String | Name of the search algorithm. |
params | Array of AutoSearchAlgoConfigParameter objects | Search algorithm parameters. |
AutoSearchAlgoConfigParameter object
Parameter | Type | Description |
---|---|---|
key | String | Parameter key. |
value | String | Parameter value. |
type | String | Parameter type. |
Input object
Parameter | Type | Description |
---|---|---|
name | String | Name of the data input channel. |
description | String | Description of the data input channel. |
local_dir | String | Local directory of the container to which the data input channel is mapped. |
remote | InputDataInfo object | Data input. |
remote_constraint | Array of remote_constraint objects | Data input constraint. |
InputDataInfo object
Parameter | Type | Description |
---|---|---|
dataset | dataset object | Dataset as the data input. |
obs | obs object | OBS location where data input and output are stored. |
dataset object
Parameter | Type | Description |
---|---|---|
id | String | Dataset ID of a training job. |
version_id | String | Dataset version ID of a training job. |
obs_url | String | OBS URL of the dataset required by a training job. ModelArts automatically parses and generates the URL based on the dataset and dataset version IDs. For example, /usr/data/. |
obs object
Parameter | Type | Description |
---|---|---|
obs_url | String | OBS URL of the dataset required by a training job. For example, /usr/data/. |
remote_constraint object
Parameter | Type | Description |
---|---|---|
data_type | String | Data input type, including the data storage location and dataset. |
attributes | String | Attributes used when a dataset is the data input. |
Output object
Parameter | Type | Description |
---|---|---|
name | String | Name of the data output channel. |
description | String | Description of the data output channel. |
local_dir | String | Local directory of the container to which the data output channel is mapped. |
remote | remote object | Description of the actual data output. |
remote object
Parameter | Type | Description |
---|---|---|
obs_url | String | OBS URL to which data is actually exported. |
engine object
Parameter | Type | Description |
---|---|---|
engine_id | String | Engine ID selected for a training job. Specify the engine using engine_id, engine_name + engine_version, or image_url. |
engine_name | String | Name of the engine selected for a training job. If engine_id is set, leave this parameter blank. |
engine_version | String | Version of the engine selected for a training job. If engine_id is set, leave this parameter blank. |
image_url | String | Custom image URL selected for a training job. |
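The engine can be specified by engine_id, by engine_name+engine_version, or by image_url. The following is a minimal sketch of building a request-side engine dictionary, assuming that exactly one of these three modes is used. The function name is hypothetical.

```python
from typing import Optional


def build_engine_spec(engine_id: Optional[str] = None,
                      engine_name: Optional[str] = None,
                      engine_version: Optional[str] = None,
                      image_url: Optional[str] = None) -> dict:
    """Return an engine dict that uses exactly one selection mode.

    ASSUMPTION: engine_id, engine_name+engine_version, and image_url are
    mutually exclusive alternatives when creating a job.
    """
    if bool(engine_name) != bool(engine_version):
        raise ValueError("engine_name and engine_version must be given together")
    candidates = []
    if engine_id:
        candidates.append({"engine_id": engine_id})
    if engine_name:
        candidates.append({"engine_name": engine_name, "engine_version": engine_version})
    if image_url:
        candidates.append({"image_url": image_url})
    if len(candidates) != 1:
        raise ValueError("set exactly one of engine_id, "
                         "engine_name+engine_version, or image_url")
    return candidates[0]
```

Response bodies, such as the example response on this page, may return engine_id, engine_name, and engine_version together, so this check applies to building requests rather than to parsing responses.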
TaskResponse object
Parameter | Type | Description |
---|---|---|
role | String | Role of a heterogeneous training job. |
algorithm | algorithm object | Algorithm management and configuration. |
task_resource | FlavorResponse object | Flavors of a training job or an algorithm. |
algorithm object
Parameter | Type | Description |
---|---|---|
code_dir | String | Absolute path of the directory where the algorithm boot file is stored. |
boot_file | String | Absolute path of the algorithm boot file. |
inputs | inputs object | Algorithm input channel. |
outputs | outputs object | Algorithm output channel. |
engine | engine object | Engine on which a heterogeneous job depends. |
local_code_dir | String | Local directory in the training container to which the algorithm code directory is downloaded. The following rules apply: the directory must be under /home; this field does not take effect in v1 compatibility mode; this field does not take effect when code_dir is prefixed with file://. |
working_dir | String | Working directory where the algorithm is executed. This parameter does not take effect in v1 compatibility mode. |
inputs object
Parameter | Type | Description |
---|---|---|
name | String | Name of the data input channel. |
local_dir | String | Local path of the container to which the data input and output channels are mapped. |
remote | remote object | Actual data input. Heterogeneous jobs support only OBS. |
remote object
Parameter | Type | Description |
---|---|---|
obs_url | String | OBS URL of the dataset required by a training job. For example, /usr/data/. |
outputs object
Parameter | Type | Description |
---|---|---|
name | String | Name of the data output channel. |
local_dir | String | Local directory of the container to which the data output channel is mapped. |
remote | remote object | Description of the actual data output. |
mode | String | Data transmission mode. The default value is upload_periodically. |
period | String | Data transmission period. The default value is 30s. |
remote object
Parameter | Type | Description |
---|---|---|
obs_url | String | OBS URL to which data is actually exported. |
engine object
Parameter | Type | Description |
---|---|---|
engine_id | String | Engine ID of a heterogeneous job, for example, caffe-1.0.0-python2.7. |
engine_name | String | Engine name of a heterogeneous job, for example, Caffe. |
engine_version | String | Engine version of a heterogeneous job. |
v1_compatible | Boolean | Whether the v1 compatibility mode is used. |
run_user | String | UID of the user that the engine runs as by default. |
image_url | String | Custom image URL selected by an algorithm. |
FlavorResponse object
Parameter | Type | Description |
---|---|---|
flavor_id | String | ID of the resource flavor. |
flavor_name | String | Name of the resource flavor. |
max_num | Integer | Maximum number of nodes in a resource flavor. |
flavor_type | String | Resource flavor type. |
billing | billing object | Billing information of a resource flavor. |
flavor_info | flavor_info object | Resource flavor details. |
attributes | Map<String,String> | Other specification attributes. |
billing object
Parameter | Type | Description |
---|---|---|
code | String | Billing code. |
unit_num | Integer | Number of billing units. |
flavor_info object
Parameter | Type | Description |
---|---|---|
max_num | Integer | Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported. |
cpu | cpu object | CPU specifications. |
gpu | gpu object | GPU specifications. |
npu | npu object | Ascend specifications. |
memory | memory object | Memory information. |
disk | disk object | Disk information. |
cpu object
Parameter | Type | Description |
---|---|---|
arch | String | CPU architecture. |
core_num | Integer | Number of cores. |
gpu object
Parameter | Type | Description |
---|---|---|
unit_num | Integer | Number of GPUs. |
product_name | String | Product name. |
memory | String | GPU memory. |
npu object
Parameter | Type | Description |
---|---|---|
unit_num | String | Number of NPUs. |
product_name | String | Product name. |
memory | String | NPU memory. |
memory object
Parameter | Type | Description |
---|---|---|
size | Integer | Memory size. |
unit | String | Unit of the memory size, for example, GB. |
disk object
Parameter | Type | Description |
---|---|---|
size | Integer | Disk size. |
unit | String | Unit of the disk size. |
spec object
Parameter | Type | Description |
---|---|---|
resource | Resource object | Resource flavors of a training job. Select either flavor_id or pool_id+[flavor_id]. |
volumes | Array of volumes objects | Volumes attached to a training job. |
log_export_path | log_export_path object | Export path of training job logs. |
Resource object
Parameter | Type | Description |
---|---|---|
policy | String | Resource policy of a training job. Options: regular |
flavor_id | String | Resource flavor ID of a training job. This parameter is not supported by CPU-powered dedicated resource pools. |
flavor_name | String | Read-only flavor name returned by ModelArts when flavor_id is used. |
node_count | Integer | Number of resource replicas selected for a training job. |
pool_id | String | Resource pool ID selected for a training job. |
flavor_detail | flavor_detail object | Flavors of a training job or an algorithm. |
flavor_detail object
Parameter | Type | Description |
---|---|---|
flavor_type | String | Resource flavor type. |
billing | billing object | Billing information of a resource flavor. |
flavor_info | flavor_info object | Resource flavor details. |
billing object
Parameter | Type | Description |
---|---|---|
code | String | Billing code. |
unit_num | Integer | Number of billing units. |
flavor_info object
Parameter | Type | Description |
---|---|---|
max_num | Integer | Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported. |
cpu | cpu object | CPU specifications. |
gpu | gpu object | GPU specifications. |
npu | npu object | Ascend specifications. |
memory | memory object | Memory information. |
disk | disk object | Disk information. |
cpu object
Parameter | Type | Description |
---|---|---|
arch | String | CPU architecture. |
core_num | Integer | Number of cores. |
gpu object
Parameter | Type | Description |
---|---|---|
unit_num | Integer | Number of GPUs. |
product_name | String | Product name. |
memory | String | GPU memory. |
npu object
Parameter | Type | Description |
---|---|---|
unit_num | String | Number of NPUs. |
product_name | String | Product name. |
memory | String | NPU memory. |
memory object
Parameter | Type | Description |
---|---|---|
size | Integer | Memory size. |
unit | String | Unit of the memory size, for example, GB. |
disk object
Parameter | Type | Description |
---|---|---|
size | String | Disk size. |
unit | String | Unit of the disk size. Generally, the value is GB. |
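The flavor_detail fields nest several objects. The following minimal sketch condenses them into a one-line summary. The function name and output format are hypothetical, and any missing sub-object is skipped.

```python
def describe_flavor(flavor_detail: dict) -> str:
    """Summarize a flavor_detail (or FlavorResponse) dict in one line."""
    info = flavor_detail.get("flavor_info", {})
    parts = []
    cpu = info.get("cpu")
    if cpu:
        parts.append(f"{cpu['core_num']}-core {cpu['arch']} CPU")
    for kind in ("gpu", "npu"):
        accel = info.get(kind)
        if accel:
            parts.append(f"{accel['unit_num']} x {accel['product_name']} ({accel['memory']})")
    mem = info.get("memory")
    if mem:
        parts.append(f"{mem['size']} {mem['unit']} memory")
    disk = info.get("disk")
    if disk:
        parts.append(f"{disk['size']} {disk['unit']} disk")
    flavor_type = flavor_detail.get("flavor_type", "unknown")
    return f"{flavor_type}: {', '.join(parts)}" if parts else flavor_type
```

Applied to the flavor_detail in the example response, this returns "GPU: 8-core x86 CPU, 1 x NVIDIA-P100 (8GB), 64 GB memory".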
volumes object
Parameter | Type | Description |
---|---|---|
nfs_server_path | String | NFS server path. |
local_path | String | Path for attaching volumes to the training container. |
read_only | Boolean | Whether the volumes attached to the container in NFS mode are read-only. |
log_export_path object
Parameter | Type | Description |
---|---|---|
obs_url | String | OBS URL for storing training job logs. |
host_path | String | Path of the host where training job logs are stored. |
Example Requests
The following example terminates the training job whose ID is cf63aba9-63b1-4219-b717-708a2665100b.
POST https://endpoint/v2/{project_id}/training-jobs/cf63aba9-63b1-4219-b717-708a2665100b/actions
{
"action_type" : "terminate"
}
Example Responses
Status code: 202
ok
{
"kind" : "job",
"metadata" : {
"id" : "cf63aba9-63b1-4219-b717-708a2665100b",
"name" : "trainjob--py14_mem06-110",
"description" : "",
"create_time" : 1636515222282,
"workspace_id" : "0",
"user_name" : "ei_modelarts_z00424192_01"
},
"status" : {
"phase" : "Terminating",
"secondary_phase" : "Terminating",
"duration" : 0,
"start_time" : 0,
"node_count_metrics" : null,
"tasks" : [ "worker-0" ]
},
"algorithm" : {
"code_dir" : "obs://test/economic_test/py_minist/",
"boot_file" : "obs://test/economic_test/py_minist/minist_common.py",
"inputs" : [ {
"name" : "data_url",
"local_dir" : "/home/ma-user/modelarts/inputs/data_url_0",
"remote" : {
"obs" : {
"obs_url" : "/test/data/py_minist/"
}
}
} ],
"outputs" : [ {
"name" : "train_url",
"local_dir" : "/home/ma-user/modelarts/outputs/train_url_0",
"remote" : {
"obs" : {
"obs_url" : "/test/train_output/"
}
}
} ],
"engine" : {
"engine_id" : "pytorch-cp36-1.4.0-v2",
"engine_name" : "PyTorch",
"engine_version" : "PyTorch-1.4.0-python3.6-v2"
}
},
"spec" : {
"resource" : {
"policy" : "economic",
"flavor_id" : "modelarts.vm.p100.large.eco",
"flavor_name" : "Computing GPU(P100) instance",
"node_count" : 1,
"flavor_detail" : {
"flavor_type" : "GPU",
"billing" : {
"code" : "modelarts.vm.gpu.p100.eco",
"unit_num" : 1
},
"flavor_info" : {
"cpu" : {
"arch" : "x86",
"core_num" : 8
},
"gpu" : {
"unit_num" : 1,
"product_name" : "NVIDIA-P100",
"memory" : "8GB"
},
"memory" : {
"size" : 64,
"unit" : "GB"
}
}
}
}
}
}
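A client usually needs only a few fields from this response. The following minimal sketch extracts the job ID, name, and both status levels from the JSON body. The function name is hypothetical.

```python
import json


def summarize_terminate_response(raw: str) -> dict:
    """Extract the fields usually checked after a terminate request."""
    body = json.loads(raw)
    metadata = body.get("metadata", {})
    status = body.get("status", {})
    return {
        "job_id": metadata.get("id"),
        "name": metadata.get("name"),
        "phase": status.get("phase"),
        "secondary_phase": status.get("secondary_phase"),
    }


if __name__ == "__main__":
    # Trimmed copy of the example response above.
    example = """{
      "kind": "job",
      "metadata": {"id": "cf63aba9-63b1-4219-b717-708a2665100b",
                   "name": "trainjob--py14_mem06-110"},
      "status": {"phase": "Terminating", "secondary_phase": "Terminating"}
    }"""
    print(summarize_terminate_response(example))
```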
Status Codes
Status Code | Description |
---|---|
202 | ok |
Error Codes
See Error Codes.