Creating a Dataset

Function

This API is used to create a dataset.

URI

POST /v2/{project_id}/datasets

Table 1 Path Parameters

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| project_id | Yes | String | Project ID. For details about how to obtain the project ID, see Obtaining a Project ID. |
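As a minimal sketch of how the URI above can be assembled, the helper below fills the project_id path parameter into the request path. The endpoint host and the function name are hypothetical placeholders, not values given in this reference; substitute the endpoint of your own region.

```python
from urllib.parse import quote

# Hypothetical endpoint (assumption); replace it with the endpoint of your region.
ENDPOINT = "https://modelarts.example.com"


def dataset_create_url(project_id: str, endpoint: str = ENDPOINT) -> str:
    """Return the URL for POST /v2/{project_id}/datasets."""
    if not project_id:
        raise ValueError("project_id is mandatory")
    # Percent-encode the ID so that it always stays a single path segment.
    return f"{endpoint.rstrip('/')}/v2/{quote(project_id, safe='')}/datasets"
```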

Request Parameters

Table 2 Request body parameters

Parameter

Mandatory

Type

Description

data_format

No

String

Data format. The options are as follows:

  • Default: default format

  • CarbonData: CarbonData (supported only by table datasets)

data_sources

No

Array of DataSource objects

Input dataset path, which is used to synchronize source data (such as images, text files, and audio files) in the directory and its subdirectories to the dataset. For a table dataset, this parameter indicates the import directory. The work directory of a table dataset cannot be an OBS path in a KMS-encrypted bucket.

dataset_name

Yes

String

Dataset name. The value contains 1 to 100 characters. Only letters, digits, underscores (_), and hyphens (-) are allowed, for example, dataset-9f3b.

dataset_type

No

Integer

Dataset type. The options are as follows:

  • 0: image classification

  • 1: object detection

  • 100: text classification

  • 101: named entity recognition

  • 102: text triplet

  • 200: sound classification

  • 201: speech content

  • 202: speech paragraph labeling

  • 400: table dataset

  • 600: video labeling

  • 900: custom format

description

No

String

Dataset description. The value is empty by default. The description contains 0 to 256 characters and does not support the following special characters: ^!<>=&"'

import_annotations

No

Boolean

Whether to automatically import the labeling information in the input directory. Object detection, image classification, and text classification are supported. The options are as follows:

  • true: Import labeling information in the input directory. (Default value)

  • false: Do not import labeling information in the input directory.

import_data

No

Boolean

Whether to import data. This parameter is used only for table datasets. The options are as follows:

  • true: Import data when creating a dataset.

  • false: Do not import data when creating a dataset. (Default value)

label_format

No

LabelFormat object

Label format information. This parameter is used only for text datasets.

labels

No

Array of Label objects

Dataset label list.

managed

No

Boolean

Whether to host a dataset. The options are as follows:

  • true: Host a dataset.

  • false: Do not host a dataset. (Default value)

schema

No

Array of Field objects

Schema list.

work_path

Yes

String

Output dataset path, which is used to store output files such as label files.

  • The format is /Bucket name/File path, for example, /obs-bucket/flower/rose/. (The directory is used as the path.)

  • A bucket cannot be directly used as a path.

  • The output dataset path cannot be the same as the input dataset path or any of its subdirectories.

  • The value contains 3 to 700 characters.

work_path_type

Yes

Integer

Type of the dataset output path. The options are as follows:

  • 0: OBS bucket (default value)

workforce_information

No

WorkforceInformation object

Team labeling information.

workspace_id

No

String

Workspace ID. If no workspace is created, the default value is 0. If a workspace is created and used, use the actual value.
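The constraints in Table 2 can be checked on the client before a request is sent. The following is a minimal sketch, not the service's own validation: the function name is hypothetical, "letters" is assumed to mean ASCII letters, and only dataset_name, dataset_type, description, work_path, and work_path_type are checked.

```python
import re
from typing import List

# Assumption: "letters" in the dataset_name rule means ASCII letters.
DATASET_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,100}")
DESCRIPTION_FORBIDDEN = set("^!<>=&\"'")
DATASET_TYPES = {0, 1, 100, 101, 102, 200, 201, 202, 400, 600, 900}


def validate_create_body(body: dict) -> List[str]:
    """Return a list of problems found in a request body (empty if none)."""
    problems = []

    name = body.get("dataset_name")
    if not isinstance(name, str) or not DATASET_NAME_RE.fullmatch(name):
        problems.append("dataset_name: 1 to 100 letters, digits, '_' or '-'")

    if "dataset_type" in body and body["dataset_type"] not in DATASET_TYPES:
        problems.append("dataset_type: not one of the listed options")

    description = body.get("description") or ""
    if len(description) > 256 or DESCRIPTION_FORBIDDEN & set(description):
        problems.append("description: at most 256 characters, none of ^!<>=&\"'")

    if body.get("work_path_type") != 0:
        problems.append("work_path_type: mandatory, and 0 is the only listed option")

    work_path = body.get("work_path")
    if not isinstance(work_path, str) or not 3 <= len(work_path) <= 700:
        problems.append("work_path: mandatory, 3 to 700 characters")
    elif not work_path.startswith("/") or len([p for p in work_path.split("/") if p]) < 2:
        # "/bucket/" has a single path component, so it is a bare bucket.
        problems.append("work_path: must be /bucket/path, not a bare bucket")
    else:
        out = work_path.rstrip("/") + "/"
        for source in body.get("data_sources") or []:
            data_path = source.get("data_path")
            if isinstance(data_path, str) and data_path.strip("/"):
                # The output path must not be the input path or lie inside it.
                if out.startswith(data_path.rstrip("/") + "/"):
                    problems.append("work_path: must not be the input path or inside it")
    return problems
```

An empty list means that none of the checked rules is violated; the service may still reject the request for reasons this sketch does not cover.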

Table 3 DataSource

Parameter

Mandatory

Type

Description

data_path

No

String

Data source path.

data_type

No

Integer

Data type. The options are as follows:

  • 0: OBS bucket (default value)

  • 1: GaussDB(DWS)

  • 2: DLI

  • 3: RDS

  • 4: MRS

  • 5: AI Gallery

  • 6: Inference service

schema_maps

No

Array of SchemaMap objects

Schema mapping information corresponding to the table data.

source_info

No

SourceInfo object

Information required for importing a table data source.

with_column_header

No

Boolean

Whether the first row in the file contains the column names. This field applies only to table datasets. The options are as follows:

  • true: The first row in the file contains the column names.

  • false: The first row in the file does not contain the column names.

Table 4 SchemaMap

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| dest_name | No | String | Name of the destination column. |
| src_name | No | String | Name of the source column. |

Table 5 SourceInfo

Parameter

Mandatory

Type

Description

cluster_id

No

String

ID of an MRS cluster.

cluster_mode

No

String

Running mode of an MRS cluster. The options are as follows:

  • 0: normal cluster

  • 1: security cluster

cluster_name

No

String

Name of an MRS cluster.

database_name

No

String

Name of the database to which the table dataset is imported.

input

No

String

HDFS path of a table dataset.

ip

No

String

IP address of your GaussDB(DWS) cluster.

port

No

String

Port number of your GaussDB(DWS) cluster.

queue_name

No

String

DLI queue name of a table dataset.

subnet_id

No

String

Subnet ID of an MRS cluster.

table_name

No

String

Name of the table to which a table dataset is imported.

user_name

No

String

Username, which is mandatory for GaussDB(DWS) data.

user_password

No

String

User password, which is mandatory for GaussDB(DWS) data.

vpc_id

No

String

ID of the VPC where an MRS cluster resides.
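To illustrate how Table 3 and Table 5 fit together, the sketch below assembles a DataSource entry for a GaussDB(DWS) import (data_type 1). The function name is hypothetical, and the exact set of source_info fields the service expects for this source type is an assumption; the tables only state that user_name and user_password are mandatory for GaussDB(DWS) data.

```python
def build_dws_source(ip, port, database_name, table_name, user_name, user_password):
    """Return a DataSource dict for a GaussDB(DWS) import (data_type 1)."""
    if not user_name or not user_password:
        raise ValueError("user_name and user_password are mandatory for GaussDB(DWS)")
    return {
        "data_type": 1,  # 1: GaussDB(DWS), see Table 3
        "source_info": {
            "ip": ip,
            "port": str(port),  # Table 5 types port as a String
            "database_name": database_name,
            "table_name": table_name,
            "user_name": user_name,
            "user_password": user_password,
        },
    }
```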

Table 6 LabelFormat

Parameter

Mandatory

Type

Description

label_type

No

String

Label type of text classification. The options are as follows:

  • 0: The label is separated from the text, and they are distinguished by the fixed suffix _result. For example, the text file is abc.txt, and the label file is abc_result.txt.

  • 1: Default value. The text and its labels are stored in the same file. Use text_sample_separator to specify the separator between the text and the labels, and text_label_separator to specify the separator between labels.

text_label_separator

No

String

Separator between labels. By default, a comma (,) is used. The separator must be escaped. It can contain only one character: a letter, a digit, or one of the following special characters: !@#$%^&*_=|?/':.;,

text_sample_separator

No

String

Separator between the text and its labels. By default, a tab character is used. The separator must be escaped. It can contain only one character: a letter, a digit, or one of the following special characters: !@#$%^&*_=|?/':.;,
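For label_type 1, the two separators in Table 6 determine how a line of a labeled text file is split. The sketch below shows one plausible reading, assuming that each line holds the text first, then the sample separator, then the labels; that line layout and the function name are assumptions, not something this reference specifies.

```python
def parse_labeled_line(line, text_sample_separator="\t", text_label_separator=","):
    """Split one line into (text, labels) for label_type 1."""
    for sep in (text_sample_separator, text_label_separator):
        if len(sep) != 1:
            raise ValueError("a separator can contain only one character")
    # Split at the first sample separator: text on the left, labels on the right.
    text, found, label_part = line.rstrip("\n").partition(text_sample_separator)
    if not found:
        return text, []  # unlabeled line
    labels = [label for label in label_part.split(text_label_separator) if label]
    return text, labels
```

The defaults match the default values in Table 6: a tab between the text and the labels, and a comma between labels.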

Table 7 Label

Parameter

Mandatory

Type

Description

attributes

No

Array of LabelAttribute objects

Multi-dimensional attribute of a label. For example, if the label is music, attributes such as style and artist may be included.

name

No

String

Label name.

property

No

LabelProperty object

Basic attribute key-value pair of a label, such as color and shortcut keys.

type

No

Integer

Label type. The options are as follows:

  • 0: image classification

  • 1: object detection

  • 100: text classification

  • 101: named entity recognition

  • 102: text triplet relationship

  • 103: text triplet entity

  • 200: speech classification

  • 201: speech content

  • 202: speech paragraph labeling

  • 600: video classification

Table 8 LabelAttribute

Parameter

Mandatory

Type

Description

default_value

No

String

Default value of a label attribute.

id

No

String

Label attribute ID.

name

No

String

Label attribute name.

type

No

String

Label attribute type. The options are as follows:

  • text: text

  • select: single-choice drop-down list

values

No

Array of LabelAttributeValue objects

List of label attribute values.

Table 9 LabelAttributeValue

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| id | No | String | Label attribute value ID. |
| value | No | String | Label attribute value. |

Table 10 LabelProperty

Parameter

Mandatory

Type

Description

@modelarts:color

No

String

Default attribute: Label color, which is a hexadecimal code of the color. By default, this parameter is left blank. Example: #FFFFF0.

@modelarts:default_shape

No

String

Default attribute: Default shape of an object detection label (dedicated attribute). By default, this parameter is left blank. The options are as follows:

  • bndbox: rectangle

  • polygon: polygon

  • circle: circle

  • line: straight line

  • dashed: dashed line

  • point: point

  • polyline: polyline

@modelarts:from_type

No

String

Default attribute: Type of the head entity in the triplet relationship label. This attribute must be specified when a relationship label is created. This parameter is used only for the text triplet dataset.

@modelarts:rename_to

No

String

Default attribute: The new name of the label.

@modelarts:shortcut

No

String

Default attribute: Label shortcut key. By default, this parameter is left blank. For example: D.

@modelarts:to_type

No

String

Default attribute: Type of the tail entity in the triplet relationship label. This attribute must be specified when a relationship label is created. This parameter is used only for the text triplet dataset.
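The Label and LabelProperty structures (Table 7 and Table 10) can be combined as in the following sketch. The helper name is hypothetical, and the six-digit form of the color code is an assumption based on the #FFFFF0 example; the from_type and to_type check follows the rule that both must be specified for a triplet relationship label (type 102).

```python
import re

# Assumption: colors use the six-digit form shown in the example (#FFFFF0).
HEX_COLOR_RE = re.compile(r"#[0-9A-Fa-f]{6}")


def make_label(name, label_type, color=None, shortcut=None, from_type=None, to_type=None):
    """Return a Label dict with an optional LabelProperty object."""
    prop = {}
    if color is not None:
        if not HEX_COLOR_RE.fullmatch(color):
            raise ValueError("color must be a hexadecimal code such as #FFFFF0")
        prop["@modelarts:color"] = color
    if shortcut is not None:
        prop["@modelarts:shortcut"] = shortcut
    if label_type == 102:  # 102: text triplet relationship
        if not from_type or not to_type:
            raise ValueError("from_type and to_type are required for relationship labels")
        prop["@modelarts:from_type"] = from_type
        prop["@modelarts:to_type"] = to_type
    label = {"name": name, "type": label_type}
    if prop:
        label["property"] = prop
    return label
```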

Table 11 Field

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| description | No | String | Schema description. |
| name | No | String | Schema name. |
| schema_id | No | Integer | Schema ID. |
| type | No | String | Schema value type. |

Table 12 WorkforceInformation

Parameter

Mandatory

Type

Description

data_sync_type

No

Integer

Synchronization type. The options are as follows:

  • 0: no synchronization

  • 1: synchronize data

  • 2: synchronize labels

  • 3: synchronize data and labels

repetition

No

Integer

Number of persons who label each sample. The minimum value is 1.

synchronize_auto_labeling_data

No

Boolean

Whether to synchronously update auto labeling data. The options are as follows:

  • true: Update auto labeling data synchronously.

  • false: Do not update auto labeling data synchronously.

synchronize_data

No

Boolean

Whether to synchronize updated data, such as uploading files, synchronizing data sources, and assigning imported unlabeled files to team members. The options are as follows:

  • true: Synchronize updated data to team members.

  • false: Do not synchronize updated data to team members.

task_id

No

String

ID of a team labeling task.

task_name

Yes

String

Name of a team labeling task. The value contains 1 to 64 characters, including only letters, digits, underscores (_), and hyphens (-).

workforces_config

No

WorkforcesConfig object

Personnel assignment for a team labeling task. You can delegate the assignment to the team administrator or assign members yourself.

Table 13 WorkforcesConfig

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| agency | No | String | Team administrator. |
| workforces | No | Array of WorkforceConfig objects | List of teams that execute labeling tasks. |

Table 14 WorkforceConfig

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| workers | No | Array of Worker objects | List of labeling team members. |
| workforce_id | No | String | ID of a labeling team. |
| workforce_name | No | String | Name of a labeling team. The value contains 0 to 1024 characters and does not support the following special characters: !<>=&"' |

Table 15 Worker

Parameter

Mandatory

Type

Description

create_time

No

Long

Creation time.

description

No

String

Labeling team member description. The value contains 0 to 256 characters and does not support the following special characters: ^!<>=&"'

email

No

String

Email address of a labeling team member.

role

No

Integer

Role. The options are as follows:

  • 0: labeling personnel

  • 1: reviewer

  • 2: team administrator

  • 3: dataset owner

status

No

Integer

Current login status of a labeling team member. The options are as follows:

  • 0: The invitation email has not been sent.

  • 1: The invitation email has been sent but the user has not logged in.

  • 2: The user has logged in.

  • 3: The labeling team member has been deleted.

update_time

No

Long

Update time.

worker_id

No

String

ID of a labeling team member.

workforce_id

No

String

ID of a labeling team.

Response Parameters

Status code: 201

Table 16 Response body parameters

| Parameter | Type | Description |
|---|---|---|
| dataset_id | String | Dataset ID. |
| error_code | String | Error code. |
| error_msg | String | Error message. |
| import_task_id | String | ID of an import task. |

Example Requests

  • Creating an Image Classification Dataset

    {
      "workspace_id" : "0",
      "dataset_name" : "dataset-457f",
      "dataset_type" : 0,
      "data_sources" : [ {
        "data_type" : 0,
        "data_path" : "/test-obs/classify/input/cat-rabbit/"
      } ],
      "description" : "",
      "work_path" : "/test-obs/classify/output/",
      "work_path_type" : 0,
      "labels" : [ {
        "name" : "Cat",
        "type" : 0,
        "property" : {
          "@modelarts:color" : "#3399ff"
        }
      }, {
        "name" : "Rabbit",
        "type" : 0,
        "property" : {
          "@modelarts:color" : "#3399ff"
        }
      } ]
    }
    
  • Creating an Object Detection Dataset

    {
      "workspace_id" : "0",
      "dataset_name" : "dataset-95a6",
      "dataset_type" : 1,
      "data_sources" : [ {
        "data_type" : 0,
        "data_path" : "/test-obs/detect/input/cat-rabbit/"
      } ],
      "description" : "",
      "work_path" : "/test-obs/detect/output/",
      "work_path_type" : 0,
      "labels" : [ {
        "name" : "Cat",
        "type" : 1,
        "property" : {
          "@modelarts:color" : "#3399ff"
        }
      }, {
        "name" : "Rabbit",
        "type" : 1,
        "property" : {
          "@modelarts:color" : "#3399ff"
        }
      } ]
    }
    
  • Creating a Table Dataset

    {
      "workspace_id" : "0",
      "dataset_name" : "dataset-de83",
      "dataset_type" : 400,
      "data_sources" : [ {
        "data_type" : 0,
        "data_path" : "/test-obs/table/input/",
        "with_column_header" : true
      } ],
      "description" : "",
      "work_path" : "/test-obs/table/output/",
      "work_path_type" : 0,
      "schema" : [ {
        "schema_id" : 1,
        "name" : "150",
        "type" : "STRING"
      }, {
        "schema_id" : 2,
        "name" : "4",
        "type" : "STRING"
      }, {
        "schema_id" : 3,
        "name" : "setosa",
        "type" : "STRING"
      }, {
        "schema_id" : 4,
        "name" : "versicolor",
        "type" : "STRING"
      }, {
        "schema_id" : 5,
        "name" : "virginica",
        "type" : "STRING"
      } ],
      "import_data" : true
    }
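Any of the example bodies above can be turned into an HTTP request along the following lines. This is a sketch using only the Python standard library; the X-Auth-Token header and the function name are assumptions, because this page does not describe authentication, and the URL is a placeholder.

```python
import json
import urllib.request


def build_create_request(url, token, body):
    """Build (but do not send) the POST request that creates a dataset."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed authentication header
        },
    )


request = build_create_request(
    "https://modelarts.example.com/v2/your-project-id/datasets",  # placeholder URL
    "your-token",
    {
        "dataset_name": "dataset-457f",
        "dataset_type": 0,
        "work_path": "/test-obs/classify/output/",
        "work_path_type": 0,
    },
)
```

Passing the request to urllib.request.urlopen sends it; that step is left out here so that the sketch runs without network access.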
    

Example Responses

Status code: 201

Created

{
  "dataset_id" : "WxCREuCkBSAlQr9xrde"
}
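A caller typically needs only the dataset_id from a successful response. The sketch below extracts it and otherwise raises an error built from error_code and error_msg; that non-201 responses carry those two fields is an assumption, and the function name is hypothetical.

```python
import json


def read_create_response(status, body_text):
    """Return the dataset ID for a 201 response, otherwise raise RuntimeError."""
    try:
        payload = json.loads(body_text) if body_text else {}
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    if status == 201 and payload.get("dataset_id"):
        return payload["dataset_id"]
    # Assumption: failed requests report error_code and error_msg in the body.
    code = payload.get("error_code", "unknown")
    message = payload.get("error_msg", "no message")
    raise RuntimeError(f"create dataset failed: HTTP {status}, {code}: {message}")
```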

Status Codes

| Status Code | Description |
|---|---|
| 201 | Created |
| 401 | Unauthorized |
| 403 | Forbidden |
| 404 | Not Found |

Error Codes

See Error Codes.