Creating a Dataset

To manage data using ModelArts, you need to create a dataset first. Then you can perform operations on the dataset, such as labeling data, importing data, and publishing the dataset.

Prerequisites

  • You have permission to access OBS. The data management function cannot be used without OBS access authorization. To grant it, go to the Settings page and complete access authorization using an agency.

  • You have created OBS buckets and folders for storing data, and the OBS buckets are in the same region as ModelArts.

  • You have uploaded the data you want to use to OBS.

Procedure

  1. Log in to the ModelArts management console. In the left navigation pane, choose Data Management > Datasets. The Datasets page is displayed.

  2. Click Create Dataset. On the Create Dataset page, create a dataset of the type that matches your data type and labeling requirements.

    1. Set the basic information: the name and description of the dataset.

      **Figure 1** Basic information about a dataset


    2. Select a labeling scene and type as required. For details about the types supported by ModelArts, see Dataset Types.

      **Figure 2** Selecting a labeling scene and type


    3. Set the parameters based on the dataset type. For details, see the parameter descriptions for each dataset type in the sections that follow this procedure.

    4. Click Create in the lower right corner of the page.

      After the dataset is created, the dataset management page is displayed. You can perform the following operations on the dataset: label data, publish dataset versions, manage dataset versions, modify the dataset, import data, and delete the dataset.

Images (Image Classification and Object Detection)

**Figure 3** Parameters of datasets for image classification and object detection


Table 1 Dataset parameters

Parameter

Description

Input Dataset Path

Select the OBS path to the input dataset.

Note

When you create a dataset, data in the OBS path is imported to the dataset. If you later modify the data in OBS, the dataset becomes inconsistent with OBS, and some data may become unavailable. To modify data in a dataset, use the functions described in Synchronizing Data Sources or Import Operation.

Output Dataset Path

Select the OBS path to the output dataset.

Note

The output dataset path cannot be the same as the input dataset path or a subdirectory of the input dataset path. Select an empty directory as the Output Dataset Path.

Label Set

  • Label name: Enter a label name. The label name can contain only letters, digits, underscores (_), and hyphens (-). The name contains 1 to 32 characters.

  • Add Label: Click Add Label to add more labels.

  • Setting a label color: This function is available only for datasets of the object detection type. Select a color from the color palette on the right of a label, or enter the hexadecimal color code to set the color.

  • Setting label attributes: For an object detection dataset, you can click the plus sign (+) on the right to add label attributes after setting a label color. Label attributes distinguish objects that share the same label. For example, yellow kittens and black kittens share the label cat and are distinguished by the label attribute color.
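
The label name rule above can be checked before you type labels into the console. The following is a minimal sketch, not part of ModelArts: the function names and the regular expression are illustrative, and it assumes that "letters" means ASCII letters, which the console may not strictly enforce.

```python
import re

# Assumed pattern derived from the rule in the text:
# letters, digits, underscores (_), and hyphens (-), 1 to 32 characters.
# Restricting "letters" to ASCII is an assumption of this sketch.
LABEL_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,32}")


def is_valid_label_name(name: str) -> bool:
    """Return True if the whole name satisfies the assumed label name rule."""
    return LABEL_NAME_RE.fullmatch(name) is not None


def find_invalid_labels(names):
    """Return the names that break the rule, in their original order."""
    return [name for name in names if not is_valid_label_name(name)]


if __name__ == "__main__":
    # "black cat" contains a space, so it is reported as invalid.
    print(find_invalid_labels(["cat", "black cat", "dog-1"]))
```

Running such a check on a planned label list catches spaces and overlong names before they are rejected on the Create Dataset page.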

Audio (Sound Classification, Speech Labeling, and Speech Paragraph Labeling)

**Figure 4** Parameters of datasets for sound classification, speech labeling, and speech paragraph labeling


Parameter

Description

Input Dataset Path

Select the OBS path to the input dataset.

Output Dataset Path

Select the OBS path to the output dataset.

Note

The output dataset path cannot be the same as the input dataset path or a subdirectory of the input dataset path. Select an empty directory as the Output Dataset Path.

Label Set (Sound Classification)

You need to set labels only for datasets of the sound classification type.

  • Label name: Enter a label name. The label name can contain only letters, digits, underscores (_), and hyphens (-). The name contains 1 to 32 characters.

  • Add Label: Click Add Label to add more labels.

Label Management (Speech Paragraph Labeling)

Only datasets for speech paragraph labeling support multiple labels.

  • Single Label

    A single label is used to label a piece of audio that has only one class.

    • Label Name: Enter a label name. The label name contains 1 to 32 characters. Only letters, digits, underscores (_), and hyphens (-) are allowed.

    • Label Color: Set the label color in the Label Color column. You can select a color from the color palette or enter a hexadecimal color code to set the color.

  • Multiple Labels

    Multiple labels are suitable for multi-dimensional labeling. For example, you can label a piece of audio as both noise and speech. For speech, you can label the audio with different speakers. You can click Add Label Class to add multiple label classes. A label class can contain multiple labels. A label class or label name contains 1 to 32 characters. Only letters, digits, underscores (_), and hyphens (-) are allowed.

    • Label Class: Set a label class.

    • Label Name: Enter a label name.

    • Add Label: Click Add Label to add more labels.

Speech Labeling (Speech Paragraph Labeling)

Only datasets for speech paragraph labeling support speech labeling. By default, speech labeling is disabled. If this function is enabled, you can label speech content.

Text (Text Classification, Named Entity Recognition, and Text Triplet)

**Figure 5** Parameters of datasets for text classification, named entity recognition, and text triplet


Table 2 Dataset parameters

Parameter

Description

Input Dataset Path

Select the OBS path to the input dataset.

Note

Labeled text classification data is recognized only during data import. When creating the dataset, select an empty OBS directory. After the dataset is created, import the labeled data into it. For details about the format of the data to be imported, see Specifications for Importing Data from an OBS Directory.

Output Dataset Path

Select the OBS path to the output dataset.

Note

The output dataset path cannot be the same as the input dataset path or a subdirectory of the input dataset path. Select an empty directory as the Output Dataset Path.

Label Set (for text classification and named entity recognition)

  • Label name: Enter a label name. The label name can contain only letters, digits, underscores (_), and hyphens (-). The name contains 1 to 32 characters.

  • Add Label: Click Add Label to add more labels.

  • Setting a label color: Select a color from the color palette or enter the hexadecimal color code to set the color.

Label Set (for text triplet)

For datasets of the text triplet type, you need to set entity labels and relationship labels.

  • Entity Label: You need to set the label name and label color. You can click the plus sign (+) on the right of the color area to add multiple labels.

  • Relationship Label: A relationship label defines a relationship between two entities. You need to set the source entity and target entity, so add at least two entity labels before adding a relationship label.
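
To make the dependency between entity labels and relationship labels concrete, here is a small sketch of how a planned label set for a text triplet dataset could be checked offline. The RelationshipLabel class and the check_triplet_labels function are hypothetical names introduced for illustration; they do not correspond to any ModelArts API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationshipLabel:
    """Hypothetical model of a relationship label: a name plus two entity labels."""
    name: str
    source_entity: str
    target_entity: str


def check_triplet_labels(entity_labels, relationship_labels):
    """Return a list of problems found in a planned text triplet label set."""
    problems = []
    entities = set(entity_labels)

    # The text requires at least two entity labels before a relationship is added.
    if relationship_labels and len(entities) < 2:
        problems.append(
            "at least two entity labels are required before adding a relationship label"
        )

    # Each relationship must point at entity labels that have been defined.
    for rel in relationship_labels:
        for role, entity in (("source", rel.source_entity), ("target", rel.target_entity)):
            if entity not in entities:
                problems.append(
                    f"{rel.name}: {role} entity '{entity}' is not a defined entity label"
                )
    return problems


if __name__ == "__main__":
    rels = [RelationshipLabel("works_for", "person", "company")]
    print(check_triplet_labels(["person", "company"], rels))  # no problems expected
    print(check_triplet_labels(["person"], rels))
```

The sketch does not require the source and target entities to differ, because the text above does not state such a rule.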

Video

**Figure 6** Parameters of datasets of the video type


Table 3 Dataset parameters

Parameter

Description

Input Dataset Path

Select the OBS path to the input dataset.

Output Dataset Path

Select the OBS path to the output dataset.

Note

The output dataset path cannot be the same as the input dataset path or a subdirectory of the input dataset path. It is a good practice to select an empty directory for Output Dataset Path.

Label Set

  • Label name: Enter a label name. The label name can contain only letters, digits, underscores (_), and hyphens (-). The name contains 1 to 32 characters.

  • Add Label: Click Add Label to add more labels.

  • Setting a label color: Select a color from the color palette or enter the hexadecimal color code to set the color.
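
If you keep label colors in a script or spreadsheet before entering them in the console, a quick format check can help. The sketch below assumes the common "#RRGGBB" form for a hexadecimal color code; the text above does not specify the exact format the console accepts, and the function name is illustrative.

```python
import re

# Assumption: a hexadecimal color code is "#" followed by six hex digits (#RRGGBB).
HEX_COLOR_RE = re.compile(r"#[0-9A-Fa-f]{6}")


def normalize_hex_color(code: str) -> str:
    """Validate a color code and return it trimmed and in lowercase."""
    code = code.strip()
    if HEX_COLOR_RE.fullmatch(code) is None:
        raise ValueError(f"not a #RRGGBB color code: {code!r}")
    return code.lower()


if __name__ == "__main__":
    print(normalize_hex_color(" #FF8800 "))
```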

Other (Free Format)

**Figure 7** Parameters of datasets of the free format type


Table 4 Dataset parameters

Parameter

Description

Input Dataset Path

Select the OBS path to the input dataset.

Output Dataset Path

Select the OBS path to the output dataset.

Note

The output dataset path cannot be the same as the input dataset path or a subdirectory of the input dataset path. It is a good practice to select an empty directory for Output Dataset Path.
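
The constraint on the output dataset path applies to every dataset type, so it can be worth checking before you fill in the form. The following is a minimal sketch with illustrative function names and example bucket paths. It compares path strings only: it does not contact OBS, so it cannot tell whether the output directory is actually empty.

```python
def _normalize(path: str) -> str:
    """Reduce an OBS path to 'bucket/folder/' form for prefix comparison."""
    path = path.strip()
    if path.startswith("obs://"):
        path = path[len("obs://"):]
    path = path.strip("/")
    if not path:
        raise ValueError("path must name a bucket or folder")
    # The trailing slash keeps 'data' from matching 'data-output'.
    return path + "/"


def is_valid_output_path(input_path: str, output_path: str) -> bool:
    """Return False if the output path equals the input path or lies inside it."""
    inp = _normalize(input_path)
    out = _normalize(output_path)
    return not out.startswith(inp)


if __name__ == "__main__":
    # Example paths are hypothetical.
    print(is_valid_output_path("/my-bucket/data/", "/my-bucket/output/"))
    print(is_valid_output_path("/my-bucket/data/", "/my-bucket/data/output/"))
```

Appending a trailing slash before the prefix comparison is the key design choice: without it, a sibling folder such as data-output would be wrongly treated as a subdirectory of data.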