Publishing a Dataset

ModelArts distinguishes data from the same source by version, where each version captures the labeling state at a particular time. This makes it easier to select a dataset version during subsequent model building and development. After labeling the data, you can publish the dataset to generate a new dataset version.

About Dataset Versions

  • For a newly created dataset (before publishing), there is no dataset version information. The dataset must be published before being used for model development or training.

  • By default, dataset versions are named V001, V002, and so on in ascending order. You can customize the version name during publishing.

  • You can set any version as the current version. The details of that version are then displayed on the dataset details page.

  • For each dataset version, you can obtain the dataset in manifest file format from the location given by Storage Path. This file can be used when you import data or filter hard examples.

  • The version of a table dataset cannot be changed.
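The default naming scheme described above can be illustrated with a short sketch. It assumes that default names consist of the letter V followed by a zero-padded three-digit counter, as in V001 and V002. The helper name next_default_version is hypothetical, and how ModelArts treats custom names when choosing the next default is not stated here, so the sketch simply ignores them.

```python
import re

# Assumption: default names look like "V" + three digits (V001, V002, ...).
_DEFAULT_NAME = re.compile(r"V(\d{3})")


def next_default_version(existing_names):
    """Return the next default-style name after the highest one in use.

    Hypothetical helper for illustration only; custom names that do not
    match the default pattern are ignored.
    """
    highest = 0
    for name in existing_names:
        match = _DEFAULT_NAME.fullmatch(name)
        if match:
            highest = max(highest, int(match.group(1)))
    return f"V{highest + 1:03d}"


print(next_default_version([]))
print(next_default_version(["V001", "V002", "release-1"]))
```

For an unpublished dataset the sketch yields V001, which matches the first default name mentioned above.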

Publishing a Dataset

  1. Log in to the ModelArts management console. In the left navigation pane, choose Data Management > Datasets. The Datasets page is displayed.

  2. In the dataset list, click Publish in the Operation column.

    Alternatively, you can click the dataset name to go to the Dashboard tab page of the dataset, and click Publish in the upper right corner.

  3. In the displayed dialog box, set the parameters and click OK.

    Table 1 Parameters for publishing a dataset

    | Parameter | Description |
    | --- | --- |
    | Version Name | By default, versions are named V001, V002, and so on in ascending order. You can also enter a custom name. Only letters, digits, hyphens (-), and underscores (_) are allowed. |
    | Format | Only table datasets support setting the version format. Available values are CSV and CarbonData. Note: If the exported CSV file contains any content starting with =, +, -, or @, ModelArts automatically prepends a tab character and escapes double quotation marks (") for security purposes. |
    | Splitting | Only image classification, object detection, text classification, and sound classification datasets support data splitting. This function is disabled by default. After enabling it, set Training Set Ratio to a value from 0 to 1. The validation set ratio is then determined automatically, because the two ratios sum to 1. The training set ratio is the proportion of samples used for model training, and the validation set ratio is the proportion used for model validation. These ratios affect the performance of training templates. |
    | Description | Description of the current dataset version. |
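The note on CSV export describes a common defense against spreadsheet formula injection. The following is a minimal sketch of that idea, not ModelArts code. It assumes that "adds the Tab setting" means prefixing the value with a tab character and that double quotation marks are escaped by doubling them, which is the usual CSV convention. The names sanitize_csv_field and write_rows are hypothetical.

```python
import csv
import io

# Characters that spreadsheet applications may treat as the start of a formula.
RISKY_PREFIXES = ("=", "+", "-", "@")


def sanitize_csv_field(value):
    """Prefix risky values with a tab so they are read as plain text."""
    if value.startswith(RISKY_PREFIXES):
        return "\t" + value
    return value


def write_rows(rows):
    """Return rows as CSV text; csv.writer doubles embedded double quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in rows:
        writer.writerow([sanitize_csv_field(str(cell)) for cell in row])
    return buffer.getvalue()


print(repr(write_rows([["=SUM(A1:A2)", 'say "hi"']])))
```

Note that this simple prefix check also affects legitimate values such as negative numbers, so a consumer of the exported file may need to strip the leading tab before parsing numeric columns.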

    **Figure 1** Publishing a dataset


    After the version is published, you can go to the Version Manager tab page to view its details. By default, the system sets the latest version as the current version.
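The constraints on Version Name and Training Set Ratio can be checked before you fill in the dialog box. The sketch below is illustrative only: check_publish_params is a hypothetical helper, not a ModelArts API. It assumes that "letters" means ASCII letters, and it accepts the endpoints 0 and 1 for the ratio because the text only says the value ranges from 0 to 1.

```python
import re

# Allowed characters per the Version Name rule: letters, digits, hyphens, underscores.
# Assumption: "letters" means ASCII letters.
_VERSION_NAME = re.compile(r"[A-Za-z0-9_-]+")


def check_publish_params(version_name, train_ratio=None):
    """Validate inputs and return (version_name, train_ratio, validation_ratio).

    train_ratio=None models splitting being disabled, which is the default.
    """
    if not _VERSION_NAME.fullmatch(version_name):
        raise ValueError("version name may contain only letters, digits, '-' and '_'")
    if train_ratio is None:
        return version_name, None, None
    if not 0 <= train_ratio <= 1:
        raise ValueError("training set ratio must be between 0 and 1")
    # The two ratios sum to 1; rounding hides floating-point noise in the subtraction.
    validation_ratio = round(1 - train_ratio, 10)
    return version_name, train_ratio, validation_ratio


print(check_publish_params("V001"))
print(check_publish_params("exp_2-a", 0.8))
```

As a worked example, a training set ratio of 0.8 implies a validation set ratio of 0.2, so a dataset of 1,000 labeled samples would be divided into roughly 800 training samples and 200 validation samples.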