Data Warehouse Service


Example: Concurrently Importing Data from OBS

  1. In DWS, create the target table tpcds.customer_address to store the imported data.

    CREATE TABLE tpcds.customer_address
    (
        ca_address_sk             integer               not null,
        ca_address_id             char(16)              not null,
        ca_street_number          char(10)                      ,
        ca_street_name            varchar(60)                   ,
        ca_street_type            char(15)                      ,
        ca_suite_number           char(10)                      ,
        ca_city                   varchar(60)                   ,
        ca_county                 varchar(30)                   ,
        ca_state                  char(2)                       ,
        ca_zip                    char(10)                      ,
        ca_country                varchar(20)                   ,
        ca_gmt_offset             decimal(5,2)                  ,
        ca_location_type          char(20)
    )
    WITH (orientation = column, compression = middle)
    DISTRIBUTE BY hash (ca_address_sk);

  2. Log in to OBS through the management console. Create two buckets, input_data1 and input_data2, to store the data files, and create a data folder in each bucket: /input_data1/data and /input_data2/data.
  3. Split the source data files evenly into two parts and upload them to the /input_data1/data/ and /input_data2/data/ folders on the OBS server, respectively.
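The bucket layout in step 2 and the file split in step 3 can be sketched from the command line. This is a minimal illustration: the input file is a generated stand-in for real source data, and the obsutil upload commands (Huawei's OBS CLI) are shown as comments because they assume a tool already configured with your AK/SK; bucket and folder names come from this example.

```shell
# Generate a small stand-in source file (1000 lines) for illustration.
seq 1 1000 > customer_address.csv

# Split it into two equal halves by lines (GNU split; no line is broken
# across parts). Produces part_00 and part_01.
split -n l/2 -d customer_address.csv part_

# Upload each half to its own OBS data folder. obsutil is assumed to be
# installed and configured with your AK/SK:
#   obsutil cp part_00 obs://input_data1/data/
#   obsutil cp part_01 obs://input_data2/data/
```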
  4. On DWS, create a foreign table tpcds.customer_address_ext to receive data from the data server.

    Assume that the OBS data server is properly connected to the cluster network, its IP address is xxx.xxx.x.xx, and the source data files are in CSV format. Plan the parallel import as described in this example.

    Import information is set as follows:

    • Because the source data files are stored in the /input_data1/data/ and /input_data2/data/ folders, set location to obs://input_data1/data/ | obs://input_data2/data/.

    The data format parameters are set according to the format of the source data files:

    • Set format to CSV.
    • Set encoding to UTF-8.
    • Set delimiter to E'\x08'.
    • Set quote to E'\x1b'.
    • Set encrypt to 'on'.
    • Set access_key to the AK you have obtained.
    • Set secret_access_key to the SK you have obtained.
      NOTE:

      The values of access_key and secret_access_key are placeholders. Replace them with the AK and SK you actually obtained.

    The import error tolerance settings are as follows:

    • Set PER NODE REJECT LIMIT to 'unlimited', meaning that any number of data format errors detected during the import is tolerated.
    • Specify LOG INTO customer_address_err so that data format errors detected during the import are recorded in the customer_address_err table.

    Based on the preceding settings, the foreign table is created using the following statement:

    CREATE FOREIGN TABLE tpcds.customer_address_ext
    (
        ca_address_sk             integer                       ,
        ca_address_id             char(16)                      ,
        ca_street_number          char(10)                      ,
        ca_street_name            varchar(60)                   ,
        ca_street_type            char(15)                      ,
        ca_suite_number           char(10)                      ,
        ca_city                   varchar(60)                   ,
        ca_county                 varchar(30)                   ,
        ca_state                  char(2)                       ,
        ca_zip                    char(10)                      ,
        ca_country                varchar(20)                   ,
        ca_gmt_offset             decimal(5,2)                  ,
        ca_location_type          char(20)
    )
    SERVER gsmpp_server
    OPTIONS (
        location 'obs://input_data1/data/ | obs://input_data2/data/',
        format 'CSV',
        encoding 'utf8',
        delimiter E'\x08',
        quote E'\x1b',
        encrypt 'on',
        access_key 'access_key_value_to_be_replaced',
        secret_access_key 'secret_access_key_value_to_be_replaced'
    )
    LOG INTO customer_address_err PER NODE REJECT LIMIT 'unlimited';

  5. In DWS, import the data to the tpcds.customer_address table through the tpcds.customer_address_ext foreign table.

    INSERT INTO tpcds.customer_address SELECT * FROM tpcds.customer_address_ext;
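    As a quick sanity check (not part of the original procedure), you can count the rows that reached the target table:

    SELECT count(*) FROM tpcds.customer_address;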

  6. Query data import errors in the customer_address_err table and rectify the errors. For details, see Handling Import Errors.

    SELECT * FROM customer_address_err;
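    To locate the offending records, the error table can be filtered by source file and row number. The columns referenced below (filename, rownum, rawrecord, detail) follow the typical DWS error-table layout; verify them on your cluster, for example with \d customer_address_err in gsql:

    SELECT filename, rownum, rawrecord, detail
    FROM customer_address_err
    ORDER BY filename, rownum;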

  7. After the import is complete, run ANALYZE to collect table statistics for the query planner.

    ANALYZE tpcds.customer_address;
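    To confirm that statistics were collected, you can check the planner's row estimate in the pg_class catalog (reltuples is refreshed by ANALYZE):

    SELECT relname, reltuples
    FROM pg_class
    WHERE relname = 'customer_address';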