Example: Using Loader to Import Data from OBS to HDFS

Scenario

If you need to import a large volume of data from outside the cluster into the cluster, you can use Loader to import it from OBS to HDFS.

Prerequisites

  • You have prepared service data.

  • You have created an analysis cluster.

Procedure

  1. Upload service data to your OBS file system.

  2. Obtain the AK/SK information, then create an OBS link and an HDFS link.

    For details, see Loader Link Configuration.

  3. Access the Loader page.

    If Kerberos authentication is enabled in the analysis cluster, follow the instructions in Accessing the Hue Web UI.

  4. Click New Job.

  5. In Information, set parameters.

    1. In Name, enter a job name, for example, obs2hdfs.

    2. In From link, select the OBS link you created.

    3. In To link, select the HDFS link you created.

  6. In From, set source link parameters.

    1. In Bucket Name, enter the name of the OBS file system.

    2. In Input directory or file, enter the path of the service data in the file system.

      If it is a single file, enter a complete path containing the file name. If it is a directory, enter the complete path of the directory.

    3. In File format, enter the type of the service data file.

    For details, see obs-connector.

  7. In To, set destination link parameters.

    1. In Output directory, enter the directory for storing service data in HDFS.

      If Kerberos authentication is enabled in the cluster, the current user accessing Loader needs to have the permission to write data to the directory.

    2. In File format, enter the type of the service data file.

      The type must match the file format set in step 6.3.

    3. In Compression codec, select a compression algorithm. If you do not want to compress the data, select NONE.

    4. In Overwrite, select True.

    5. Click Show Senior Parameter and set Line Separator.

    6. Set Field Separator.

    For details, see hdfs-connector.

  8. In Task Config, set job running parameters.

    1. In Extractors, enter the number of Map tasks.

    2. In Loaders, enter the number of Reduce tasks.

      If the destination link is an HDFS link, Loaders is hidden.

    3. In Max error records in single split, enter an error record threshold.

    4. In Dirty data directory, enter a directory for saving dirty data, for example, /user/sqoop/obs2hdfs-dd.

  9. Click Save and execute.

    On the Manage jobs page, view the job running result. You can click Refresh to obtain the latest job status.
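
As a rough aid, the settings entered in steps 5 to 7 can be written down and checked before the job is saved. The Python sketch below is only an illustration and does not call any Loader API. The dictionary keys mirror the field labels on the Loader page, and the function names, bucket name, paths, and file format value are all hypothetical placeholders. It checks that the required fields are filled in and that the file format is the same on both sides, then builds an hdfs dfs -ls command that could be run on a cluster client to list the output directory after the job finishes. The check for an absolute output path is an assumption made for this example.

```python
# Hypothetical pre-flight check for the job settings described above.
# The keys mirror field labels on the Loader page; they are NOT Loader API names.

def check_job_settings(job):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    src, dst = job["from"], job["to"]

    if not job.get("name"):
        problems.append("Name is empty")
    if not src.get("bucket_name"):
        problems.append("Bucket Name is empty")
    if not src.get("input_path"):
        problems.append("Input directory or file is empty")
    # The destination file format must correspond to the source file format.
    if src.get("file_format") != dst.get("file_format"):
        problems.append("File format differs between From and To")
    # Assumption for this sketch: the output directory is an absolute HDFS path.
    if not dst.get("output_directory", "").startswith("/"):
        problems.append("Output directory is not an absolute HDFS path")
    return problems


def hdfs_ls_command(directory):
    """Build the `hdfs dfs -ls` command that lists the given HDFS directory."""
    return ["hdfs", "dfs", "-ls", directory]


# Example settings; every value except the job name is a placeholder.
job = {
    "name": "obs2hdfs",
    "from": {
        "bucket_name": "example-bucket",
        "input_path": "/input/data.csv",
        "file_format": "CSV_FILE",
    },
    "to": {
        "output_directory": "/user/example/obs2hdfs",
        "file_format": "CSV_FILE",
        "compression_codec": "NONE",
        "overwrite": True,
    },
}

print(check_job_settings(job))
print(" ".join(hdfs_ls_command(job["to"]["output_directory"])))
```

If the list of problems is empty, the settings are at least internally consistent. Whether the job succeeds still depends on the links, permissions, and data in the cluster.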