Using Hadoop from Scratch

This section describes how to use Hadoop to submit a wordcount job. Wordcount is a typical Hadoop job that counts the occurrences of each word in input text.

Procedure

  1. Prepare the wordcount program.

    The open-source Hadoop distribution includes an example program package that contains the wordcount program. You can download Hadoop releases at https://dist.apache.org/repos/dist/release/hadoop/common/.

    For example, select Hadoop version hadoop-2.7.x. Download hadoop-2.7.x.tar.gz, decompress it, and obtain hadoop-mapreduce-examples-2.7.x.jar from the hadoop-2.7.x/share/hadoop/mapreduce directory. This example program package contains the wordcount program. For reference, a sketch of the wordcount logic appears after the note below.

    NOTE:

    hadoop-2.7.x indicates the Hadoop version.
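
    The wordcount program implements the classic MapReduce word count: the mapper emits a (word, 1) pair for every whitespace-separated token, and the reducer sums the counts per word. The following is a minimal sketch of such a job using the standard Hadoop MapReduce API. It mirrors the well-known WordCount example but is an illustration, not the exact source shipped in the examples package.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every whitespace-separated token.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sums the counts collected for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner cuts shuffle volume
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }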

  2. Prepare data files.

    There is no format requirement for data files. Prepare one or more TXT files. The following is an example of a TXT file:

    qwsdfhoedfrffrofhuncckgktpmhutopmma
    jjpsffjfjorgjgtyiuyjmhombmbogohoyhm
    jhheyeombdhuaqqiquyebchdhmamdhdemmj
    doeyhjwedcrfvtgbmojiyhhqssddddddfkf
    kjhhjkehdeiyrudjhfhfhffooqweopuyyyy

  3. Upload data to OBS.

    1. Log in to the OBS console.
    2. Click Create Bucket to create a bucket and name it. The bucket name must be globally unique; otherwise, the bucket cannot be created. This section uses the bucket name wordcount as an example.
    3. In the wordcount bucket, click Create Folder to create the program, input, output, and log folders, as shown in Figure 1.
      Figure 1 Folder list
      • program: stores user programs.
      • input: stores user data files.
      • output: stores job output files.
      • log: stores job output log files.
    4. Go to the program folder, select the program package downloaded in Step 1, and click Upload, as shown in Figure 2.
      Figure 2 Program list
    5. Go to the input folder and upload the data file prepared in Step 2, as shown in Figure 3.
      Figure 3 Data file list

  4. Log in to the MRS management console. In the navigation tree on the left, choose Clusters > Active Clusters, and click the cluster named mrs_20160907, which was created in Creating a Cluster.
  5. Submit a wordcount job.

    1. Select Job Management. On the Job tab page, click Create to go to the Create Job page, as shown in Figure 4.

      Jobs can be submitted only when the mrs_20160907 cluster is in the running state.

      Figure 4 Creating a MapReduce job

      Table 1 describes parameters for job configuration. The following is a job configuration example:

      • Type: Select MapReduce.
      • Name: For example, mr_01.
      • Program Path:

        Set this to the OBS path of the program package, that is, the bucket and program folder created in 3.c and the JAR uploaded in 3.d. For example, s3a://wordcount/program/hadoop-mapreduce-examples-2.7.x.jar.

      • Parameter:

        Indicates the program to be executed, for example, wordcount. For the examples package this is a program keyword rather than a full class name; see the sketch before Table 1.

      • Import from:

        Set this to the OBS path of the input data files, that is, the bucket and input folder created in 3.c. For example, s3a://wordcount/input.

      • Export to:

        Set this to the OBS path for the job output files, that is, the bucket and output folder created in 3.c. For example, s3a://wordcount/output.

      • Log path:

        Set this to the OBS path for the job log files, that is, the bucket and log folder created in 3.c. For example, s3a://wordcount/log.

      A job is executed immediately after it is created successfully.
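
      As background on the Parameter value: the examples JAR is a multi-program package whose driver maps a program keyword such as wordcount to the corresponding class and forwards the remaining arguments (here, the input and output paths) to it. The following sketch shows that dispatch pattern using Hadoop's ProgramDriver utility; it is a simplified illustration, not the exact source of the package's driver.

      import org.apache.hadoop.util.ProgramDriver;

      // Simplified sketch of a multi-program driver like the one in the
      // examples JAR. The keyword registered here is what you enter in the
      // job's Parameter field; the remaining arguments (input and output
      // paths) are forwarded to the selected program's main method.
      public class ExamplesDriverSketch {
        public static void main(String[] argv) {
          int exitCode = -1;
          ProgramDriver pgd = new ProgramDriver();
          try {
            pgd.addClass("wordcount", WordCount.class,
                "A map/reduce program that counts the words in the input files.");
            // e.g. argv = {"wordcount", "<input path>", "<output path>"}
            exitCode = pgd.run(argv);
          } catch (Throwable e) {
            e.printStackTrace();
          }
          System.exit(exitCode);
        }
      }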

      Table 1 Job configuration information

      Type

      Job type. Possible types include:

      • MapReduce
      • Spark
      • Spark Script
      • Hive Script

      NOTE:

      To add Spark or Hive jobs, you must select the Spark and Hive components when creating the cluster, and the cluster must be in the running state. Spark Script jobs support Spark SQL only, while Spark jobs support both Spark Core and Spark SQL.

      Name

      Job name. This parameter consists of 1 to 64 characters, including letters, digits, hyphens (-), and underscores (_). It cannot be empty.

      NOTE:

      Identical job names are allowed but not recommended.

      Program Path

      Address of the JAR file of the program that executes the job.

      NOTE:

      When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.

      This parameter cannot be empty and must meet the following requirements:

      • A maximum of 1023 characters are allowed; the special characters *?<">|\ are not allowed. The address cannot be empty or consist only of spaces.
      • The path depends on the file system:
        • OBS: The path must start with s3a://, for example, s3a://wordcount/program/hadoop-mapreduce-examples-2.7.x.jar.
        • HDFS: The path must start with /user.
      • A Spark Script program must end with .sql, while a MapReduce or Spark program must end with .jar. The .sql and .jar extensions are case-insensitive.

      Parameter

      Key parameters passed to the program when the job is executed. The parameters are interpreted by the program itself; MRS only passes them through. Separate multiple parameters with spaces.

      Format: package name.class name

      A maximum of 2047 characters are allowed; the special characters ;|&>',<$ are not allowed. This parameter can be empty.

      NOTE:

      When a parameter contains sensitive information, such as a login password, you can add an at sign (@) before the parameter to encrypt its value and prevent the sensitive information from being persisted in plaintext. When you view job information on the MRS management console, the sensitive information is displayed as asterisks (*).

      Example: username=admin @password=admin_123

      Import from

      Address of the input data.

      NOTE:

      When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.

      The path depends on the file system:
      • OBS: The path must start with s3a://.
      • HDFS: The path must start with /user.

      A maximum of 1023 characters are allowed; the special characters *?<">|\ are not allowed. This parameter can be empty.

      Export to

      Address of the output data.

      NOTE:

      When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.

      The path depends on the file system:
      • OBS: The path must start with s3a://.
      • HDFS: The path must start with /user.

      A maximum of 1023 characters are allowed; the special characters *?<">|\ are not allowed. This parameter can be empty.

      Log path

      Address for storing job logs that record the job running status.

      NOTE:

      When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.

      The path depends on the file system:
      • OBS: The path must start with s3a://.
      • HDFS: The path must start with /user.

      A maximum of 1023 characters are allowed; the special characters *?<">|\ are not allowed. This parameter can be empty.

  6. View the job execution results.

    1. Go to Job Management. On the Job tab page, check whether the job is complete.

      Job execution takes a while. After the job is complete, refresh the job list, as shown in Figure 5.

      Figure 5 Job list

      A successful or failed job cannot be executed again, but you can add or copy the job; after setting the job parameters, you can submit it again.

    2. Log in to the OBS console. Go to the OBS directory and query job output information.

      In the wordcount > output directory of OBS, you can query and download the job output files, as shown in Figure 6.

      Figure 6 Output file list
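
      To interpret the output: each reducer writes a file such as part-r-00000 containing tab-separated word/count pairs, sorted by word. Assuming only the five sample lines from Step 2 were uploaded, each line is a single whitespace-delimited token that occurs once, so the output would look like this (illustrative):

      doeyhjwedcrfvtgbmojiyhhqssddddddfkf	1
      jhheyeombdhuaqqiquyebchdhmamdhdemmj	1
      jjpsffjfjorgjgtyiuyjmhombmbogohoyhm	1
      kjhhjkehdeiyrudjhfhfhffooqweopuyyyy	1
      qwsdfhoedfrffrofhuncckgktpmhutopmma	1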
    3. Log in to the OBS console. Go to the OBS directory and check the detailed job execution results.

      In the wordcount > log directory of OBS, you can query and download the job execution logs by job ID, as shown in Figure 7.

      Figure 7 Log list

  7. Terminate a cluster.

    For details, see Terminating a Cluster in the User Guide.