Using Spark SQL from Scratch

To process structured data, Spark provides Spark SQL, a module that supports an SQL-like query language.

In this section, you create a table named src_data in the mrs_20160907 cluster, write one data entry into each row of the table, query the data in src_data with SQL statements, and then delete the table.

Prerequisites

You have obtained the AK/SK for writing data from the OBS data source to the Spark SQL table. The method for obtaining the AK/SK is as follows:
  1. Log in to the management console.
  2. Click the username and choose My Credential from the drop-down list.
  3. Click Access Keys.
  4. Click Add Access Key to switch to the Add Access Key page.
  5. Enter the login password and the verification code received by email, and click OK to download the access key. Keep the access key secure.

Procedure

  1. Prepare data sources for Spark SQL analysis.

    The following is an example of a text file:

    abcd3ghji
    efgh658ko
    1234jjyu9
    7h8kodfg1
    kk99icxz3

  2. Upload data to OBS.

    1. Log in to the OBS management console.
    2. Click Create Bucket to create a bucket and name it. The bucket name must be globally unique; otherwise, the bucket cannot be created. The name sparksql is used as an example here.
    3. In the sparksql bucket, click Create Folder to create the input folder.
    4. Go to the input folder, select a local text file, and click Upload, as shown in Figure 1.
      Figure 1 Input file list

  3. Import the text file in OBS to HDFS.

    1. Log in to the MRS management console. In the navigation tree on the left, choose Clusters > Active Clusters and click the cluster named mrs_20160907. The mrs_20160907 cluster was created in section Creating a Cluster.
    2. Click the File Management tab.
    3. Click Create Folder and create the userinput folder.
    4. Go to the userinput folder and click Import Data.
    5. Select the OBS and HDFS paths and click OK.

      OBS path: s3a://sparksql/input/sparksql-test.txt

      HDFS path: /user/userinput

  4. Submit the Spark SQL statement.

    1. On the Job Management tab page, select Spark SQL. The Spark SQL job page is displayed.

      Jobs can be submitted only when the mrs_20160907 cluster is in the running state.

    2. Enter the Spark SQL statement to create a table.

      When entering Spark SQL statements, ensure that each statement contains fewer than 10,000 characters.

      The syntax is as follows:

      CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name [(col_name data_type [COMMENT col_comment], ...)] [COMMENT table_comment] [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION hdfs_path];

      You can use either of the following methods to create a table:

      • Method 1: Create table src_data and write data in every row.

        If the data source is stored in the /user/userinput folder of HDFS: create external table src_data(line string) row format delimited fields terminated by '\n' stored as textfile location '/user/userinput';

        If the data source is stored in the /sparksql/input folder of OBS: create external table src_data(line string) row format delimited fields terminated by '\n' stored as textfile location 's3a://AK:SK@sparksql/input';

        For the method of obtaining the AK/SK, see the description in Prerequisites.

      • Method 2: Create table src_data1 and load data to the src_data1 table in batches.

        create table src_data1 (line string) row format delimited fields terminated by ',' ;

        load data inpath '/user/userinput/sparksql-test.txt' into table src_data1;

      NOTE:

      When method 2 is used, data cannot be loaded directly from OBS into the created table.
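The two CREATE TABLE statements differ in their field delimiter: method 1 treats each line of the text file as a single-column row, while method 2 splits each line on commas. As a rough local illustration (in plain Python, not Spark SQL; the comma-separated record below is hypothetical), the delimiters behave as follows:

```python
# Illustrate, outside Spark, how the two row formats split one record.
record = "abcd3ghji"        # one line of the sample file from step 1

# Method 1: fields terminated by '\n' -- the whole line is one field.
fields_method1 = record.split("\n")
print(fields_method1)       # ['abcd3ghji']

# Method 2: fields terminated by ',' -- commas split the line into
# several columns. This comma-separated record is a made-up example.
csv_record = "abcd,3,ghji"
fields_method2 = csv_record.split(",")
print(fields_method2)       # ['abcd', '3', 'ghji']
```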

    3. Enter the Spark SQL statement to query a table.

      The syntax is as follows:

      SELECT col_name FROM table_name;

      To query data in the src_data table, for example, enter the following statement:

      select * from src_data;
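If the earlier statements succeed, the wildcard query returns all five sample rows. As a rough sketch of the query semantics (using Python's built-in sqlite3 rather than Spark SQL, so this only approximates a local run), the SELECT statements behave like this:

```python
import sqlite3

# Emulate the src_data table locally with SQLite (not Spark SQL);
# each line of the sample file becomes one row of a single string column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_data (line TEXT)")
rows = ["abcd3ghji", "efgh658ko", "1234jjyu9", "7h8kodfg1", "kk99icxz3"]
conn.executemany("INSERT INTO src_data VALUES (?)", [(r,) for r in rows])

# The wildcard query from the step above returns every row.
print(conn.execute("SELECT * FROM src_data").fetchall())

# A WHERE clause narrows the result, e.g. rows containing the digit 9.
print(conn.execute("SELECT line FROM src_data WHERE line LIKE '%9%'").fetchall())
```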

    4. Enter the Spark SQL statement to delete a table.

      The syntax is as follows:

      DROP TABLE [IF EXISTS] table_name;

      For example:

      drop table src_data;

    5. Click Check to check whether the statements are correct.
    6. Click Submit.

      After submitting Spark SQL statements, you can check whether the execution is successful in Last Execution Result and view detailed execution results in Last Query Result Set.

  5. Terminate a cluster.

    For details, see Terminating a Cluster in the User Guide.