Getting Started with Spark SQL

Spark provides the Spark SQL language that is similar to SQL to perform operations on structured data. This section describes how to use Spark SQL from scratch. Create a table named src_data, write a data record in each row of the table, and store the data in the mrs_20160907 cluster. Then use SQL statements to query data in the table, and delete the table at last.

Prerequisites

You have obtained the AK/SK for writing data from an OBS data source to a Spark SQL table. To obtain it, perform as follows:

  1. Log in to the management console.

  2. Click the username and select My Credentials from the drop-down list.

  3. On the displayed My Credentials page, click Access Keys.

  4. Click Create Access Key to switch to the Create Access Key dialog box.

  5. Enter the password and verification code in the email, and click OK to download the access key. Keep the access key secure.

Procedure

  1. Prepare data sources for Spark SQL analysis.

    The sample text file is as follows:

    abcd3ghji
    efgh658ko
    1234jjyu9
    7h8kodfg1
    kk99icxz3
    
  2. Upload data to OBS.

    1. Log in to OBS Console.

    2. Choose Parallel File System > Create Parallel File System to create a file system named sparksql.

      sparksql is only an example. The file system name must be globally unique. Otherwise, the parallel file system fails to be created.

    3. Click the name of the sparksql file system and click Files.

    4. Click Create Folder to create the input folder.

    5. Go to the input folder, choose Upload File > add file, select the local TXT file, and click Upload.

  3. Log in to the MRS console. In the left navigation pane, choose Clusters > Active Clusters, and click a cluster name.

  4. Import the text file from OBS to HDFS.

    1. Click the Files tab.

    2. On the HDFS File List tab page, click Create Folder, and create a folder named userinput.

    3. Go to the userinput folder, and click Import Data.

    4. Select the OBS and HDFS paths and click OK.

      OBS Path: obs://sparksql/input/sparksql-test.txt

      HDFS Path: /user/userinput

  5. Submit the SQL statement.

    1. On the MRS console, select Job Management. For details about how to submit the statement, see Running a SparkSubmit or Spark Job.

      A job can be submitted only when the mrs_20160907 cluster is in the Running state.

    2. Enter the Spark SQL statement for table creation.

      When entering Spark SQL statements, ensure that the statement characters are not more than 10,000.

      Syntax:

      CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name [(col_name data_type [COMMENT col_comment], ...)] [COMMENT table_comment] [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION hdfs_path];

      You can use the following two methods to create a table example:

      • Method 1: Create table src_data and write data in every row.

        • The data source is stored in the /user/userinput folder of HDFS: create external table src_data***(line string) row format delimited fields terminated by '\\n' stored as textfile location* '/user/userinput';

        • The data source is stored in the /sparksql/input folder of OBS: create external table src_data(line string) row format delimited fields terminated by '\\n' stored as textfile location 'obs://AK:SK@sparksql/input';

          For details about how to obtain the AK/SK, see Prerequisites.

      • Method 2: Create table src_data1 and load data to the table in batches.

        create table src_data1 (line string) row format delimited fields terminated by ',' ;

        load data inpath '/user/userinput/sparksql-test.txt' into table src_data1;

      Note

      When method 2 is used, the data from OBS cannot be loaded to the created tables directly.

    3. Enter the Spark SQL statement for table query.

      Syntax:

      SELECT col_name FROM table_name;

      Example of querying all data in the src_data table:

      select * from src_data;

    4. Enter the Spark SQL statement for table deletion.

      Syntax:

      DROP TABLE [IF EXISTS] table_name;

      Example of deleting the src_data table:

      drop table src_data;

    5. Click Check to check the statement correctness.

    6. Click OK.

      After the Spark SQL statements are submitted, the statement execution results are displayed in the result column.

  6. Delete the cluster.