Spark Output

Overview

The Spark Output operator exports existing fields to specified columns of a SparkSQL table.

Input and Output

  • Input: fields to be exported

  • Output: SparkSQL table

Parameter Description

Table 1 Operator parameter description

  • Spark file storage format

    Description: Storage format of the SparkSQL table file. CSV, ORC, RC, and PARQUET are supported at present.

    Note:
      • PARQUET is a column-based storage format. In this format, the output field names of Loader must be the same as the field names in the SparkSQL table.
      • For Hive versions later than 1.2.0, field names, instead of field numbers, are used to parse ORC files. Therefore, the output field names of Loader must be the same as those in the SparkSQL table.

    Type: enum | Mandatory: Yes | Default value: CSV

  • Spark file compression format

    Description: Compression format of the SparkSQL table file. Select a format from the drop-down list. If you select NONE or do not set this parameter, data is not compressed.

    Type: enum | Mandatory: Yes | Default value: NONE

  • Spark ORC file version

    Description: Version of the ORC file, used when the storage format of the SparkSQL table file is ORC.

    Type: enum | Mandatory: Yes | Default value: 0.12

  • Output delimiter

    Description: Delimiter inserted between the output field values.

    Type: string | Mandatory: Yes | Default value: None

  • Output fields

    Description: Information about the output fields:
      • position: position of each output field.
      • field name: name of each output field.
      • type: field type. If type is set to DATE, TIME, or TIMESTAMP, you must specify a time format, for example, yyyyMMdd HH:mm:ss. For other types, the time format setting does not take effect.
      • decimal format: scale and precision of a DECIMAL field.
      • length: field value length. If the actual field value is longer than the configured length, the value is truncated to that length. If it is shorter, a CHAR field is padded with spaces to the configured length, while a VARCHAR field is not padded.
      • partition key: indicates whether the column is a partition column. You can specify zero or more partition columns. If multiple partition columns are configured, they are combined in the configuration sequence.

    Type: map | Mandatory: Yes | Default value: None
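The length rule for output fields (truncation, and space padding for CHAR but not VARCHAR) can be sketched in plain Python. This is an illustrative helper with assumed semantics, not Loader source code:

```python
def apply_length(value: str, length: int, ftype: str) -> str:
    """Illustrative sketch (assumed semantics) of the 'length' rule for output fields."""
    if len(value) > length:
        # Overly long values are cut to the configured length.
        return value[:length]
    if ftype == "CHAR":
        # CHAR values shorter than the configured length are padded with spaces.
        return value.ljust(length)
    # VARCHAR values are left as-is, with no padding.
    return value
```

For example, with a configured length of 4, `"abcdef"` becomes `"abcd"` for either type, while `"ab"` becomes `"ab  "` for CHAR and stays `"ab"` for VARCHAR.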

Data Processing Rule

  • The field values are exported to the SparkSQL table.

  • If one or more columns are specified as partition columns, the Partition Handlers feature is displayed on the To page in Step 4 of the job configuration. Partition Handlers specifies the number of handlers for processing data partitioning.

  • If no column is designated as a partition column, the input data does not need to be partitioned, and Partition Handlers is hidden by default.
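When multiple partition columns are configured, they are combined in the configuration sequence; under Hive-style partitioning this corresponds to nested partition directories. A minimal sketch with a hypothetical helper (assumed semantics, not Loader code):

```python
def partition_subpath(partition_cols, row):
    """Combine the configured partition columns, in configuration order,
    into a Hive-style partition directory path (assumed layout)."""
    return "/".join(f"{col}={row[col]}" for col in partition_cols)

# Hypothetical row with two configured partition columns, year then city:
path = partition_subpath(["year", "city"], {"year": "2024", "city": "sz", "val": "1"})
```

Here `path` is `"year=2024/city=sz"`; reversing the configuration sequence would instead yield `"city=sz/year=2024"`.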

Example

Use the CSV File Input operator to generate two fields A and B.

The following figure shows the source file.

image1

Configure the Spark Output operator to export A and B to the SparkSQL table.

image2
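Assuming the storage format is CSV and the output delimiter is a comma, the exported rows for fields A and B would conceptually look like the following plain-Python sketch (hypothetical sample values, not taken from the figures):

```python
# Hypothetical rows produced by the CSV File Input operator: fields A and B.
rows = [("1", "zhangsan"), ("2", "lisi")]

delimiter = ","  # assumed value of the Output delimiter parameter

# Each exported line joins the field values with the configured delimiter.
lines = [delimiter.join(row) for row in rows]
```

With these sample values, `lines` holds `["1,zhangsan", "2,lisi"]`.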