Using Loader to Export Data¶
Scenario¶
This task enables you to export data from MRS to external data sources.
Generally, users can manually manage data import and export jobs on the Loader UI. To use shell scripts to update and run Loader jobs, you must configure the installed Loader client.
Prerequisites¶
You have obtained the service user name and password for creating a Loader job.
You have had the permission to access the HDFS directories, HBase tables, and data involved in job execution.
You have obtained the user name and password used by an external data source (SFTP server or relational database).
No disk space alarm is reported, and the available disk space is sufficient for importing and exporting data.
When using Loader to export data from HDFS/OBS, the input paths and input path subdirectories of HDFS/OBS data source and the name of the files in these directories do not contain any of the following special characters: /\"':;,.
If the job requires the Yarn queue function, the user must be authorized with related Yarn queue permission.
The user who configures a task must obtain execution permission on the task and obtain usage permission on the related connection of the task.
Procedure¶
Check whether data is exported from Loader to a relational database for the first time.
Modify the permission on the JAR package of the RDS driver.
Obtain the JAR package of the relational database driver and save it to the following directory on the active and standby Loader nodes: ${BIGDATA_HOME}/FusionInsight_Porter_8.1.2.2/install/FusionInsight-Sqoop-1.99.3/FusionInsight-Sqoop-1.99.3/server/webapps/loader/WEB-INF/ext-lib.
Run the following command on the active and standby nodes as user root to modify the permission:
cd ${BIGDATA_HOME}/FusionInsight_Porter_8.1.2.2/install/FusionInsight-Sqoop-1.99.3/FusionInsight-Sqoop-1.99.3/server/webapps/loader/WEB-INF/ext-lib
chown omm:wheel JAR package name
chmod 600 JAR package name
Log in to FusionInsight Manager. Choose Cluster > Name of the desired cluster > Service > Loader > More > Restart. Enter the password of the system administrator to restart the Loader service.
Access the Loader web UI.
Log in to FusionInsight Manager. For details, see Accessing FusionInsight Manager.
Choose Cluster > Name of the desired cluster > Services > Loader.
Click LoaderServer(Node name, Active). The Loader web UI is displayed.
Create a Loader data import job. Click New Job. Select the required job type in 1. Basic Information and click Next.
Set Name to the job name and Type to Export.
Select a connection for Connection. By default, no connection is created. Click Add to create a connection, and then click Test to test whether the connection is available. Click OK when the system displays a message indicates that the test is successful.
¶ Connector Type
Parameter
Description
generic-jdbc-connector
JDBC Driver Class
Specifies the name of a JDBC driver class.
JDBC Connection String
Specifies the JDBC connection string.
Username
Specifies the username for connecting to the database.
Password
Specifies the password for connecting to the database.
JDBC Connection Properties
Specifies JDBC connection attributes. Click Add to manually add connection attributes.
Name: connection attribute name
Value: connection attribute value
hdfs-connector
-
-
oracle-connector
JDBC Connection String
Specifies connection string for a user to connect to the database.
Username
Specifies the username for connecting to the database.
Password
Specifies the password for connecting to the database.
Connection Properties
Specifies connection attributes. Click Add to manually add connection attributes.
Name: connection attribute name
Value: connection attribute value
mysql-fastpath-connector
JDBC Connection String
Specifies the JDBC connection string.
Username
Specifies the username for connecting to the database.
Password
Specifies the password for connecting to the database.
Connection Properties
Specifies connection attributes. Click Add to manually add connection attributes.
Name: connection attribute name
Value: connection attribute value
sftp-connector
SFTP Server IP
Specifies the IP address of the SFTP server.
SFTP Server Port
Specifies the port number of the SFTP server.
SFTP Username
Specifies the username for accessing the SFTP server.
SFTP Password
Specifies the password for accessing the SFTP server.
SFTP Public Key
Specifies public key of the SFTP server.
oracle-partition-connector
JDBC Driver Class
Specifies the name of a Java database connectivity (JDBC) driver class.
JDBC Connection String
Specifies the JDBC connection string.
Username
Specifies the username for connecting to the database.
Password
Specifies the password for connecting to the database.
Connection Properties
Specifies connection attributes. Click Add to manually add connection attributes.
Name: connection attribute name
Value: connection attribute value
Set Group to the group to which the job belongs. By default, there is no created group. Click Add to create a group and click OK.
Queue indicates that Loader tasks are executed in a specified Yarn queue. The default value is root.default, which indicates that the tasks are executed in the default queue.
Set Priority to the priority of Loader tasks in the specified Yarn queue. The options can be VERY_LOW, LOW, NORMAL, HIGH, or VERY_HIGH. The default value is NORMAL.
In the 2. Input Settings area, set the data source and click Next.
Note
When creating or editing a Loader job, you can use macro definitions when configuring parameters such as the SFTP path, HDFS/OBS path, and Where condition of SQL. For details, see Using Macro Definitions in Configuration Items.
¶ Source File Type
Parameter
Description
HDFS/OBS
Input Directory
Specifies the input path when data is exported from HDFS/OBS.
Path Filter
Specifies the wildcard for filtering the directories in the input paths of the source files. Input Directory is not used in filtering. If there are multiple filter conditions, use commas (,) to separate them. If the value is empty, the directory is not filtered. The regular expression filtering is not supported.
File Filter
Specifies the wildcard for filtering the file names of the source files. If there are multiple filter conditions, use commas (,) to separate them. The value cannot be left blank. The regular expression filtering is not supported.
File Type
Specifies the file import type.
TEXT_FILE: imports a text file and stores it as a text file.
SEQUENCE_FILE: imports a text file and stores it as a sequence file.
BINARY_FILE: imports files of any format by using binary streams.
File Split Type
Specifies whether to split source files by file name or size. The files obtained after the splitting are used as the input files of each map in the MapReduce task for data export.
Extractors
Specifies the number of maps that are started at the same time in a MapReduce job of a data configuration operation. This parameter cannot be set when Extractor Size is set. The value must be less than or equal to 3000.
Extractor size
Specifies the size of data processed by maps that are started in a MapReduce job of a data configuration operation. The unit is MB. The value must be greater than or equal to 100. The recommended value is 1000. This parameter cannot be set when Extractors is set. When a relational database connector is used, Extractor size is unavailable. You need to set Extractors.
HBASE
HBase Instance
Specifies the HBase service instance that Loader selects from all available HBase service instances in the cluster. If the selected HBase service instance is not added to the cluster, the HBase job cannot be run properly.
Quantity
Specifies the number of maps that are started at the same time in a MapReduce job of a data configuration operation. The value must be less than or equal to 3000.
HIVE
Hive instance
Specifies the Hive service instance that Loader selects from all available Hive service instances in the cluster. If the selected Hive service instance is not added to the cluster, the Hive job cannot run properly.
Quantity
Specifies the number of maps that are started at the same time in a MapReduce job of a data configuration operation. The value must be less than or equal to 3000.
SPARK
Spark instance
Only SparkSQL can access Hive data. Specifies the SparkSQL service instance that Loader selects from all available SparkSQL service instances in the cluster. If the selected Spark service instance is not added to the cluster, the Spark job cannot be run properly.
Quantity
Specifies the number of maps that are started at the same time in a MapReduce job of a data configuration operation. The value must be less than or equal to 3000.
In the 3. Convert area, set the conversion operations during data transmission.
Check whether source data values in the data operation job created by the Loader can be directly used without conversion, including upper and lower case conversion, cutting, merging, and separation.
No created conversion step exists by default. Drag an example conversion step on the left to the edit box to create a new conversion step.
Conversion step types must be selected based on service requirements. A complete conversion process includes the following types:
Input type. Only one conversion step can be added. This parameter is mandatory if the task involves HBase or relational databases.
Conversion type, which is an intermediate conversion step. You can add one or more conversion types or do not add any conversion type.
Output type. Only one output type can be added in the last conversion step. This parameter is mandatory if the task involves HBase or relational databases.
¶ Type
Description
Input Type
CSV File Input: CSV file input step for configuring separators to generate multiple fields.
Fixed-Width File Input: Text file input step for configuring the length of characters or bytes to be truncated to generate multiple fields.
Table Input: relational data input step for configuring specified columns in the database as input fields.
HBase Input: HBase table input step for configuring the column definition of an HBase table to a specified field.
HTML Input: HTML web page data input step for obtaining the target data of the HTML web page file to the specified field.
Hive Input: Hive table input step for defining columns in a Hive table to specified fields.
Spark Input: Spark SQL table input step for defining columns in the SparkSQL table to specified fields. Only Hive data can be stored and accessed.
Conversion type
Long Integer Time Conversion: Configure the conversion between a long integer value and a date.
Null Value Conversion: Configure a specified value to replace the null value.
Random Value Conversion: Configure new value-added fields as random data fields.
Adding a Constant Field: Add a constant to directly generate a constant field.
Concatenation and Conversion: Concatenate fields, connect generated fields using connection characters, and convert new fields.
Separator Conversion: Configure the generated fields to be separated by separators and convert new fields.
Modulo Conversion: Configure the generated fields to be converted into new fields through modulo operation.
Cutting Character String: Truncate a generated field based on a specified position to generate a new field.
EL Operation Conversion: Calculate field values. Currently, the following operators are supported: md5sum, sha1sum, sha256sum, and sha512sum.
Character String Case Conversion: Configure the generated fields to be converted to new fields through case conversion.
Reverse String Conversion: Reverse the generated fields to generate new fields.
Character String Space Clearing Conversion: Configure the generated fields to clear spaces and convert them to new fields.
Row Filtering Conversion: Configure logical conditions to filter out rows that contain triggering conditions.
Update Fields: Update the value of a specified field when certain conditions are met.
Output type
File Output: Configure generated fields to be connected by separators and exported to a file.
Table Output: Configure the mapping between output fields and specified columns in the database.
HBase Output: Configure the generated fields to the columns of the HBase table.
Hive Output: Configure generated fields to a column of a Hive table.
Spark Output: Configure generated fields to the columns of SparkSQL tables. Only SparkSQL can access Hive data.
The edit box allows you to perform the following tasks:
Re-command: Rename an example.
Edit: Edit the step conversion by referring to 6.c.
Delete: Delete an example.
Note
You can also use the shortcut key Del to delete the file.
Click Edit to edit the step conversion information and configure fields and data.
For details about how to set parameters in the step conversion information, see Operator Help.
If the conversion step is incorrectly configured, the source data cannot be converted and become dirty data. The dirty data marking rules are as follows:
In any input type step, the number of fields contained in the original data is less than the number of configured fields or the field values in the original data do not match the configured field type.
In the CSV File Input step, Validate input field checks whether the input field matches the value type. If the input field and value type of a line do not match, the line is skipped and becomes dirty data.
In the Fixed Width File Input step, Fixed Length specifies the field splitting length. If the length is greater than the length of the original field value, data splitting fails and the current line becomes dirty data.
In the HBase Input step, if the HBase table name specified by HBase Table Name is incorrect, or no primary key column is configured for Primary Key, all data becomes dirty data.
In any conversion step, lines whose conversion fails becomes dirty data. For example, in the Split Conversion step, the number of generated fields is less than the number of configured fields, or the original data cannot be converted to the String type, and the current row becomes dirty data.
In the Filter Row Conversion step, rows filtered by filter criteria become dirty data.
In the Modulo Conversion step, if the original field value is NULL, the current row becomes dirty data.
Click Next.
In the 4. Output Settings area, set the destination location for saving data and click Save to save the job or click Save and Run to save and run the job.
¶ Data Connection Type
Parameter
Description
sftp-connector
Output Path
Path or name of the export file on an SFTP server. If multiple SFTP server IP addresses are configured for the connector, you can set this parameter to multiple paths or file names separated with semicolons (;). Ensure that the number of input paths or file names is the same as the number of SFTP servers configured for the connector.
Operation
Specifies the action during data import. When all data is to be imported from the input path to the destination path, the data is stored in a temporary directory and then copied from the temporary directory to the destination path. After the data is imported successfully, the data is deleted from the temporary directory. One of the following actions can be taken when duplicate file names exist during data transfer:
OVERRIDE: overrides the old file.
RENAME: renames as new file. For a file without an extension, a string is added to the file name as the extension; for a file with an extension, a string is added to the extension. The string is unique.
APPEND: adds the content of the new file to the end of the old file. This action only adds content regardless of whether the file can be used. For example, a text file can be used after this operation, while a compressed file cannot.
IGNORE: reserves the old file and does not copy the new file.
ERROR: stops the task and reports an error if duplicate file names exist. Transferred files are imported successfully, while files that have duplicate names and files that are not transferred fail to be imported.
Encode type
Specifies the exported file encoding format, for example, UTF-8. This parameter can be set only in text file export.
Compression
Indicates whether to enable the compressed transmission function when SFTP is used to export data. true indicates that compression is enabled, and false indicates that compression is disabled.
hdfs-connector
Output Path
Specifies the output directory or file name of the export file in the HDFS/OBS.
File Format
Specifies the file export type.
TEXT_FILE: imports a text file and stores it as a text file.
SEQUENCE_FILE: imports a text file and stores it as a sequence file.
BINARY_FILE: imports files of any format by using binary streams.
Compression codec
Specifies the compression format of files exported to HDFS/OBS. Select a format from the drop-down list. If you select NONE or do not set this parameter, data is not compressed.
User-defined compression format
Name of a user-defined compression format type.
generic-jdbc-connector
Schema name
Specifies the database schema name.
Table name
Specifies the name of a database table that is used to save the final data of the transmission.
Temporary table
Specifies the name of a temporary database table that is used to save temporary data during the transmission. The fields in the table must be the same as those in the database specified by Table name.
oracle-partition-connector
Schema Name
Specifies the database schema name.
Table Name
Specifies the name of a database table that is used to save the final data of the transmission.
Temporary Table
Specifies the name of a temporary database table that is used to save temporary data during the transmission. The fields in the table must be the same as those in the database specified by Table name.
oracle-connector
Table Name
Destination table name to store data.
Column Name
Specifies the name of the column to be written. Columns that are not specified can be set to null or the default value.
mysql-fastpath-connector
Schema Name
Specifies the database schema name.
Table Name
Specifies the name of a database table that is used to save the final data of the transmission.
Temporary Table Name
Name of the temporary table, which is used to store data. After the job is successfully executed, data is transferred to the formal table.
On the Loader WebUI page, you can view, start, stop, copy, delete, edit, and view historical information about created jobs.