This section describes the procedure of using Spark CarbonData. All tasks are based on the Spark-beeline environment. The following tasks are included:
Before performing any operation on CarbonData, users must connect CarbonData to Spark.
After CarbonData connects to Spark, users must create a CarbonData table to load and query data.
Users load data from CSV files in HDFS to the CarbonData table.
After data is loaded to the CarbonData table, users can run query commands such as groupby and where.
The client has been updated.
For example, if you have updated the client on the Master2 node, log in to the Master2 node to use the client. For details, see Client Management.
sudo su - omm
kinit Spark username
The user must be added to the hive group.
Run the following command to create a CarbonData table, which is used to load and query data.
CREATE TABLE x1 (imei string, deviceInformationId int, mac string, productdate timestamp, updatetime timestamp, gamePointId double, contractNumber double)
STORED BY 'org.apache.carbondata.format'
+---------+--+ | result | +---------+--+ +---------+--+ No rows selected (1.551 seconds)
Currently, only CSV files are supported. The CSV column names specified in the LOAD command must be the same and in the same sequence as the column names in the CarbonData table. The data formats and number of data columns in the CSV files must also be the same as those in the CarbonData table.
The CSV files must be stored on HDFS. Users can upload the files to OBS and import them from OBS to HDFS on the File Management page of the MRS management console. If Kerberos authentication is enabled, prepare the CSV files in the work environment and import them to HDFS using open-source HDFS commands. In addition, assign the Spark user with the read and execute permissions of the files on HDFS.
For example, the data.csv file is saved in the tmp directory of HDFS and has the following contents:
x123,111,dd,2017-04-20 08:51:27,2017-04-20 07:56:51,2222,33333
The command for loading data from the file is as follows:
LOAD DATA inpath 'hdfs://hacluster/tmp/data.csv' into table x1 options('DELIMITER'=',','QUOTECHAR'='"','FILEHEADER'='imei, deviceinformationid,mac,productdate,updatetime,gamepointid,contractnumber');
+---------+--+ | Result | +---------+--+ +---------+--+ No rows selected (3.039 seconds)
Run the following command to obtain the number of records in the CarbonData table:
select count(*) from x1;
Run the following command to obtain the deviceinformationid records without repetition in the CarbonData table:
select deviceinformationid,count (distinct deviceinformationid) from x1 group by deviceinformationid;
Run the following command to obtain specific deviceinformationid records:
select * from x1 where deviceinformationid='111';
If the query result has Chinese or other non-English characters, the columns in the query result may not be aligned. This is because characters of different languages occupy different widths.