• MapReduce Service

mrs
  1. Help Center
  2. MapReduce Service
  3. User Guide
  4. Using MRS
  5. Using CarbonData
  6. Creating a CarbonData Table

Creating a CarbonData Table

Scenario

A CarbonData table must be created to load and query data.

Creating a Table with Self-Defined Columns

Users can create a table by specifying its columns and data types. For analysis clusters with Kerberos authentication enabled, if a user wants to create a CarbonData table in a database other than the default database, the Create permission of the database must be added to the role that the user is bound to in Hive role management.

Command example:

CREATE TABLE IF NOT EXISTS productdb.productSalesTable (

productNumber Int,

productName String,

storeCity String,

storeProvince String,

revenue Int)

STORED BY 'org.apache.carbondata.format'

TBLPROPERTIES (

'table_blocksize'='128',

'DICTIONARY_EXCLUDE'='productName',

'DICTIONARY_INCLUDE'='productNumber');

The following table describes the command parameters.

Table 1 Parameter description

Parameter

Description

productSalesTable

Indicates the table name. The table is used to load data for analysis.

The table name consists of letters, digits, and underscores (_).

productdb

Indicates the database name. The database maintains logical connections with tables that it stores to identify and manage the tables.

The database name consists of letters, digits, and underscores (_).

productNumber

productName

storeCity

storeProvince

revenue

Indicate columns in the table. The columns are service entities for data analysis.

The column name (field name) consists of letters, digits, and underscores (_).

table_blocksize

Indicates the block size of data files used by the CarbonData table. The value ranges from 1 MB to 2048 MB. The default is 1024 MB.

  • If table_blocksize is too small, a large number of small files will be generated when data is loaded. This may affect the performance of HDFS.
  • If table_blocksize is too large, a large volume of data must be read from a block and the read concurrency is low when data is queried. As a result, the query performance deteriorates.

It is advised to set the block size based on the data volume. For example, set the block size to 256 MB for GB-level data, 512 MB for TB-level data, and 1024 MB for PB-level data.

DICTIONARY_EXCLUDE

Specifies the columns that do not generate dictionaries. This function is optional and applicable to columns of high complexity. By default, the system generates dictionaries for columns of the String type. However, as the number of values in the dictionaries increases, conversion operations by the dictionaries increase and the system performance deteriorates.

Generally, if a column has over 50,000 unique data records, it is considered as a highly complex column and dictionary generation must be disabled.

NOTE:

Non-dictionary columns support only the String and Timestamp data types.

DICTIONARY_INCLUDE

Specifies the columns that generate dictionaries. This function is optional and applicable to columns of low complexity (with fewer than 50,000 unique data records). It improves the performance of queries with the groupby condition.