• Data Warehouse Service

  1. Help Center
  2. Data Warehouse Service
  3. Developer Guide
  4. Best Practices
  5. Best Practices of Table Design
  6. Selecting a Distribution Column

Selecting a Distribution Column

The distribution column in a hash table must meet the following requirements, which are ranked by priority in descending order::

  1. The value of the distribution column should be discrete so that data can be evenly distributed on each DN. For example, you are advised to select the primary key of a table as the distribution column, and the ID card number as the distribution column in a personnel information table.
  2. Do not select the column where a constant filter exists. For example, if a constant constraint (for example, zqdh= '000001') exists on the zqdh column in some queries on the dwcjk table, you are not advised to use zqdh as the distribution column.
  3. Select the join condition as the distribution column, so that JOIN tasks can be pushed down to DNs to execute, and that data transfered among DNs can be decreased.

For a hash table, an improper distribution key may cause data skew or poor I/O performance on certain DNs. Therefore, you need to check the table to ensure that data is evenly distributed on each DN. You can run the following SQL statements to check data skew:

xc_node_id, count(1) 
from tablename 
group by xc_node_id 
order by xc_node_id desc;

xc_node_id corresponds to a DN. Generally, over 5% difference between the amount of data on different DNs is regarded as data skew. If the difference is over 10%, choose another distribution column.

Multiple distribution columns can be selected in DWS to evenly distribute data.