The distribution column in a hash table must meet the following requirements, listed in descending order of priority:
For a hash table, a poorly chosen distribution key can cause data skew or poor I/O performance on some DNs. After choosing the key, check the table to confirm that data is evenly distributed across the DNs. Run the following SQL statement to check for data skew:
select xc_node_id, count(1) from tablename group by xc_node_id order by xc_node_id desc;
Each xc_node_id value corresponds to a DN. Generally, a difference of more than 5% in the amount of data between DNs is regarded as data skew. If the difference exceeds 10%, choose another distribution column.
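The 5% and 10% thresholds can be applied to the output of the query above with a small helper. The following is a minimal sketch, not part of DWS: the function names `skew_percent` and `classify` are hypothetical, and it assumes that "difference" means the gap between the largest and smallest per-DN row counts, relative to the largest. The text above does not define the exact formula, so adjust it if your site uses another definition.

```python
from typing import Iterable, Tuple


def skew_percent(rows: Iterable[Tuple[int, int]]) -> float:
    """Return the skew between DNs as a percentage.

    rows holds (xc_node_id, row_count) pairs, as returned by the
    skew-check query. Assumption: skew = (max - min) / max * 100.
    """
    counts = [count for _, count in rows]
    if not counts:
        raise ValueError("no rows returned by the skew-check query")
    largest, smallest = max(counts), min(counts)
    if largest == 0:
        return 0.0  # an empty table cannot be skewed
    return (largest - smallest) * 100.0 / largest


def classify(rows: Iterable[Tuple[int, int]],
             skew_threshold: float = 5.0,
             redistribute_threshold: float = 10.0) -> str:
    """Map the skew percentage onto the guidance in the text."""
    pct = skew_percent(rows)
    if pct > redistribute_threshold:
        return "choose another distribution column"
    if pct > skew_threshold:
        return "data skew"
    return "balanced"
```

For example, passing the pairs `[(1, 1000), (2, 930)]` gives a skew of 7%, which is above the 5% threshold but below the 10% threshold at which a different distribution column is recommended.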
In DWS, you can select multiple distribution columns to distribute data evenly.