• Data Warehouse Service

  1. Help Center
  2. Data Warehouse Service
  3. Developer Guide
  4. Query Performance Optimization
  5. Tuning Queries
  6. Typical SQL Optimization Methods
  7. Data Skew Optimization

Data Skew Optimization

Data Skew Occurred in the Storage-Layer

In the DWS database, data is distributed and stored on each DN. You can improve the query efficiency by using distributed execution. However, if data skew occurs, bottlenecks exist on some DNs during distribution execution, affecting the query performance. This is because the distribution column is not properly selected. This can be solved by adjusting the distribution column.

For example:

explain performance select count(*) from inventory;
5 --CStore Scan on lmz.inventory
         dn_6001_6002 (actual time=0.444..83.127 rows=42000000 loops=1)
         dn_6003_6004 (actual time=0.512..63.554 rows=27000000 loops=1)
         dn_6005_6006 (actual time=0.722..99.033 rows=45000000 loops=1)
         dn_6007_6008 (actual time=0.529..100.379 rows=51000000 loops=1)
         dn_6009_6010 (actual time=0.382..71.341 rows=36000000 loops=1)
         dn_6011_6012 (actual time=0.547..100.274 rows=51000000 loops=1)
         dn_6013_6014 (actual time=0.596..118.289 rows=60000000 loops=1)
         dn_6015_6016 (actual time=1.057..132.346 rows=63000000 loops=1)
         dn_6017_6018 (actual time=0.940..110.310 rows=54000000 loops=1)
         dn_6019_6020 (actual time=0.231..41.198 rows=21000000 loops=1)
         dn_6021_6022 (actual time=0.927..114.538 rows=54000000 loops=1)
         dn_6023_6024 (actual time=0.637..118.385 rows=60000000 loops=1)
         dn_6025_6026 (actual time=0.288..32.240 rows=15000000 loops=1)
         dn_6027_6028 (actual time=0.566..118.096 rows=60000000 loops=1)
         dn_6029_6030 (actual time=0.423..82.913 rows=42000000 loops=1)
         dn_6031_6032 (actual time=0.395..78.103 rows=39000000 loops=1)
         dn_6033_6034 (actual time=0.376..51.052 rows=24000000 loops=1)
         dn_6035_6036 (actual time=0.569..79.463 rows=39000000 loops=1)

In the performance information, you can view the number of scan rows of each DN in the inventory table. The number of rows of each DN differs a lot, the biggest is 63,000,000 and the smallest value is 15,000,000. This value difference on the performance of data scan is acceptable, but if the join operator exists in the upper-layer, the impact on the performance cannot be ignored.

Generally, the data table is hash distributed on each DN; therefore, it is important to choose a proper distribution column. Run table_skewness() to view data skew of each DN in the inventory table. The query result is as follows:

select table_skewness('inventory');
 ("dn_6015_6016        ",63000000,8.046%)
 ("dn_6013_6014        ",60000000,7.663%)
 ("dn_6023_6024        ",60000000,7.663%)
 ("dn_6027_6028        ",60000000,7.663%)
 ("dn_6017_6018        ",54000000,6.897%)
 ("dn_6021_6022        ",54000000,6.897%)
 ("dn_6007_6008        ",51000000,6.513%)
 ("dn_6011_6012        ",51000000,6.513%)
 ("dn_6005_6006        ",45000000,5.747%)
 ("dn_6001_6002        ",42000000,5.364%)
 ("dn_6029_6030        ",42000000,5.364%)
 ("dn_6031_6032        ",39000000,4.981%)
 ("dn_6035_6036        ",39000000,4.981%)
 ("dn_6009_6010        ",36000000,4.598%)
 ("dn_6003_6004        ",27000000,3.448%)
 ("dn_6033_6034        ",24000000,3.065%)
 ("dn_6019_6020        ",21000000,2.682%)
 ("dn_6025_6026        ",15000000,1.916%)
(18 rows)

This section describes how to query the table creation, you can find that defined the table uses inv_date_sk as the distribution column. As a result, no data skew occurs. You can view the data distribution of each column, we to inv_item_sk as distribution column, the slant distribution are as follows:

select table_skewness('inventory');
 ("dn_6001_6002        ",43934200,5.611%)
 ("dn_6007_6008        ",43829420,5.598%)
 ("dn_6003_6004        ",43781960,5.592%)
 ("dn_6031_6032        ",43773880,5.591%)
 ("dn_6033_6034        ",43763280,5.589%)
 ("dn_6011_6012        ",43683600,5.579%)
 ("dn_6013_6014        ",43551660,5.562%)
 ("dn_6027_6028        ",43546340,5.561%)
 ("dn_6009_6010        ",43508700,5.557%)
 ("dn_6023_6024        ",43484540,5.554%)
 ("dn_6019_6020        ",43466800,5.551%)
 ("dn_6021_6022        ",43458500,5.550%)
 ("dn_6017_6018        ",43448040,5.549%)
 ("dn_6015_6016        ",43247700,5.523%)
 ("dn_6005_6006        ",43200240,5.517%)
 ("dn_6029_6030        ",43181360,5.515%)
 ("dn_6025_6026        ",43179700,5.515%)
 ("dn_6035_6036        ",42960080,5.487%)
(18 rows)

Data is distributed for tilt of the problem is solved.

The intermediate results. As a result, the data skew

The data on the storage layer is balanced. However, the operation of a certain operator on the DN output during appears in the result set of deflection. As a result, the operator of the upper-layer operation as it encounters computing skew. In the following example, the output of the stream operator whose ID is 5 has severe skew on each DN, and the data volume output from the dn_6009_6010 DN is 2.6 times that of other nodes. As a result, the stream operator and upper-layer operators have serious computing skew.

Data skew caused by the medium results can be avoided by rewriting SQL statements. If SQL statements rewriting cannot solve this problem, you need to consider the service layer. In the preceding example, the query can be changed as follows:

sum(t1_tmp.count * t2_tmp.count) 
from (select count(1) as count, a from t1 group by a) t1_tmp, 
(select count(1) as count, b from t2 group by b) t2_tmp 
where t2_tmp.b = t1_tmp.a;

If the execution time after rewriting is shortened to 320.583 ms, the performance is improved by 60%.