Clustering Configuration

Note

This section applies only to MRS 3.2.0 or later.

Clustering is controlled by two strategy classes, set through hoodie.clustering.plan.strategy.class and hoodie.clustering.execution.strategy.class. Typically, if hoodie.clustering.plan.strategy.class is set to SparkRecentDaysClusteringPlanStrategy or SparkSizeBasedClusteringPlanStrategy, hoodie.clustering.execution.strategy.class does not need to be specified and its default value is used. However, if hoodie.clustering.plan.strategy.class is set to SparkSingleFileSortPlanStrategy, hoodie.clustering.execution.strategy.class must be set to SparkSingleFileSortExecutionStrategy.
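The pairing rule above can be checked before a job is submitted. The following is a minimal sketch in plain Python, not part of Hudi or MRS: the function name strategy_errors is hypothetical, and class names are compared by their simple name only, so it makes no assumption about package paths.

```python
PLAN_KEY = "hoodie.clustering.plan.strategy.class"
EXEC_KEY = "hoodie.clustering.execution.strategy.class"


def simple_name(class_name):
    """Return the part of a class name after the last dot."""
    return class_name.rsplit(".", 1)[-1]


def strategy_errors(options):
    """Return a list of problems with the plan/execution strategy pairing.

    Hypothetical helper: it only encodes the rule stated in this section,
    namely that SparkSingleFileSortPlanStrategy requires
    SparkSingleFileSortExecutionStrategy.
    """
    errors = []
    plan = simple_name(options.get(PLAN_KEY, ""))
    execution = simple_name(options.get(EXEC_KEY, ""))
    if (plan == "SparkSingleFileSortPlanStrategy"
            and execution != "SparkSingleFileSortExecutionStrategy"):
        errors.append(
            EXEC_KEY + " must be set to SparkSingleFileSortExecutionStrategy "
            "when " + PLAN_KEY + " is SparkSingleFileSortPlanStrategy"
        )
    return errors


# A size-based plan needs no explicit execution strategy.
print(strategy_errors({PLAN_KEY: "SparkSizeBasedClusteringPlanStrategy"}))
```

An empty list means the pairing is consistent with the rule; any other combination of strategy classes is outside what this sketch checks.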

Parameter

Description

Default Value

hoodie.clustering.inline

Whether to execute clustering synchronously (inline) as part of the write operation

false

hoodie.clustering.inline.max.commits

Number of commits after which inline clustering is triggered

4

hoodie.clustering.plan.strategy.target.file.max.bytes

Maximum size of each file after clustering

1024 * 1024 * 1024 bytes (1 GB)

hoodie.clustering.plan.strategy.small.file.limit

Files smaller than this size will be clustered.

300 * 1024 * 1024 bytes (300 MB)

hoodie.clustering.plan.strategy.sort.columns

Columns used for sorting in clustering

None

hoodie.layout.optimize.strategy

Layout optimization (sorting) strategy used during clustering. Three sorting modes are available: linear, z-order, and hilbert.

linear

hoodie.layout.optimize.enable

Set this parameter to true when z-order or hilbert is used.

false

hoodie.clustering.plan.strategy.class

Strategy class that selects the file groups to be clustered. By default, files smaller than the value of hoodie.clustering.plan.strategy.small.file.limit are selected.

org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy

hoodie.clustering.execution.strategy.class

Strategy class for executing clustering (subclass of RunClusteringStrategy), which defines how a clustering plan is executed.

The default class sorts the file groups in the plan by the specified columns and writes files that meet the configured target file size.

org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy

hoodie.clustering.plan.strategy.max.num.groups

Maximum number of file groups that can be selected during clustering. A larger value allows higher concurrency.

30

hoodie.clustering.plan.strategy.max.bytes.per.group

Maximum amount of data, in bytes, in each file group involved in clustering

2 * 1024 * 1024 * 1024 bytes (2 GB)
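As an illustration of how the parameters in the table fit together, the sketch below assembles them into an options dictionary in plain Python. The sort columns (region, event_time) and the helper names layout_options and max_bytes_per_plan are hypothetical. The upper bound it computes assumes that one clustering plan handles at most max.num.groups groups of at most max.bytes.per.group bytes each; treat that as an estimate rather than a guarantee.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

NUM_GROUPS_KEY = "hoodie.clustering.plan.strategy.max.num.groups"
GROUP_BYTES_KEY = "hoodie.clustering.plan.strategy.max.bytes.per.group"


def layout_options(strategy):
    """Options for the sorting mode: linear, z-order, or hilbert.

    Per the table, hoodie.layout.optimize.enable is set to true
    when z-order or hilbert is used.
    """
    if strategy not in ("linear", "z-order", "hilbert"):
        raise ValueError("unknown layout strategy: " + strategy)
    enable = "true" if strategy in ("z-order", "hilbert") else "false"
    return {
        "hoodie.layout.optimize.strategy": strategy,
        "hoodie.layout.optimize.enable": enable,
    }


def max_bytes_per_plan(options):
    """Estimated upper bound on data handled by one clustering plan."""
    return int(options[NUM_GROUPS_KEY]) * int(options[GROUP_BYTES_KEY])


clustering_options = {
    "hoodie.clustering.inline": "true",           # default is false
    "hoodie.clustering.inline.max.commits": "4",  # default
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1 * GIB),
    "hoodie.clustering.plan.strategy.small.file.limit": str(300 * MIB),
    # Hypothetical column names, for illustration only.
    "hoodie.clustering.plan.strategy.sort.columns": "region,event_time",
    NUM_GROUPS_KEY: "30",
    GROUP_BYTES_KEY: str(2 * GIB),
}
clustering_options.update(layout_options("z-order"))

print(max_bytes_per_plan(clustering_options) // GIB)  # prints 60
```

In a Spark job, a dictionary like this would typically be passed along with the other Hudi write options, for example through the DataFrame writer's options(**clustering_options) call.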