Clustering Configuration
This section applies only to MRS 3.2.0 or later.
Clustering is controlled by two strategy parameters: hoodie.clustering.plan.strategy.class and hoodie.clustering.execution.strategy.class. Typically, if hoodie.clustering.plan.strategy.class is set to SparkRecentDaysClusteringPlanStrategy or SparkSizeBasedClusteringPlanStrategy, you do not need to specify hoodie.clustering.execution.strategy.class. However, if hoodie.clustering.plan.strategy.class is set to SparkSingleFileSortPlanStrategy, hoodie.clustering.execution.strategy.class must be set to SparkSingleFileSortExecutionStrategy.
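The pairing rule above can be checked before a job is submitted. The following is a minimal sketch, not part of Hudi: `check_strategy_pairing` is a hypothetical helper, and it assumes the options are held in a plain dictionary keyed by parameter name.

```python
# Hypothetical helper (not a Hudi API): validates the plan/execution pairing rule.
PLAN_KEY = "hoodie.clustering.plan.strategy.class"
EXEC_KEY = "hoodie.clustering.execution.strategy.class"


def check_strategy_pairing(options):
    """Return a list of human-readable problems; an empty list means the pairing is fine."""
    problems = []
    # Compare on the simple class name so fully qualified names also match.
    plan_name = options.get(PLAN_KEY, "").rsplit(".", 1)[-1]
    exec_name = options.get(EXEC_KEY, "").rsplit(".", 1)[-1]

    if (plan_name == "SparkSingleFileSortPlanStrategy"
            and exec_name != "SparkSingleFileSortExecutionStrategy"):
        problems.append(
            EXEC_KEY + " must be set to SparkSingleFileSortExecutionStrategy "
            "when the plan strategy is SparkSingleFileSortPlanStrategy"
        )
    return problems


if __name__ == "__main__":
    # Plan strategy that requires an explicit execution strategy, which is missing here.
    bad = {PLAN_KEY: "SparkSingleFileSortPlanStrategy"}
    for problem in check_strategy_pairing(bad):
        print(problem)
```

With SparkRecentDaysClusteringPlanStrategy or SparkSizeBasedClusteringPlanStrategy the helper reports nothing, because the execution strategy may be left unset in those cases.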
Parameter | Description | Default Value |
hoodie.clustering.inline | Whether to execute clustering synchronously | false |
hoodie.clustering.inline.max.commits | Number of commits that trigger clustering | 4 |
| Maximum size of each file after clustering | 1024 * 1024 * 1024 bytes |
hoodie.clustering.plan.strategy.small.file.limit | Files smaller than this size will be clustered. | 300 * 1024 * 1024 bytes |
hoodie.clustering.plan.strategy.sort.columns | Columns used for sorting in clustering | None |
hoodie.layout.optimize.strategy | Sorting mode used to lay out data during clustering. Three modes are available: linear, z-order, and hilbert. | linear |
hoodie.layout.optimize.enable | Set this parameter to true when z-order or hilbert is used. | false |
hoodie.clustering.plan.strategy.class | Strategy class for selecting file groups for clustering. By default, files smaller than the value of hoodie.clustering.plan.strategy.small.file.limit are selected. | org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy |
hoodie.clustering.execution.strategy.class | Strategy class for executing clustering (subclass of RunClusteringStrategy), which defines how a clustering plan is executed. The default class sorts the file groups in the plan by the specified columns and writes files that meet the configured target file size. | |
hoodie.clustering.plan.strategy.max.num.groups | Maximum number of file groups that can be selected during clustering. A larger value indicates a higher concurrency. | 30 |
| Maximum amount of data in each file group involved in clustering | 2 * 1024 * 1024 * 1024 bytes |
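As an illustration of how the parameters in the table fit together, the sketch below assembles an options dictionary for inline clustering. `build_clustering_options` is a hypothetical helper, not a Hudi API. It assumes that values are passed as strings, that multiple sort columns are comma-separated, and that size limits are written as plain byte counts. It also applies the rule that hoodie.layout.optimize.enable must be true when z-order or hilbert is used.

```python
# Hypothetical helper (not a Hudi API): assembles clustering options from the table above.
MB = 1024 * 1024

SPACE_FILLING_MODES = ("z-order", "hilbert")


def build_clustering_options(sort_columns, layout_strategy="linear"):
    """Return clustering options as a dict of string keys and string values."""
    if layout_strategy != "linear" and layout_strategy not in SPACE_FILLING_MODES:
        raise ValueError("layout_strategy must be linear, z-order, or hilbert")

    # z-order and hilbert require hoodie.layout.optimize.enable=true.
    layout_enabled = layout_strategy in SPACE_FILLING_MODES

    return {
        "hoodie.clustering.inline": "true",                  # run clustering synchronously
        "hoodie.clustering.inline.max.commits": "4",         # table default
        "hoodie.clustering.plan.strategy.small.file.limit": str(300 * MB),
        # Assumption: multiple sort columns are given as a comma-separated list.
        "hoodie.clustering.plan.strategy.sort.columns": ",".join(sort_columns),
        "hoodie.clustering.plan.strategy.max.num.groups": "30",
        "hoodie.layout.optimize.strategy": layout_strategy,
        "hoodie.layout.optimize.enable": "true" if layout_enabled else "false",
    }


if __name__ == "__main__":
    # "ts" and "id" are placeholder column names.
    for key, value in build_clustering_options(["ts", "id"], "z-order").items():
        print(key + "=" + value)
```

In a PySpark job, a dictionary like this would typically be merged with the table's other write options and passed to the DataFrame writer, for example through `options(**opts)`. The exact write path depends on the job and is not covered here.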