Maximum number of records to be written into a single file. If the value is zero or negative, there is no limit.
spark.sql.shuffle.partitions
200
Default number of partitions used when shuffling data for joins or aggregations.
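A minimal PySpark sketch of adjusting this value for a session; the table name (employees) and the target value of 50 are illustrative only:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shuffle-partitions-demo").getOrCreate()

    # Lower the shuffle partition count for a small aggregation job
    # (the default of 200 can produce many tiny tasks on small data sets).
    spark.conf.set("spark.sql.shuffle.partitions", "50")

    # The aggregation below now shuffles into 50 partitions.
    spark.sql("SELECT dept, count(*) FROM employees GROUP BY dept").show()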
spark.sql.dynamicPartitionOverwrite.enabled
false
Whether DLI overwrites only the partitions into which data is actually written at runtime. If you set this parameter to false, all partitions that match the overwrite condition are deleted before the overwrite starts. For example, if you set it to false and use INSERT OVERWRITE to write partition 2021-02 to a partitioned table that already contains the 2021-01 partition, the existing 2021-01 partition is also deleted.
If you set this parameter to true, DLI does not delete existing partitions before the overwrite starts; only the partitions that receive data at runtime are overwritten.
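A hedged PySpark sketch of the dynamic-overwrite behavior; the table and column names (sales_by_month, staging_sales, month) are illustrative, and whether this DLI property can be changed on a running session or must be supplied at job submission depends on your setup:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dynamic-overwrite-demo").getOrCreate()

    # Enable dynamic partition overwrite so that only partitions receiving
    # new data are replaced; other existing partitions are left untouched.
    spark.conf.set("spark.sql.dynamicPartitionOverwrite.enabled", "true")

    # Only the month='2021-02' partition of sales_by_month is rewritten;
    # the existing 2021-01 partition is kept.
    spark.sql("""
        INSERT OVERWRITE TABLE sales_by_month PARTITION (month)
        SELECT id, amount, month FROM staging_sales WHERE month = '2021-02'
    """)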
spark.sql.files.maxPartitionBytes
134217728
Maximum number of bytes to be packed into a single partition when a file is read.
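At the default of 134217728 bytes (128 MB), a single 1 GB file is read as roughly eight input partitions. A hedged PySpark sketch of raising the limit to reduce the number of small read tasks; the OBS path is illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("max-partition-bytes-demo").getOrCreate()

    # Pack larger input splits (256 MB) into each read partition.
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))

    df = spark.read.parquet("obs://my-bucket/warehouse/events/")  # illustrative path
    print(df.rdd.getNumPartitions())  # roughly total input size / 256 MB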
spark.sql.badRecordsPath
-
Path of bad records.
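A minimal sketch of pointing this setting at a directory you own, assuming the property can be supplied as a session configuration; the OBS bucket name is illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bad-records-demo").getOrCreate()

    # Directory where bad records are stored instead of failing the job
    # (path is illustrative).
    spark.conf.set("spark.sql.badRecordsPath", "obs://my-bucket/bad-records/")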
dli.sql.sqlasync.enabled
true
Whether DDL and DCL statements are executed asynchronously. The value true indicates that asynchronous execution is enabled.
dli.sql.job.timeout
-
Job running timeout interval, in seconds. If the job times out, it will be canceled.
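These two dli.sql.* settings are DLI job-level configuration rather than standard Spark properties. A hedged sketch of the key-value pairs only, expressed as a plain Python dict; how the values are supplied (SQL editor settings or an API/SDK call) depends on your client and is not shown here:

    # Illustrative job-level configuration for a DLI SQL job submission.
    dli_sql_conf = {
        "dli.sql.sqlasync.enabled": "true",  # run DDL and DCL statements asynchronously
        "dli.sql.job.timeout": "3600",       # cancel the job if it runs longer than 1 hour
    }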
spark.sql.keep.distinct.expandThreshold
-
Parameter description:
When Spark runs multidimensional analysis queries that combine the cube structure with the count(distinct) function, the typical execution plan uses the expand operator, which can cause severe query inflation (data expansion). To avoid this issue, you are advised to configure the following settings:
spark.sql.keep.distinct.expandThreshold:
Default value: -1, indicating that Spark's default expand operator is used.
Setting the parameter to a specific value, such as 512, defines the threshold for query inflation. If the threshold is exceeded, the count(distinct) function will use the distinct aggregation operator to execute the query instead of the expand operator.
spark.sql.distinct.aggregator.enabled: whether to forcibly use the distinct aggregation operator. If set to true, spark.sql.keep.distinct.expandThreshold is ignored.
Use case: Queries with multidimensional analysis that use the cube structure and may include multiple count(distinct) functions, as well as the cube or rollup operator.
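A hedged PySpark sketch of the scenario and the two settings; the table and column names (sales, user_id, order_id) are illustrative, and whether these DLI properties take effect when set on a running session or must be supplied at job submission depends on your setup:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("distinct-expand-demo").getOrCreate()

    # Switch count(distinct) under cube/rollup to the distinct aggregation
    # operator once the expansion threshold is exceeded.
    spark.conf.set("spark.sql.keep.distinct.expandThreshold", "512")
    # Alternatively, force the distinct aggregation operator unconditionally
    # (the threshold above is then ignored).
    # spark.conf.set("spark.sql.distinct.aggregator.enabled", "true")

    spark.sql("""
        SELECT region, product,
               count(DISTINCT user_id)  AS users,
               count(DISTINCT order_id) AS orders
        FROM sales
        GROUP BY region, product
        WITH CUBE
    """).show()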
This configuration item is used to set the isolation level of SQL statements. Different isolation levels (job, user, project, queue) determine whether SQL jobs are executed by independent Spark Drivers and Executors or share existing ones.
job:
Each SQL job will independently start a Spark Driver and a set of Executors for execution.
This is suitable for jobs that require complete isolation, ensuring that each job's execution environment is entirely independent.
user:
If a Spark Driver started by this user is already running and can still accept new submissions, the new SQL job is submitted to that existing Driver for execution.
If no such Driver exists, or the existing Driver cannot accept more submissions, a new Spark Driver is started for this user.
This is suitable for scenarios where multiple jobs from the same user need to share resources.
project:
If a Spark Driver started by this project is already running and can still accept new submissions, the new SQL job is submitted to that existing Driver for execution.
If no such Driver exists, or the existing Driver cannot accept more submissions, a new Spark Driver is started for this project.
This is suitable for scenarios where multiple jobs within the same project need to share resources.
queue:
If a Spark Driver started by this queue is already running and can still accept new submissions, the new SQL job is submitted to that existing Driver for execution.
If no such Driver exists, or the existing Driver cannot accept more submissions, a new Spark Driver is started for this queue.
This is suitable for scenarios where resources are managed by queues, allowing for more granular control over resource allocation.
Note
The maximum number of Spark Drivers that can be started (maximum Spark Driver instances) and the maximum number of concurrent SQL queries that can be executed by each Spark Driver (maximum concurrency per Spark Driver instance) can be configured in the queue properties.
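A minimal Python sketch of the Driver-reuse decision the four isolation levels describe; every name here (sharing_key, pick_driver, can_accept_more_jobs) is hypothetical and not a DLI API:

    # Illustrative only: how an isolation level maps a job to a sharing key
    # and decides whether to reuse a running Spark Driver or start a new one.
    def sharing_key(isolation_level, job):
        if isolation_level == "job":
            return None                        # never shared: always start a new Driver
        if isolation_level == "user":
            return ("user", job["user"])
        if isolation_level == "project":
            return ("project", job["project"])
        if isolation_level == "queue":
            return ("queue", job["queue"])
        raise ValueError(f"unknown isolation level: {isolation_level}")

    def pick_driver(isolation_level, job, running_drivers):
        """Return an existing Driver to submit to, or None to start a new one."""
        key = sharing_key(isolation_level, job)
        if key is None:
            return None
        for driver in running_drivers.get(key, []):
            if driver.can_accept_more_jobs():  # e.g. below its maximum concurrency
                return driver
        return None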