Configuring Parameters Rapidly

Overview

This section describes how to quickly configure common parameters and lists parameters that are not recommended to be modified when Spark2x is used.

Common parameters to be configured

Some parameters are already adapted during cluster installation. However, the following parameters need to be adjusted based on your application scenario. Unless otherwise specified, they are configured in the spark-defaults.conf file on the Spark2x client.
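As an illustration of what such entries look like, the following minimal sketch renders a few settings in the spark-defaults.conf layout (one property per line, key and value separated by whitespace). The helper name render_spark_defaults and the chosen values are assumptions made for this example only; they are not part of Spark2x, and the values are not tuning recommendations.

```python
def render_spark_defaults(settings: dict) -> str:
    """Render settings as spark-defaults.conf lines: key, a space, then value."""
    return "".join(f"{key} {value}\n" for key, value in settings.items())


# Hypothetical example values, taken from the parameter names in Table 1.
example_settings = {
    "spark.executor.memory": "4G",
    "spark.executor.cores": 1,
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.autoBroadcastJoinThreshold": 10485760,
}

if __name__ == "__main__":
    # Print the fragment so it can be reviewed before it is copied into
    # the spark-defaults.conf file on the client.
    print(render_spark_defaults(example_settings), end="")
```

Generating the fragment rather than editing the file by hand keeps the reviewed settings in one place; whether that is worthwhile depends on how many clients need to be configured.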

Table 1 Common parameters to be configured

Configuration Item

Description

Default Value

spark.sql.parquet.compression.codec

Used to set the compression format of a non-partitioned Parquet table.

Set this parameter in the spark-defaults.conf configuration file on the JDBCServer server.

snappy

spark.dynamicAllocation.enabled

Indicates whether to use dynamic resource scheduling, which adjusts the number of executors registered with the application based on the workload. Currently, this parameter takes effect only in Yarn mode.

The default value for JDBCServer is true, and that for the client is false.

false

spark.executor.memory

Indicates the memory size used by each executor process. The value is a string in the same format as JVM memory settings (for example, 512m or 2g).

4G

spark.sql.autoBroadcastJoinThreshold

Indicates the maximum size, in bytes, of a table that can be broadcast when two tables are joined.

  • When the size of a table involved in an SQL join is less than the value of this parameter, the system broadcasts that table.

  • If the value is set to -1, broadcast is not performed.

10485760

spark.yarn.queue

Specifies the Yarn queue where JDBCServer resides.

Set the queue in the spark-defaults.conf configuration file on the JDBCServer server.

default

spark.driver.memory

Indicates the memory used by the driver process, that is, the process in which SparkContext is initialized (for example, 512m or 2g). In a large cluster, you are advised to set this parameter to a value from 32 GB to 64 GB.

4G

spark.yarn.security.credentials.hbase.enabled

Indicates whether to enable the function of obtaining HBase tokens. If the Spark on HBase function is required and a security cluster is configured, set this parameter to true. Otherwise, set this parameter to false.

false

spark.serializer

Specifies the class used to serialize objects that are sent over the network or need to be cached.

Java serialization, the default, works with any Serializable Java object but is slow. You are therefore advised to use org.apache.spark.serializer.KryoSerializer and configure Kryo serialization. The value can be any subclass of org.apache.spark.serializer.Serializer.

org.apache.spark.serializer.JavaSerializer

spark.executor.cores

Indicates the number of cores used by each executor.

In standalone mode and Mesos coarse-grained mode, setting this parameter allows an application to run multiple executors on the same worker, provided that the worker has enough cores. Otherwise, each application runs only one executor on each worker.

1

spark.shuffle.service.enabled

Indicates whether to enable the shuffle service, a long-running auxiliary service in NodeManager that improves shuffle performance.

false

spark.sql.adaptive.enabled

Indicates whether to enable the adaptive execution framework.

false

spark.executor.memoryOverhead

Indicates the off-heap memory to be allocated to each executor, in MB.

This memory covers VM overheads, interned strings, and other native overheads. The value increases with the executor size (usually 6% to 10% of the executor memory).

1GB

spark.streaming.kafka.direct.lifo

Indicates whether to enable the LIFO function of Kafka.

false
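
The memory and broadcast rows in Table 1 can be sanity-checked with a short sketch. The code below assumes that the memory requested per executor is roughly spark.executor.memory plus spark.executor.memoryOverhead, uses the 6% to 10% range quoted in the table, and mirrors the table's wording for the broadcast rule (size less than the threshold, -1 disables broadcast). All function names are hypothetical helpers for illustration, and the exact boundary behavior inside Spark is not verified here.

```python
import re

# Multipliers that convert a size unit to MB.
_UNITS_MB = {"k": 1 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}


def parse_size_mb(value: str) -> float:
    """Convert a JVM-style size string such as '512m', '4G' or '1GB' to MB."""
    match = re.fullmatch(r"\s*(\d+)\s*([kmgt])b?\s*", value.lower())
    if match is None:
        raise ValueError(f"unrecognized size string: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS_MB[unit]


def container_memory_mb(executor_memory: str, overhead: str) -> float:
    """Estimate per-executor memory as executor memory plus overhead (assumption)."""
    return parse_size_mb(executor_memory) + parse_size_mb(overhead)


def overhead_range_mb(executor_memory: str, low: float = 0.06, high: float = 0.10):
    """Return the 6% to 10% overhead range quoted in Table 1, in MB."""
    heap = parse_size_mb(executor_memory)
    return heap * low, heap * high


def should_broadcast(table_size_bytes: int, threshold_bytes: int = 10485760) -> bool:
    """Apply the rule as worded in Table 1; a negative threshold disables broadcast."""
    if threshold_bytes < 0:
        return False
    return table_size_bytes < threshold_bytes


if __name__ == "__main__":
    # Defaults from Table 1: 4G executor memory and 1GB overhead.
    print(container_memory_mb("4G", "1GB"))
    print(overhead_range_mb("4G"))
    print(should_broadcast(5 * 1024 * 1024))
```

With the table defaults, the estimate is 4096 MB plus 1024 MB per executor, which is a useful figure to compare against the memory available to each Yarn container before raising spark.executor.memory.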