Configuring Parameters Rapidly
Overview
This section describes how to quickly configure common parameters and lists parameters that are not recommended to be modified when Spark2x is used.
Common Parameters to Be Configured
Some parameters have been adapted during cluster installation. However, the following parameters need to be adjusted based on application scenarios. Unless otherwise specified, the following parameters are configured in the spark-defaults.conf file on the Spark2x client.
Configuration Item | Description | Default Value |
---|---|---|
spark.sql.parquet.compression.codec | Sets the compression format of a non-partitioned Parquet table. Set this parameter in the spark-defaults.conf configuration file on the JDBCServer server. | snappy |
spark.dynamicAllocation.enabled | Indicates whether to use dynamic resource scheduling, which adjusts the number of executors registered with the application according to the workload. Currently, this parameter is valid only in Yarn mode. The default value for JDBCServer is true, and that for the client is false. | false |
spark.executor.memory | Indicates the memory size used by each executor process. The value is a string in the same format as JVM memory strings (for example, 512m or 2g). | 4G |
spark.sql.autoBroadcastJoinThreshold | Indicates the maximum size, in bytes, of a table that can be broadcast to all worker nodes when two tables are joined. | 10485760 |
spark.yarn.queue | Specifies the Yarn queue where JDBCServer resides. Set the queue in the spark-defaults.conf configuration file on the JDBCServer server. | default |
spark.driver.memory | Indicates the memory used by the driver process, that is, the process in which SparkContext is initialized (for example, 512m or 2g). In a large cluster, you are advised to set this parameter to a value from 32 GB to 64 GB. | 4G |
spark.yarn.security.credentials.hbase.enabled | Indicates whether to enable the function of obtaining HBase tokens. If the Spark on HBase function is required and a security cluster is configured, set this parameter to true. Otherwise, set this parameter to false. | false |
spark.serializer | Specifies the class used to serialize objects that are sent over the network or need to be cached. The default Java serialization works with any Serializable Java object but is slow. Therefore, you are advised to use org.apache.spark.serializer.KryoSerializer and configure Kryo serialization. The value can be any subclass of org.apache.spark.serializer.Serializer. | org.apache.spark.serializer.JavaSerializer |
spark.executor.cores | Indicates the number of cores used by each executor. Set this parameter in standalone mode and Mesos coarse-grained mode. When there are sufficient cores, an application can run multiple executors on the same worker. Otherwise, each application can run only one executor on each worker. | 1 |
spark.shuffle.service.enabled | Indicates whether to enable the shuffle service, a long-running auxiliary service in NodeManager that improves shuffle computing performance. | false |
spark.sql.adaptive.enabled | Indicates whether to enable the adaptive execution framework. | false |
spark.executor.memoryOverhead | Indicates the off-heap memory to be allocated to each executor, in MB. This memory accounts for overheads such as VM overhead, interned strings, and other native overhead. The value increases with the executor size (usually 6% to 10% of it). | 1GB |
spark.streaming.kafka.direct.lifo | Indicates whether to enable the LIFO function of Kafka. | false |
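
The parameters in the table above are written to spark-defaults.conf as one key and value per line, separated by whitespace. The following Python sketch shows one way to render such a fragment and to estimate the 6% to 10% range mentioned for spark.executor.memoryOverhead. The override values and the helper names (`memory_to_mib`, `overhead_range_mib`, `render_spark_defaults`) are hypothetical and chosen only for illustration; they are not recommended settings.

```python
import re
from typing import Dict, Tuple

# Hypothetical overrides for illustration only; choose values for your workload.
OVERRIDES = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.executor.memory": "8g",
    "spark.sql.adaptive.enabled": "true",
}

# Multipliers that convert a unit suffix to MiB.
_UNITS_MIB = {"k": 1 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}


def memory_to_mib(value: str) -> float:
    """Convert a JVM-style memory string such as '512m' or '4G' to MiB."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized memory string: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS_MIB[unit]


def overhead_range_mib(executor_memory: str) -> Tuple[int, int]:
    """Return 6% and 10% of the executor size, rounded to whole MiB."""
    size = memory_to_mib(executor_memory)
    return round(size * 0.06), round(size * 0.10)


def render_spark_defaults(overrides: Dict[str, str]) -> str:
    """Render key/value pairs as spark-defaults.conf lines, sorted by key."""
    return "\n".join(f"{key} {value}" for key, value in sorted(overrides.items()))


if __name__ == "__main__":
    print(render_spark_defaults(OVERRIDES))
    low, high = overhead_range_mib(OVERRIDES["spark.executor.memory"])
    print(f"# overhead estimate for the executor: {low} to {high} MiB")
```

The rendered lines can be appended to the spark-defaults.conf file on the Spark2x client. The overhead estimate is only a starting point for sizing and should be compared with the actual memory usage of the application.
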
Parameters Not Recommended to Be Modified
The following parameters have been adapted during cluster installation. You are not advised to modify them.
Configuration Item | Description | Default Value or Configuration Example |
---|---|---|
spark.password.factory | Selects the password parsing mode. | org.apache.spark.om.util.FIPasswordFactory |
spark.ssl.ui.protocol | Sets the SSL protocol of the UI. | TLSv1.2 |
spark.yarn.archive | Specifies an archive that contains the Spark JAR files to be distributed to the Yarn cache. If this parameter is set, its value replaces spark.yarn.jars and the archive is used in the containers of all applications. The archive should contain the JAR files in its root directory. The archive can also be hosted on HDFS to speed up file distribution. | hdfs://hacluster/user/spark2x/jars/8.1.2.2/spark-archive-2x.zip Note: The version 8.1.2.2 is used as an example. Replace it with the actual version number. |
spark.yarn.am.extraJavaOptions | Indicates a string of extra JVM options to pass to the YARN ApplicationMaster in client mode. Use spark.driver.extraJavaOptions in cluster mode. | -Dlog4j.configuration=./__spark_conf__/__hadoop_conf__/log4j-executor.properties -Djava.security.auth.login.config=./__spark_conf__/__hadoop_conf__/jaas-zk.conf -Dzookeeper.server.principal=zookeeper/hadoop.<system domain name> -Djava.security.krb5.conf=./__spark_conf__/__hadoop_conf__/kdc.conf -Djdk.tls.ephemeralDHKeySize=2048 |
spark.shuffle.servicev2.port | Indicates the port on which the shuffle service listens for requests for obtaining data. | 27338 |
spark.ssl.historyServer.enabled | Sets whether the history server uses SSL. | true |
spark.files.overwrite | Indicates whether to overwrite a file added through SparkContext.addFile() when the target file exists and its content does not match that of the source file. | false |
spark.yarn.cluster.driver.extraClassPath | Indicates the extraClassPath of the driver in Yarn-cluster mode. Set the parameter to the path and parameters of the server. | ${BIGDATA_HOME}/common/runtime/security |
spark.driver.extraClassPath | Indicates the extra class path entries attached to the class path of the driver. | ${BIGDATA_HOME}/common/runtime/security |
spark.yarn.dist.innerfiles | Sets the files that need to be uploaded to HDFS from Spark in Yarn mode. | /Spark_path/spark/conf/s3p.file,/Spark_path/spark/conf/locals3.jceks Spark_path is the installation path of the Spark client. |
spark.sql.bigdata.register.dialect | Registers the SQL parser. | org.apache.spark.sql.hbase.HBaseSQLParser |
spark.shuffle.manager | Indicates the data processing mode. There are two implementation modes: sort and hash. The sort shuffle is more memory-efficient and is the default option in Spark 1.2 and later versions. | SORT |
spark.deploy.zookeeper.url | Indicates the address of ZooKeeper. Multiple addresses are separated by commas (,). | For example: host1:2181,host2:2181,host3:2181 |
spark.broadcast.factory | Indicates the broadcast mode. | org.apache.spark.broadcast.TorrentBroadcastFactory |
spark.sql.session.state.builder | Specifies the session state builder class. | org.apache.spark.sql.hive.FIHiveACLSessionStateBuilder |
spark.executor.extraLibraryPath | Sets the special library path used when the executor JVM is started. | ${BIGDATA_HOME}/FusionInsight_HD_8.1.2.2/install/FusionInsight-Hadoop-3.1.1/hadoop/lib/native |
spark.ui.customErrorPage | Indicates whether to display the custom error information page when an error occurs on the page. | true |
spark.httpdProxy.enable | Indicates whether to use the httpd proxy. | true |
spark.ssl.ui.enabledAlgorithms | Sets the SSL cipher suites enabled for the UI. | TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_DHE_RSA_WITH_AES_256_GCM_SHA384,TLS_DHE_DSS_WITH_AES_256_GCM_SHA384,TLS_DHE_RSA_WITH_AES_128_GCM_SHA256,TLS_DHE_DSS_WITH_AES_128_GCM_SHA256 |
spark.ui.logout.enabled | Indicates whether to display the logout button on the web UI of the Spark component. | true |
spark.security.hideInfo.enabled | Indicates whether to hide sensitive information on the UI. | true |
spark.yarn.cluster.driver.extraLibraryPath | Indicates the extraLibraryPath of the driver in Yarn-cluster mode. Set this parameter to the path and parameters of the server. | ${BIGDATA_HOME}/FusionInsight_HD_8.1.2.2/install/FusionInsight-Hadoop-3.1.1/hadoop/lib/native |
spark.driver.extraLibraryPath | Sets a special library path for starting the driver JVM. | ${DATA_NODE_INSTALL_HOME}/hadoop/lib/native |
spark.ui.killEnabled | Allows stages and jobs to be stopped on the web UI. | true |
spark.yarn.access.hadoopFileSystems | Spark can access multiple NameService instances. If there are multiple NameService instances, set this parameter to all the NameService instances and separate them with commas (,). | hdfs://hacluster,hdfs://hacluster |
spark.yarn.cluster.driver.extraJavaOptions | Indicates extra JVM options passed to the driver in Yarn-cluster mode, for example, GC settings and logging. Do not set Spark attributes or heap size using this option. Instead, set Spark attributes using the SparkConf object or the spark-defaults.conf file specified when the spark-submit script is called. Set heap size using spark.driver.memory. | -Xloggc:<LOG_DIR>/gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=10M -Dlog4j.configuration=./__spark_conf__/__hadoop_conf__/log4j-executor.properties -Djava.security.auth.login.config=./__spark_conf__/__hadoop_conf__/jaas-zk.conf -Dzookeeper.server.principal=zookeeper/hadoop.<system domain name> -Djava.security.krb5.conf=./__spark_conf__/__hadoop_conf__/kdc.conf -Djetty.version=x.y.z -Dorg.xerial.snappy.tempdir=${BIGDATA_HOME}/tmp/spark2x_app -Dcarbon.properties.filepath=./__spark_conf__/__hadoop_conf__/carbon.properties -Djdk.tls.ephemeralDHKeySize=2048 |
spark.driver.extraJavaOptions | Indicates a series of extra JVM options passed to the driver. | -Xloggc:${SPARK_LOG_DIR}/indexserver-omm-%p-gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=512M -XX:MaxMetaspaceSize=512M -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=10M -XX:OnOutOfMemoryError='kill -9 %p' -Djetty.version=x.y.z -Dorg.xerial.snappy.tempdir=${BIGDATA_HOME}/tmp/spark2x/JDBCServer/snappy_tmp -Djava.io.tmpdir=${BIGDATA_HOME}/tmp/spark2x/JDBCServer/io_tmp -Dcarbon.properties.filepath=${SPARK_CONF_DIR}/carbon.properties -Djdk.tls.ephemeralDHKeySize=2048 -Dspark.ssl.keyStore=${SPARK_CONF_DIR}/child.keystore #{java_stack_prefer} |
spark.eventLog.overwrite | Indicates whether to overwrite any existing file. | false |
spark.eventLog.dir | Indicates the directory for logging Spark events if spark.eventLog.enabled is set to true. In this directory, Spark creates a subdirectory for each application and logs events of the application in the subdirectory. You can also set a unified address similar to the HDFS directory so that the History Server can read historical files. | hdfs://hacluster/spark2xJobHistory2x |
spark.random.port.min | Sets the minimum random port. | 22600 |
spark.authenticate | Indicates whether Spark authenticates its internal connections. If the application is not running on Yarn, see spark.authenticate.secret. | true |
spark.random.port.max | Sets the maximum random port. | 22899 |
spark.eventLog.enabled | Indicates whether to log Spark events, which are used to reconstruct the web UI after the application execution is complete. | true |
spark.executor.extraJavaOptions | Indicates extra JVM options passed to the executor, for example, GC settings and logging. Do not set Spark attributes or heap size using this option. | -Xloggc:<LOG_DIR>/gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=10M -Dlog4j.configuration=./log4j-executor.properties -Djava.security.auth.login.config=./jaas-zk.conf -Dzookeeper.server.principal=zookeeper/hadoop.<system domain name> -Djava.security.krb5.conf=./kdc.conf -Dcarbon.properties.filepath=./carbon.properties -Xloggc:<LOG_DIR>/gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=10M -Dlog4j.configuration=./__spark_conf__/__hadoop_conf__/log4j-executor.properties -Djava.security.auth.login.config=./__spark_conf__/__hadoop_conf__/jaas-zk.conf -Dzookeeper.server.principal=zookeeper/hadoop.<system domain name> -Djava.security.krb5.conf=./__spark_conf__/__hadoop_conf__/kdc.conf -Dcarbon.properties.filepath=./__spark_conf__/__hadoop_conf__/carbon.properties -Djdk.tls.ephemeralDHKeySize=2048 |
spark.sql.authorization.enabled | Indicates whether to enable authentication for the Hive client. | true |
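
Because the parameters in the table above are not meant to be changed, it can help to check a client configuration for accidental edits. The following Python sketch parses the text of a spark-defaults.conf file and reports entries that differ from the installation-time values. It assumes that keys and values are separated by whitespace or by an equal sign. The `PROTECTED_DEFAULTS` subset, the sample text, and the helper names are hypothetical and used only for illustration.

```python
import re
from typing import Dict, List

# A few installation-time values copied from the table above; extend as needed.
PROTECTED_DEFAULTS = {
    "spark.ssl.ui.protocol": "TLSv1.2",
    "spark.authenticate": "true",
    "spark.eventLog.enabled": "true",
    "spark.random.port.min": "22600",
    "spark.random.port.max": "22899",
}

# Hypothetical file content used to demonstrate the check.
SAMPLE_CONF = """
# client-side settings
spark.executor.memory 4G
spark.ssl.ui.protocol TLSv1.1
spark.authenticate=true
"""


def parse_spark_defaults(text: str) -> Dict[str, str]:
    """Parse configuration text into a dict, skipping blank and comment lines."""
    conf = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: the key ends at the first run of whitespace or '='.
        parts = re.split(r"[\s=]+", line, maxsplit=1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def find_modified(conf: Dict[str, str], protected: Dict[str, str]) -> List[str]:
    """List protected keys whose configured value differs from the expected one."""
    return [
        f"{key}: expected {expected!r}, found {conf[key]!r}"
        for key, expected in sorted(protected.items())
        if key in conf and conf[key] != expected
    ]


if __name__ == "__main__":
    for problem in find_modified(parse_spark_defaults(SAMPLE_CONF), PROTECTED_DEFAULTS):
        print(problem)
```

Keys that are absent from the file are not reported, because this sketch cannot tell which value the cluster applies in that case.
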