Submitting Spark Application

There are different ways to submit your application to a cluster, but the most common is to use spark-submit.
spark-submit
spark-submit is a command-line tool provided by Apache Spark for submitting Spark applications to a cluster. It is used to launch applications on a standalone Spark cluster, a Hadoop YARN cluster, a Mesos cluster, or a Kubernetes cluster.
The spark-submit tool takes a JAR file or a Python file as input, along with the application's configuration options, and submits the application to the cluster. The configuration options set parameters for the application, such as the number of executors, the number of cores per executor, and the amount of memory allocated to each executor.
The general structure of the spark-submit command is as follows:
./bin/spark-submit [options] <application-jar> [application-arguments]
spark-submit: The name of the command-line tool for submitting Spark applications.
[options]: Optional command-line options that configure the behavior of the spark-submit tool and the Spark application being submitted.
<application-jar>: The path to the JAR file containing the Spark application code. This JAR file must be created beforehand using a build tool. (A JAR (Java Archive) file is a package file format used to aggregate Java class files, associated metadata, and resources such as images, sound files, and other supporting files into a single file. It is a standard format used for distributing Java applications and libraries.)
[application-arguments]: Optional command-line arguments that are passed to the Spark application's main method.
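The structure above can also be assembled programmatically, which is handy when a scheduler or wrapper script launches the job. The following is a minimal sketch, not part of Spark itself: the helper name build_spark_submit and the jar name spark-examples.jar are placeholders chosen for illustration.

```python
import shlex


def build_spark_submit(options, application, app_args=()):
    """Assemble a spark-submit command as an argument list.

    options: list of (flag, value) pairs such as ("--master", "yarn");
             use value=None for flags that take no value, e.g. --verbose.
    """
    cmd = ["./bin/spark-submit"]
    for flag, value in options:
        cmd.append(flag)
        if value is not None:
            cmd.append(str(value))
    cmd.append(application)                 # <application-jar> or a .py file
    cmd.extend(str(a) for a in app_args)    # [application-arguments]
    return cmd


cmd = build_spark_submit(
    options=[
        ("--master", "yarn"),
        ("--class", "org.apache.spark.examples.SparkPi"),
    ],
    application="spark-examples.jar",       # placeholder jar name
    app_args=[80],
)
print(shlex.join(cmd))
```

Keeping the command as a list (rather than one string) means it can be passed directly to subprocess.run without shell quoting problems.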
spark-submit options
This section covers some of the common options and configurations, including the ones specific to Scala and Python applications. To list every available option, run:
./bin/spark-submit --help
Submit Scala Application
./bin/spark-submit \
--master yarn \
--class org.nppntraining.WordCountExample \
user-data-analysis.jar
Submit PySpark Application
./bin/spark-submit \
--master yarn \
WordCountExample.py
Commonly used options
Here are some examples of commonly used options:
--master <master-url>: Specifies the cluster manager used to run your application. Spark currently supports YARN, Mesos, Kubernetes, standalone, and local.
--class <main-class>: Specifies the fully qualified name of the main class for the Spark application. This option is required for Java or Scala applications, but is not needed for PySpark.
./bin/spark-submit \
--master yarn \
--class org.apache.spark.examples.SparkPi \
/home/npntraining/apache-spark-2.4.0/jars/spark-examples_versionxx.jar 80
Cluster Managers
With the --master option, you specify which cluster manager runs your application. Spark currently supports YARN, Mesos, Kubernetes, standalone, and local mode.
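Each cluster manager is selected by the form of the value passed to --master: yarn, spark://host:port for standalone, mesos://host:port, k8s://host:port, and local or local[N] for local mode. As a rough illustration, the sketch below maps a master value to its cluster manager. The function name cluster_manager and the host names in the example are hypothetical.

```python
def cluster_manager(master):
    """Return the cluster manager implied by a --master value."""
    if master == "yarn":
        return "YARN"
    if master.startswith("spark://"):
        return "standalone"
    if master.startswith("mesos://"):
        return "Mesos"
    if master.startswith("k8s://"):
        return "Kubernetes"
    if master == "local" or master.startswith("local["):
        return "local"
    raise ValueError(f"unrecognized master URL: {master!r}")


# Host names and ports below are placeholders.
for value in ["yarn", "spark://master-host:7077", "k8s://https://api-host:6443", "local[4]"]:
    print(value, "->", cluster_manager(value))
```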
Deployment Modes
With --deploy-mode, you specify where the driver program of the Spark application runs. Spark supports the cluster and client deployment modes; client is the default.
Cluster: In cluster mode, the driver runs on one of the worker nodes, and that node is shown as the driver in the Spark Web UI of your application. Cluster mode is typically used to run production jobs.
Client: In client mode, the driver runs locally on the machine you submit the application from. Only the driver runs locally; the executors run on nodes in the cluster.
Driver program: This is the process that runs your bundled application's main method and drives the entire application. The driver talks to the cluster manager (for example, the Spark Master in standalone mode) to submit your job to the cluster.
Driver and Executor Resources (Cores & Memory)
While submitting an application, you can also specify how much memory and how many cores to give to the driver and the executors.
--executor-memory <memory>: The amount of memory to allocate to each executor in the Spark application. The value can be specified in units such as g (gigabytes) or m (megabytes).
--executor-cores: The number of CPU cores to use for each executor process.
--total-executor-cores: The total number of executor cores to use (standalone and Mesos only).
--driver-memory: The memory to be used by the Spark driver.
--driver-cores: The CPU cores to be used by the Spark driver.
--num-executors: The total number of executors to use.
./bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 8g \
--executor-memory 16g \
--executor-cores 2 \
--class org.apache.spark.examples.SparkPi \
/home/npntraining/apache-spark-2.4.0/jars/spark-examples_versionxx.jar 80
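To get a feel for what these resource options request from the cluster, the arithmetic can be sketched as follows. This assumes the memory overhead is left at the default that Spark documents for spark.executor.memoryOverhead (10% of executor memory, with a 384 MiB minimum); the executor count of 10 is a hypothetical value, since the command above does not set --num-executors.

```python
def parse_memory_mib(value):
    """Convert a size such as '16g' or '512m' to MiB (only g and m are handled)."""
    units = {"m": 1, "g": 1024}
    return int(value[:-1]) * units[value[-1].lower()]


def executor_container_mib(executor_memory, overhead_factor=0.10, overhead_min_mib=384):
    """Executor memory plus the assumed default overhead, in MiB."""
    heap = parse_memory_mib(executor_memory)
    overhead = max(int(heap * overhead_factor), overhead_min_mib)
    return heap + overhead


num_executors = 10      # hypothetical; not set in the command above
executor_cores = 2      # from --executor-cores 2
per_executor = executor_container_mib("16g")   # from --executor-memory 16g

print("MiB requested per executor:", per_executor)
print("total executor cores:", num_executors * executor_cores)
print("total executor MiB:", num_executors * per_executor)
```

The point of the sketch is that each executor asks the cluster manager for more than the --executor-memory value alone, so node sizes should be planned with the overhead included.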
Other Options
--files: A comma-separated list of files your application needs at runtime. Usually, these are files from your resource folder. Spark uploads all of these files to the cluster along with the application.
--verbose: Prints verbose information while submitting. For example, it shows all the configurations the Spark application uses.
Example: The command below submits the application to the YARN cluster manager in cluster deployment mode, with 16g of driver memory, and 32g of memory and 4 cores for each executor.
./bin/spark-submit \
--verbose \
--master yarn \
--deploy-mode cluster \
--driver-memory 16g \
--executor-memory 32g \
--executor-cores 4 \
--files /path/log4j.properties,/path/file2.conf,/path/file3.json \
--class org.apache.spark.examples.SparkPi \
/home/npntraining/apache-spark-2.4.0/jars/spark-examples_versionxx.jar 80
Spark Submit Configurations
spark-submit accepts configuration properties through --conf key=value. These properties are used to set application configuration, shuffle parameters, and runtime configuration.
| Configuration Key | Description |
| --- | --- |
| spark.sql.shuffle.partitions | Number of partitions to create for wide (shuffle) transformations such as joins and aggregations. |
| spark.executor.memoryOverhead | The amount of additional memory to be allocated per executor process in cluster mode. It typically covers JVM overheads. |
| spark.serializer | The class used to serialize objects: org.apache.spark.serializer.JavaSerializer (default) or org.apache.spark.serializer.KryoSerializer. |
| spark.sql.files.maxPartitionBytes | The maximum number of bytes to pack into a single partition when reading files. Default 128MB. |
| spark.dynamicAllocation.enabled | Specifies whether to dynamically increase or decrease the number of executors based on the workload. Default false. |
| spark.dynamicAllocation.minExecutors | The minimum number of executors to use when dynamic allocation is enabled. |
| spark.dynamicAllocation.maxExecutors | The maximum number of executors to use when dynamic allocation is enabled. |
| spark.executor.extraJavaOptions | Extra JVM options to pass to the executors (see the example below). |
./bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--conf "spark.sql.shuffle.partitions=20000" \
--conf "spark.executor.memoryOverhead=5244" \
--conf "spark.memory.fraction=0.8" \
--conf "spark.memory.storageFraction=0.2" \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.sql.files.maxPartitionBytes=168435456" \
--conf "spark.dynamicAllocation.minExecutors=1" \
--conf "spark.dynamicAllocation.maxExecutors=200" \
--conf "spark.dynamicAllocation.enabled=true" \
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
--files /path/log4j.properties,/path/file2.conf,/path/file3.json \
--class org.apache.spark.examples.SparkPi \
/home/npntraining/apache-spark-2.4.0/jars/spark-examples_versionxx.jar 80
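When the list of properties grows, it can be easier to keep them in one place and generate the --conf flags from it. The sketch below shows one way to do that; the helper name conf_flags is hypothetical, and only a subset of the properties from the command above is included.

```python
import shlex


def conf_flags(conf):
    """Turn a dict of Spark properties into a flat list of --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


conf = {
    "spark.sql.shuffle.partitions": 20000,
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # This value contains a space, so it must stay one argument.
    "spark.executor.extraJavaOptions": "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
}

args = conf_flags(conf)
# shlex.join adds shell quoting where needed, e.g. around the JVM options.
print(shlex.join(args))
```

Because each key=value pair stays a single list element, values with spaces such as the JVM options survive intact, which is the same reason the shell command above wraps them in quotes.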
Reference: http://spark.apache.org/docs/latest/submitting-applications.html



