Submitting Spark Application

There are several ways to submit an application to a cluster, but the most common is the spark-submit tool.

spark-submit

spark-submit is a command-line tool provided by Apache Spark for submitting Spark applications to a cluster. It can launch applications on a standalone Spark cluster, a Hadoop YARN cluster, a Mesos cluster, or a Kubernetes cluster.

The spark-submit tool takes a JAR file or a Python file as input along with the application’s configuration options and submits the application to the cluster. The configuration options can be used to set various parameters for the application, such as the number of executor cores, the amount of memory allocated to each executor, and the number of executors.

The general structure of the spark-submit command is as follows:

./bin/spark-submit [options] <application-jar> [application-arguments]

spark-submit: The name of the command-line tool for submitting Spark applications.
[options]: Optional command-line options that configure the behavior of the spark-submit tool and the Spark application being submitted.
<application-jar>: The path to the JAR file containing the Spark application code, or to the .py file for a PySpark application. A JAR file must be built beforehand with a build tool. (A JAR (Java Archive) file is a package format that aggregates Java class files, associated metadata, and resources such as images and sound files into a single file. It is the standard format for distributing Java applications and libraries.)
[application-arguments]: Optional command-line arguments that are passed to the Spark application's main method.
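
The structure above maps directly onto an argument list. As a minimal sketch (the helper name build_submit_command and its parameters are hypothetical and not part of Spark), the following Python assembles the pieces in the order spark-submit expects: options first, then the application file, then the application's own arguments.

```python
from typing import Dict, List, Optional


def build_submit_command(
    application: str,
    options: Optional[Dict[str, str]] = None,
    app_args: Optional[List[str]] = None,
    spark_submit: str = "./bin/spark-submit",
) -> List[str]:
    """Assemble a spark-submit argument list (hypothetical helper, not a Spark API)."""
    command = [spark_submit]
    # [options]: each entry becomes a "--name value" pair placed before the application.
    for name, value in (options or {}).items():
        command += [f"--{name}", value]
    # <application-jar>: path to the JAR, or to the .py file for PySpark.
    command.append(application)
    # [application-arguments]: everything after the application goes to its main method.
    command += list(app_args or [])
    return command


cmd = build_submit_command(
    "user-data-analysis.jar",
    options={"master": "yarn", "class": "org.nppntraining.WordCountExample"},
)
print(" ".join(cmd))
```

On a machine with Spark installed, a list like this could be passed to subprocess.run(cmd, check=True) to launch the job.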

spark-submit options

Below are some of the common options and configurations, including options specific to Scala and Python applications. To list every available option, run:

./bin/spark-submit --help

Submit Scala Application

./bin/spark-submit \
--master yarn \
--class org.nppntraining.WordCountExample \
user-data-analysis.jar

Submit PySpark Application

./bin/spark-submit \
   --master yarn \
   WordCountExample.py

Commonly used options

Here are some examples of commonly used options:

--master <master-url>: Specifies the cluster manager to use to run your application. Spark currently supports YARN, Mesos, Kubernetes, standalone, and local.
--class <main-class>: Specifies the fully qualified name of the main class for the Spark application. This option is required for Java or Scala applications, but not needed for PySpark.

./bin/spark-submit \
    --master yarn \
    --class org.apache.spark.examples.SparkPi \
    /home/npntraining/apache-spark-2.4.0/jars/spark-examples_versionxx.jar 80

Cluster Managers

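Using the --master option, you specify which cluster manager runs your application. Spark currently supports YARN, Mesos, Kubernetes, standalone, and local. The value is a master URL such as yarn, spark://host:port, mesos://host:port, k8s://https://host:port, or local[*].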
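
To make the mapping from a --master value to a cluster manager concrete, here is a small illustrative sketch. The function name cluster_manager_for is hypothetical, the host names and ports are placeholders, and it only recognizes the common master URL forms.

```python
def cluster_manager_for(master: str) -> str:
    """Name the cluster manager implied by a --master value (illustrative helper)."""
    if master == "yarn":
        return "YARN"
    # local, local[4], and local[*] all run Spark inside a single JVM.
    if master == "local" or master.startswith("local["):
        return "local"
    prefixes = {
        "spark://": "standalone",
        "mesos://": "Mesos",
        "k8s://": "Kubernetes",
    }
    for prefix, name in prefixes.items():
        if master.startswith(prefix):
            return name
    raise ValueError(f"Unrecognized master URL: {master!r}")


# Placeholder hosts and ports, for illustration only.
for url in ["yarn", "local[*]", "spark://host:7077", "k8s://https://host:6443"]:
    print(url, "->", cluster_manager_for(url))
```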

Deployment Modes

Using --deploy-mode, you specify where the driver program of the Spark application runs. Spark supports the cluster and client deployment modes; client is the default.

Cluster: In cluster mode, the driver runs on one of the worker nodes, and that node is shown as the driver on the Spark Web UI of your application. Cluster mode is typically used to run production jobs.

Client: In client mode, the driver runs locally on the machine from which you submit the application. Only the driver runs locally; the executors run on nodes in the cluster.

Driver program: This is the process that runs your application's main method and drives the entire application. The driver talks to the cluster manager (for example, the Spark Master on a standalone cluster) to submit your job to the cluster.
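
As a rough sketch of the rule above, the snippet below reads --deploy-mode from a list of spark-submit options and reports where the driver would run. The names deploy_mode_of and DRIVER_LOCATION are hypothetical; the only Spark behavior assumed is that an omitted --deploy-mode means client mode.

```python
from typing import List

# Where the driver process ends up for each deployment mode.
DRIVER_LOCATION = {
    "client": "on the machine that ran spark-submit",
    "cluster": "on a worker node inside the cluster",
}


def deploy_mode_of(options: List[str]) -> str:
    """Return the --deploy-mode value from spark-submit options, or "client" if absent."""
    if "--deploy-mode" not in options:
        return "client"  # spark-submit falls back to client mode
    index = options.index("--deploy-mode")
    if index + 1 >= len(options):
        raise ValueError("--deploy-mode needs a value")
    mode = options[index + 1]
    if mode not in DRIVER_LOCATION:
        raise ValueError(f"--deploy-mode must be client or cluster, got {mode!r}")
    return mode


options = ["--master", "yarn", "--deploy-mode", "cluster"]
mode = deploy_mode_of(options)
print(f"{mode}: driver runs {DRIVER_LOCATION[mode]}")
```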

Driver and Executor Resources (Cores & Memory)

While submitting an application, you can also specify how much memory and how many cores to give to the driver and the executors.

--executor-memory <memory>: Specifies the amount of memory to allocate to each executor in the Spark application. The memory value can be specified in units such as g (gigabytes) or m (megabytes).
--executor-cores: Number of CPU cores to use for each executor process.
--total-executor-cores: The total number of executor cores to use.
--driver-memory: Memory to be used by the Spark driver.
--driver-cores: CPU cores to be used by the Spark driver.
--num-executors: The total number of executors to use.

./bin/spark-submit \
   --master yarn \
   --deploy-mode cluster \
   --driver-memory 8g \
   --executor-memory 16g \
   --executor-cores 2  \
   --class org.apache.spark.examples.SparkPi \
   /home/npntraining/apache-spark-2.4.0/jars/spark-examples_versionxx.jar 80
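
To get a feel for what such a request costs the cluster, the sketch below converts the memory strings to MiB and adds the per-executor overhead. It assumes the default overhead rule that Spark's documentation describes for YARN, max(384 MiB, 10% of executor memory); your Spark version or cluster manager may use a different rule. The executor count of 10 is hypothetical, since the command above does not set --num-executors, and the helper names are illustrative.

```python
def parse_memory_mib(value: str) -> int:
    """Convert a JVM-style size such as "512m" or "16g" to MiB."""
    units = {"m": 1, "g": 1024, "t": 1024 * 1024}
    text = value.strip().lower()
    number, suffix = text[:-1], text[-1:]
    if suffix not in units or not number.isdigit():
        raise ValueError(f"Unsupported memory value: {value!r}")
    return int(number) * units[suffix]


def executor_container_mib(
    executor_memory: str,
    overhead_factor: float = 0.10,  # assumed default overhead factor
    min_overhead_mib: int = 384,    # assumed minimum overhead
) -> int:
    """Executor heap plus memory overhead, in MiB."""
    heap = parse_memory_mib(executor_memory)
    overhead = max(min_overhead_mib, int(heap * overhead_factor))
    return heap + overhead


num_executors = 10  # hypothetical; not set in the command above
per_executor = executor_container_mib("16g")
print("MiB per executor container:", per_executor)
print("MiB across all executors:", per_executor * num_executors)
print("Total executor cores:", 2 * num_executors)  # --executor-cores 2
```

The point of the arithmetic is that each executor asks the cluster manager for more than its --executor-memory value, which matters when sizing executors against the memory available on a node.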

Other Options

--files: A comma-separated list of files to ship with the application. These are usually files from your resource folder, such as configuration files. Spark uploads all of them to the cluster.

--verbose: Displays verbose information. For example, it writes all the configurations the Spark application uses to the log.

Note: Files specified with --files are uploaded to the cluster.

Example: The command below submits the application to the YARN cluster manager in cluster deployment mode, with 16g of driver memory, and 32g of memory and 4 cores for each executor.

./bin/spark-submit \
   --verbose \
   --master yarn \
   --deploy-mode cluster \
   --driver-memory 16g \
   --executor-memory 32g \
   --executor-cores 4 \
   --files /path/log4j.properties,/path/file2.conf,/path/file3.json \
   --class org.apache.spark.examples.SparkPi \
   /home/npntraining/apache-spark-2.4.0/jars/spark-examples_versionxx.jar 80

Spark Submit Configurations

spark-submit accepts arbitrary Spark configuration properties through --conf, passed as key=value pairs. They are used to set application configurations, shuffle parameters, and runtime configurations.

Configuration Key	Description
spark.sql.shuffle.partitions	Number of partitions to create for wide shuffle transformations (joins and aggregations).
spark.executor.memoryOverhead	The amount of additional memory to be allocated per executor process; it is typically memory for JVM overheads.
spark.serializer	org.apache.spark.serializer.JavaSerializer (default) or org.apache.spark.serializer.KryoSerializer
spark.sql.files.maxPartitionBytes	The maximum number of bytes to pack into a single partition when reading files. Default 128MB.
spark.dynamicAllocation.enabled	Specifies whether to dynamically increase or decrease the number of executors based on the workload. Default false.
spark.dynamicAllocation.minExecutors	The minimum number of executors to use when dynamic allocation is enabled.
spark.dynamicAllocation.maxExecutors	The maximum number of executors to use when dynamic allocation is enabled.
spark.executor.extraJavaOptions	Extra JVM options to pass to the executors (see example below).

./bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--conf "spark.sql.shuffle.partitions=20000" \
--conf "spark.executor.memoryOverhead=5244" \
--conf "spark.memory.fraction=0.8" \
--conf "spark.memory.storageFraction=0.2" \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.sql.files.maxPartitionBytes=168435456" \
--conf "spark.dynamicAllocation.minExecutors=1" \
--conf "spark.dynamicAllocation.maxExecutors=200" \
--conf "spark.dynamicAllocation.enabled=true" \
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
--files /path/log4j.properties,/path/file2.conf,/path/file3.json \
--class org.apache.spark.examples.SparkPi \
/home/npntraining/apache-spark-2.4.0/jars/spark-examples_versionxx.jar 80
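
When a job carries this many settings, it can be easier to keep them in one mapping and expand them into --conf arguments. The following is a minimal sketch of that idea; the helper conf_args and the application name app.jar are hypothetical.

```python
from typing import Dict, List


def conf_args(conf: Dict[str, str]) -> List[str]:
    """Expand a mapping into repeated "--conf key=value" arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


conf = {
    "spark.sql.shuffle.partitions": "20000",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.executor.extraJavaOptions": "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
}

# "app.jar" is a placeholder application path.
command = ["./bin/spark-submit", "--master", "yarn", *conf_args(conf), "app.jar"]
print(command)
```

Because each key=value pair is a single list element, a value containing spaces (such as the JVM options) stays one argument when the list is passed to subprocess.run, so no shell quoting is needed.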

Reference: http://spark.apache.org/docs/latest/submitting-applications.html

