
Spark Submit Parameter Analysis

Ever wondered how to configure --num-executors, --executor-memory and --executor-cores spark config params for your cluster?

  • Lil bit theory: Let's see some key recommendations that will help us understand it better

  • Hands on: Next, we'll take an example cluster and come up with recommended numbers for these spark params

Lil bit theory:

  • Hadoop/Yarn/OS Daemons:

There are several daemons that run in the background, like NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker.

So, while specifying --num-executors, we need to make sure that we leave aside enough cores (~1 core per node) for these daemons to run smoothly.

  • Yarn ApplicationMaster (AM)

If we are running Spark on YARN, then we need to budget in the resources that the AM would need (~1024MB and 1 executor).

  • HDFS Throughput

The HDFS client has trouble with lots of concurrent threads; an executor reaches full HDFS write throughput with roughly 5 tasks at most, so keep --executor-cores around 5 or fewer.

  • MemoryOverhead

Full memory requested to YARN per executor = spark.executor.memory + spark.yarn.executor.memoryOverhead

spark.yarn.executor.memoryOverhead = max(384MB, 7% of spark.executor.memory)

So, if we request 20GB per executor, the AM will actually get 20GB + memoryOverhead = 20GB + 7% of 20GB ≈ 21.4GB of memory for us.
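As a quick illustration, here is a minimal sketch of how that overhead shows up when submitting a job (the application class and jar names are hypothetical); the --conf line is optional and simply overrides the default 7% when a job needs more off-heap room:

```bash
# Minimal sketch with hypothetical app/class names.
# --executor-memory sets only the JVM heap; YARN reserves
# executor-memory + max(384MB, 7% of executor-memory) = 20g + ~1.4g ≈ 21.4g per container.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --class com.example.MyApp \
  your-app.jar
```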

tips:

Running tiny executors (with a single core and just enough memory needed to run a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM.

Related configuration:

Cluster Config:
10 Nodes
16 cores per Node
64GB RAM per Node

  • First Approach: Tiny executors [One Executor per core]:
  • --num-executors = In this approach, we'll assign one executor per core
    = `total-cores-in-cluster`
    = `num-cores-per-node * total-nodes-in-cluster`
    = 16 x 10 = 160
  • --executor-cores = 1 (one executor per core)
  • --executor-memory = amount of memory per executor
    = `mem-per-node/num-executors-per-node`
    = 64GB/16 = 4GB

Not Good! As the tip above notes, single-core executors throw away the benefits of running multiple tasks in a single JVM, and this setup leaves nothing aside for the Hadoop/Yarn daemons or the ApplicationMaster.
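Spelled out as spark-submit flags, the tiny-executor setup would look roughly like this (a sketch with hypothetical app names, shown only for contrast):

```bash
# Approach 1 (tiny executors): one core and 4GB per executor -- NOT a recommended config.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 160 \
  --executor-cores 1 \
  --executor-memory 4g \
  --class com.example.MyApp \
  your-app.jar
```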

  • Second Approach: Fat executors (One Executor per node):
  • --num-executors = In this approach, we'll assign one executor per node
    = `total-nodes-in-cluster`
    = 10
  • --executor-cores = one executor per node means all the cores of the node are assigned to one executor
    = `total-cores-in-a-node`
    = 16
  • --executor-memory = amount of memory per executor
    = `mem-per-node/num-executors-per-node`
    = 64GB/1 = 64GB

Not Good! With 16 concurrent tasks per executor, HDFS throughput suffers (see the HDFS Throughput note above), and again nothing is left aside for the daemons or the ApplicationMaster.
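And the fat-executor setup as flags (again a hypothetical sketch, for contrast only):

```bash
# Approach 2 (fat executors): one 16-core, 64GB executor per node -- also NOT recommended.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 16 \
  --executor-memory 64g \
  --class com.example.MyApp \
  your-app.jar
```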

  • Third Approach: Balance between Fat and Tiny executors

    • Based on the recommendations mentioned above, let's assign 5 cores per executor => --executor-cores = 5 (for good HDFS throughput)
    • Leave 1 core per node for Hadoop/Yarn daemons => Num cores available per node = 16 - 1 = 15
    • So, total available cores in the cluster = 15 x 10 = 150
    • Number of available executors = (total cores / num-cores-per-executor) = 150/5 = 30
    • Leaving 1 executor for the ApplicationMaster => --num-executors = 29
    • Number of executors per node = 30/10 = 3
    • Memory per executor = 64GB/3 = 21GB
    • Counting off heap overhead = 7% of 21GB ≈ 1.5GB (rounded up to ~3GB for headroom). So, actual --executor-memory = 21 - 3 = 18GB

recommended config is: 29 executors, 18GB memory each and 5 cores each!!
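Put together as a spark-submit command, the recommended config would look something like this (a sketch only; the application class and jar names are placeholders, and --driver-memory reflects the ~1GB budgeted for the AM/driver above):

```bash
# The balanced config from above, as a sketch (hypothetical app/class names).
# YARN adds max(384MB, 7% of 18g) ≈ 1.3g of overhead per executor, keeping each
# container within the ~21GB available per executor slot.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 29 \
  --executor-cores 5 \
  --executor-memory 18g \
  --driver-memory 1g \
  --class com.example.MyApp \
  your-app.jar
```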