Spark Submit Parameter Analysis

Ever wondered how to configure the `--num-executors`, `--executor-memory` and `--executor-cores` Spark config params for your cluster?

A little bit of theory first:

The cluster runs several daemons in the background, like NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker, so some resources on each node must be left aside for them.

When setting `--num-executors`, we need to make sure that we leave aside enough cores for these daemons (~1 core per node).

If we are running Spark on YARN, then we also need to budget in the resources that the ApplicationMaster (AM) would need (~1024 MB and 1 executor).
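
To make the budgeting concrete, here is a minimal sketch (plain Python; the helper name is made up for illustration) of the cores left per node once the daemons are accounted for:

```python
def usable_cores_per_node(cores_per_node: int, reserved_for_daemons: int = 1) -> int:
    """Cores left for executors on one node, assuming ~1 core is
    reserved for the OS and Hadoop daemons. On YARN, remember that
    the AM additionally consumes ~1024 MB and one executor slot
    somewhere in the cluster."""
    return cores_per_node - reserved_for_daemons

print(usable_cores_per_node(16))  # -> 15
```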

Full memory requested from YARN per executor = `spark.executor.memory` + `spark.yarn.executor.memoryOverhead`

`spark.yarn.executor.memoryOverhead` = max(384 MB, 7% of `spark.executor.memory`)

So, if we request 20 GB per executor, YARN will actually allocate 20 GB + memoryOverhead = 20 GB + 7% of 20 GB ≈ 21.4 GB of memory for us.
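
As a quick sanity check on that arithmetic, here is a minimal Python sketch of the formula above (the function name is hypothetical):

```python
def yarn_container_size_gb(executor_memory_gb: float) -> float:
    """spark.executor.memory + max(384 MB, 7% of spark.executor.memory)."""
    overhead_gb = max(0.384, 0.07 * executor_memory_gb)
    return executor_memory_gb + overhead_gb

print(yarn_container_size_gb(20))  # 21.4 -> a 20 GB executor costs ~21.4 GB of YARN memory
```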

Tips:

Running tiny executors (with a single core and just enough memory needed to run a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM.

Related configuration:

Cluster config: 10 nodes, 16 cores per node, 64 GB RAM per node

  • --num-executors = In this approach, we'll assign one executor per core
    = `total-cores-in-cluster`
    = `num-cores-per-node * total-nodes-in-cluster`
    = 16 x 10 = 160

Not Good! This is exactly the tiny-executor anti-pattern from the tip above.

  • --num-executors = In this approach, we'll assign one executor per node
    = `total-nodes-in-cluster`
    = 10

Also not good! Each executor would then hog all 16 cores and most of the 64 GB RAM of its node, and such fat executors suffer from poor HDFS throughput and long garbage-collection pauses.

The recommended config balances the two extremes: take 5 cores per executor (a common rule of thumb for good HDFS throughput), leave 1 of the 16 cores per node for daemons → 15 usable cores per node → 3 executors per node → 30 executors on 10 nodes, minus 1 for the YARN AM → 29 executors; 64 GB / 3 executors ≈ 21 GB per container, minus memoryOverhead and some headroom → 18 GB each. So the recommended config is: 29 executors, 18 GB memory each and 5 cores each!! The sketch below walks through the same arithmetic.
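
Putting the whole derivation in one place (a minimal Python sketch; all names are illustrative, and the ~3 GB overhead budget follows the rounding used above):

```python
NODES = 10
CORES_PER_NODE = 16
RAM_GB_PER_NODE = 64

CORES_PER_EXECUTOR = 5                               # rule of thumb for good HDFS throughput
usable_cores = CORES_PER_NODE - 1                    # leave ~1 core per node for daemons -> 15
executors_per_node = usable_cores // CORES_PER_EXECUTOR         # -> 3
num_executors = executors_per_node * NODES - 1       # 30, minus 1 slot for the YARN AM -> 29

container_gb = RAM_GB_PER_NODE // executors_per_node            # -> ~21 GB per container
heap_gb = container_gb - 3                           # budget ~3 GB for memoryOverhead + headroom -> 18

print(f"--num-executors {num_executors} "
      f"--executor-cores {CORES_PER_EXECUTOR} "
      f"--executor-memory {heap_gb}G")
# -> --num-executors 29 --executor-cores 5 --executor-memory 18G
```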

