
Spark Submit Parameter Analysis

Ever wondered how to configure --num-executors, --executor-memory and --executor-cores spark config params for your cluster?

  • Lil bit theory: Let's see some key recommendations that will help us understand it better

  • Hands on: Next, we'll take an example cluster and come up with recommended numbers for these spark params

Lil bit theory:

  • Hadoop/Yarn/OS Daemons:

There are several daemons that run in the background, like NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker.

So, while specifying --num-executors, we need to make sure that we leave aside enough cores (~1 core per node) for these daemons to run smoothly.

  • Yarn ApplicationMaster (AM)

If we are running Spark on YARN, then we need to budget in the resources that the AM would need (~1024MB and 1 executor).

  • HDFS Throughput

The HDFS client has trouble with lots of concurrent threads; an executor reaches full HDFS write throughput with roughly 5 tasks at most, so keep --executor-cores around 5 or fewer.

  • MemoryOverhead

Full memory requested to YARN per executor = spark.executor.memory + spark.yarn.executor.memoryOverhead

spark.yarn.executor.memoryOverhead = max(384MB, 7% of spark.executor.memory)

So, if we request 20GB per executor, the AM will actually get 20GB + memoryOverhead = 20GB + 7% of 20GB ≈ 21.4GB of memory for us.
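As a quick illustration, here is a minimal sketch of how that overhead shows up when submitting a job (the application class and jar names are hypothetical); the --conf line is optional and simply overrides the default 7% when a job needs more off-heap room:

```bash
# Minimal sketch with hypothetical app/class names.
# --executor-memory sets only the JVM heap; YARN reserves
# executor-memory + max(384MB, 7% of executor-memory) = 20g + ~1.4g ≈ 21.4g per container.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --class com.example.MyApp \
  your-app.jar
```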

tips:

Running tiny executors (with a single core and just enough memory needed to run a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM.

Related configuration:

Cluster Config:
10 Nodes
16 cores per Node
64GB RAM per Node

  • First Approach: Tiny executors [One Executor per core]:
  • --num-executors = In this approach, we'll assign one executor per core
    = `total-cores-in-cluster`
    = `num-cores-per-node * total-nodes-in-cluster`
    = 16 x 10 = 160
  • --executor-cores = 1 (one executor per core)
  • --executor-memory = amount of memory per executor
    = `mem-per-node/num-executors-per-node`
    = 64GB/16 = 4GB

Not Good! As the tip above notes, single-core executors throw away the benefits of running multiple tasks in a single JVM, and this setup leaves nothing aside for the Hadoop/Yarn daemons or the ApplicationMaster.
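Spelled out as spark-submit flags, the tiny-executor setup would look roughly like this (a sketch with hypothetical app names, shown only for contrast):

```bash
# Approach 1 (tiny executors): one core and 4GB per executor -- NOT a recommended config.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 160 \
  --executor-cores 1 \
  --executor-memory 4g \
  --class com.example.MyApp \
  your-app.jar
```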

  • Second Approach: Fat executors (One Executor per node):
  • --num-executors = In this approach, we'll assign one executor per node
    = `total-nodes-in-cluster`
    = 10
  • --executor-cores = one executor per node means all the cores of the node are assigned to one executor
    = `total-cores-in-a-node`
    = 16
  • --executor-memory = amount of memory per executor
    = `mem-per-node/num-executors-per-node`
    = 64GB/1 = 64GB

Not Good! With 16 concurrent tasks per executor, HDFS throughput suffers (see the HDFS Throughput note above), and again nothing is left aside for the daemons or the ApplicationMaster.
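And the fat-executor setup as flags (again a hypothetical sketch, for contrast only):

```bash
# Approach 2 (fat executors): one 16-core, 64GB executor per node -- also NOT recommended.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 16 \
  --executor-memory 64g \
  --class com.example.MyApp \
  your-app.jar
```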

  • Third Approach: Balance between Fat and Tiny executors

    • Based on the recommendations mentioned above, let's assign 5 cores per executor => --executor-cores = 5 (for good HDFS throughput)
    • Leave 1 core per node for Hadoop/Yarn daemons => Num cores available per node = 16 - 1 = 15
    • So, total available cores in the cluster = 15 x 10 = 150
    • Number of available executors = (total cores / num-cores-per-executor) = 150/5 = 30
    • Leaving 1 executor for the ApplicationMaster => --num-executors = 29
    • Number of executors per node = 30/10 = 3
    • Memory per executor = 64GB/3 = 21GB
    • Counting off heap overhead = 7% of 21GB ≈ 1.5GB (rounded up to ~3GB for headroom). So, actual --executor-memory = 21 - 3 = 18GB

recommended config is: 29 executors, 18GB memory each and 5 cores each!!
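Put together as a spark-submit command, the recommended config would look something like this (a sketch only; the application class and jar names are placeholders, and --driver-memory reflects the ~1GB budgeted for the AM/driver above):

```bash
# The balanced config from above, as a sketch (hypothetical app/class names).
# YARN adds max(384MB, 7% of 18g) ≈ 1.3g of overhead per executor, keeping each
# container within the ~21GB available per executor slot.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 29 \
  --executor-cores 5 \
  --executor-memory 18g \
  --driver-memory 1g \
  --class com.example.MyApp \
  your-app.jar
```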