Building Spark
- JRE path
- Maven installed
- always explicitly set the JRE and Maven paths in '.bashrc' for 'root'
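- e.g., a minimal '.bashrc' sketch (install paths are assumptions; adjust to your machine):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
export PATH=$JAVA_HOME/bin:/opt/apache-maven/bin:$PATH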
- specify the support you want explicitly (Maven build profiles such as -Pyarn and -Phive)
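- e.g., a typical build with YARN and Hive support (a standard Spark build invocation; profile names vary by version):
./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package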
- rebuild only modified source code (build just the changed submodule, not the whole tree)
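- e.g., rebuilding a single submodule via Maven's -pl option (module name is illustrative):
./build/mvn package -DskipTests -pl :spark-streaming_2.12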
Running Spark
- always use the '--verbose' option on the 'spark-submit' command to run your workload
- print
- all default properties
- command line options
- settings from Spark's conf file ('spark-defaults.conf')
- settings from the CLI
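- e.g. (class and jar names are hypothetical):
spark-submit --verbose --master yarn --class com.example.MyApp myapp.jar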
- order of lookup for '--packages':
- the local Maven repo
- Maven Central (remote)
- additional remote repositories (passed via '--repositories')
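- e.g., pulling a dependency by Maven coordinates with an extra repository (coordinates and URL are examples):
spark-submit --packages org.apache.spark:spark-avro_2.12:3.0.0 --repositories https://repo.example.com/maven myapp.jar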
- OOM (OutOfMemoryError)
- increase '--driver-memory'
- Spark SQL and Spark Streaming need a large driver heap size
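- e.g. (the 4g figure is an assumption; size it to your workload):
spark-submit --driver-memory 4g --class com.example.MyApp myapp.jar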
- GC overhead limit exceeded
- too much time spent in garbage collection
- increase the executor heap size
- modify the GC policy
- -XX:+UseG1GC vs. -XX:+UseParallelGC
- Spark default: -XX:+UseParallelGC
- try overriding it with -XX:+UseG1GC
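- e.g., switching executors to G1 through the standard 'spark.executor.extraJavaOptions' property:
spark-submit --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" myapp.jar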
- a single SparkContext can host multiple sessions, supporting
- concurrency
- re-usable connections
- a shared cache
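- e.g., the Spark Thrift Server works this way (one SparkContext serving many JDBC sessions); if that is the deployment meant here, it is started with:
./sbin/start-thriftserver.sh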
- Not all CPUs are busy
- start with evenly divided memory and cores, e.g.:
--executor-memory 2500m --num-executors 200
- when the heap size is non-negotiable, add cores per executor instead:
--executor-memory 6g --num-executors 80
becomes
--executor-memory 6g --num-executors 80 --executor-cores 2
(forcing 80% utilization, boosting performance by 33%)
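- e.g., the full submit line for the tuned settings above (master and app details are hypothetical):
spark-submit --master yarn --executor-memory 6g --num-executors 80 --executor-cores 2 --class com.example.MyApp myapp.jar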
- 'scratch' space
- '/tmp' is full
- fix: point the scratch space at a larger volume via
spark.local.dir
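- e.g. (the directory is an assumption; use a disk with room):
spark-submit --conf spark.local.dir=/mnt/bigdisk/spark-tmp myapp.jar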
- max result size exceeded
spark.driver.maxResultSize
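- e.g., raising the cap (4g is an assumption; 0 means unlimited):
spark-submit --conf spark.driver.maxResultSize=4g myapp.jar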
- out of space on a few data nodes
- run 'hdfs balancer' to start balancing
- dfs.datanode.balance.bandwidthPerSec: increase to 6GB/s
- dfs.datanode.balance.max.concurrent.moves: set to 5000 concurrent threads
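- e.g. (bandwidth is in bytes/sec, so 6GB/s is roughly 6000000000; '-setBalancerBandwidth' adjusts it live, and the -D override is a generic Hadoop option):
hdfs dfsadmin -setBalancerBandwidth 6000000000
hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=5000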
Profiling Spark
- YARN containers run across multiple nodes
- get a full thread dump or a full heap dump of a container's JVM:
jstack -l <pid>
jmap -dump:live,format=b,file=xxx <pid>
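- e.g., finding the executor's PID on the node first (Spark executor JVMs run as CoarseGrainedExecutorBackend):
jps -l | grep CoarseGrainedExecutorBackend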
- to locate an executor's logs: step 1: find the hostname in the error log; step 2: find the local directory on that node where 'stderr' resides; step 3: open 'stderr'
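- alternatively, YARN can collect the container logs for you (the application id is a placeholder):
yarn logs -applicationId <application_id>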