Lil bit theory: Let’s see some key recommendations that will help us understand executor tuning better
Hands on: Next, we’ll take an example cluster and come up with recommended values for these Spark params
Lil bit theory:
Hadoop/YARN/OS Daemons:
Several daemons run in the background on each node, like NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker.
So while sizing num-executors, we need to make sure that we leave aside enough cores for these daemons (~1 core per node)
YARN ApplicationMaster (AM)
If we are running Spark on YARN, then we need to budget in the resources that the AM would need (~1024MB and 1 executor)
HDFS Throughput
The HDFS client has trouble with lots of concurrent threads; it's been observed that HDFS achieves full write throughput with around 5 tasks per executor, so it's best to keep executor-cores at 5 or below.
MemoryOverhead
Full memory requested to YARN per executor = spark.executor.memory + spark.yarn.executor.memoryOverhead
spark.yarn.executor.memoryOverhead = max(384MB, 7% of spark.executor.memory)
So, if we request 20GB per executor, YARN will actually allocate 20GB + memoryOverhead = 20GB + 7% of 20GB = ~21.4GB of memory for us.
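The overhead formula above can be sketched in a few lines of Python; the function name here is illustrative, not part of Spark's API:

```python
def yarn_memory_per_executor(executor_memory_mb):
    """Total memory YARN allocates per executor:
    spark.executor.memory + max(384MB, 7% of spark.executor.memory)."""
    overhead_mb = max(384, 0.07 * executor_memory_mb)
    return executor_memory_mb + overhead_mb

# Requesting 20GB (20 * 1024 MB) per executor:
print(yarn_memory_per_executor(20 * 1024))  # 21913.6 MB, i.e. ~21.4GB
```

Note that for small executors (roughly 5.5GB and below), 7% of the heap falls under 384MB, so the 384MB floor kicks in instead.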
tips:
Running tiny executors (with a single core and just enough memory needed to run a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM.
Related config:
Cluster Config: 10 nodes, 16 cores per node, 64GB RAM per node
First Approach: Tiny executors [One Executor per core]:
--num-executors = In this approach, we'll assign one executor per core = total cores in cluster = 16 cores x 10 nodes = 160
--executor-cores = 1 (one executor per core)
--executor-memory = RAM per node / executors per node = 64GB / 16 = 4GB
"""Encode categorical integer features using a one-hot aka one-of-K scheme. The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features. The output will be a sparse matrix where each column corresponds to one possible value of one feature. It is assumed that input features take on values in the range [0, n_values). This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels. Read more in the :ref:`User Guide <preprocessing_categorical_features>`. Attributes ---------- active_features_ : array Indices for active features, meaning values that actually occur in the training set. Only available when n_values is ``'auto'``. feature_indices_ : array of shape (n_features,) Indices to feature ranges. Feature ``i`` in the original data is mapped to features from ``feature_indices_[i]`` to ``feature_indices_[i+1]`` (and then potentially masked by `active_features_` afterwards) n_values_ : array of shape (n_features,) Maximum number of values per feature.