Spark driver:
- Runs on a node in our cluster, or on a client, and schedules the job execution with a cluster manager
- Responds to user’s program or input
- Analyzes, schedules, and distributes work across the executors
- Stores metadata about the running application and conveniently exposes it in a webUI
Mainly responsible for the job-scheduling side.
Spark executor:
- Executors perform all data processing of a Spark job
- Stores results in memory, only persisting to disk when specifically instructed by the driver program
- Returns results to the driver once they have been completed
- Each node can run anywhere from a single executor up to one executor per core
Mainly responsible for executing the job.
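A minimal sketch of the point about in-memory results: `persist()`/`cache()` is the explicit instruction from the driver program that tells executors to keep partitions around (spilling to disk only if they do not fit). The master URL and input path here are illustrative placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ExecutorCachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("caching-sketch").setMaster("local[*]"))

    val parsed = sc.textFile("data.txt")        // hypothetical input path
      .map(_.split(",")(0))
      .persist(StorageLevel.MEMORY_AND_DISK)    // executors cache partitions, spilling to disk only if needed

    // Both actions reuse the cached partitions held by the executors,
    // and each action's result is returned to the driver.
    println(parsed.count())
    println(parsed.distinct().count())

    sc.stop()
  }
}
```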
Workflow:
- Our Standalone Application is kicked off, and initializes its SparkContext. Only after having a SparkContext can an app be referred to as a Driver
- Our Driver program asks the Cluster Manager for resources to launch its executors
- The Cluster Manager launches the executors
- Our Driver runs our actual Spark code
- Executors run tasks and send their results back to the driver
- SparkContext is stopped and all executors are shut down, returning resources back to the cluster
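A minimal standalone-application sketch mirroring the workflow above; the master URL "spark://master:7077" is a placeholder for a real cluster-manager address.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WorkflowSketch {
  def main(args: Array[String]): Unit = {
    // 1. Creating the SparkContext is what makes this program a Driver;
    //    the context asks the cluster manager for executor resources.
    val conf = new SparkConf().setAppName("workflow-sketch").setMaster("spark://master:7077")
    val sc = new SparkContext(conf)

    // 2. The driver runs our Spark code; executors run the tasks and
    //    send their results back to the driver.
    val total = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
    println(s"total = $total")

    // 3. Stopping the SparkContext shuts down the executors and
    //    returns their resources to the cluster.
    sc.stop()
  }
}
```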
RDD: (Resilient Distributed Datasets)
RDDs are essentially the building blocks of Spark: everything is built from them. Even Spark's higher-level APIs (DataFrames, Datasets) are composed of RDDs under the hood.
With Spark, we need to think in a distributed context, always
RDDs are immutable, meaning that once they are created they cannot be altered in any way; they can only be transformed into new RDDs.
Take advantage of data re-use wherever possible (e.g., by caching RDDs that are used more than once).
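A small sketch of immutability and re-use (the names and data are illustrative): transformations never modify an existing RDD, they return a new one, and caching lets several downstream RDDs share the same source data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ImmutabilitySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("immutability-sketch").setMaster("local[*]"))

    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
    numbers.cache()                      // re-use: both derived RDDs read the cached data

    val doubled = numbers.map(_ * 2)     // a brand-new RDD; `numbers` is untouched
    val squared = numbers.map(n => n * n)

    println(doubled.collect().mkString(","))   // 2,4,6,8,10
    println(squared.collect().mkString(","))   // 1,4,9,16,25
    println(numbers.collect().mkString(","))   // 1,2,3,4,5 -- the original is unchanged

    sc.stop()
  }
}
```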
DAG: (Directed Acyclic Graph)
Describes the entire sequence of operations that makes up the job; Spark builds this graph from the transformations and uses it to plan execution.
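A minimal sketch of the DAG idea (data is illustrative): transformations only record lineage, and the whole graph of operations is executed when an action runs. `toDebugString` prints the recorded lineage that Spark turns into stages.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DagSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dag-sketch").setMaster("local[*]"))

    // Nothing executes yet: these transformations just build up the lineage/DAG.
    val counts = sc.parallelize(Seq("a b a", "b c"))
      .flatMap(_.split(" "))
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    println(counts.toDebugString)        // shows the lineage of transformations

    // The action triggers execution of the whole DAG.
    counts.collect().foreach(println)

    sc.stop()
  }
}
```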