Spark driver:
- Runs on a node in our cluster, or on a client, and schedules the job execution with a cluster manager
- Responds to user’s program or input
- Analyzes, schedules, and distributes work across the executors
- Stores metadata about the running application and conveniently exposes it in a webUI
Mainly responsible for the job-scheduling side.
Spark executor:
- Executors perform all data processing of a Spark job
- Stores results in memory, only persisting to disk when specifically instructed by the driver program
- Returns results to the driver once they have been completed
- Each node can run anywhere from a single executor up to one executor per core
Mainly responsible for executing the job.
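A minimal sketch of the point about in-memory results: `persist()`/`cache()` is the explicit instruction from the driver program that tells executors to keep partitions around (spilling to disk only if they do not fit). The master URL and input path here are illustrative placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ExecutorCachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("caching-sketch").setMaster("local[*]"))

    val parsed = sc.textFile("data.txt")        // hypothetical input path
      .map(_.split(",")(0))
      .persist(StorageLevel.MEMORY_AND_DISK)    // executors cache partitions, spilling to disk only if needed

    // Both actions reuse the cached partitions held by the executors,
    // and each action's result is returned to the driver.
    println(parsed.count())
    println(parsed.distinct().count())

    sc.stop()
  }
}
```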
Workflow:
- Our Standalone Application is kicked off, and initializes its SparkContext. Only after having a SparkContext can an app be referred to as a Driver
- Our Driver program asks the Cluster Manager for resources to launch its executors
- The Cluster Manager launches the executors
- Our Driver runs our actual Spark code
- Executors run tasks and send their results back to the driver
- SparkContext is stopped and all executors are shut down, returning resources back to the cluster
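A minimal standalone-application sketch mirroring the workflow above; the master URL "spark://master:7077" is a placeholder for a real cluster-manager address.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WorkflowSketch {
  def main(args: Array[String]): Unit = {
    // 1. Creating the SparkContext is what makes this program a Driver;
    //    the context asks the cluster manager for executor resources.
    val conf = new SparkConf().setAppName("workflow-sketch").setMaster("spark://master:7077")
    val sc = new SparkContext(conf)

    // 2. The driver runs our Spark code; executors run the tasks and
    //    send their results back to the driver.
    val total = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
    println(s"total = $total")

    // 3. Stopping the SparkContext shuts down the executors and
    //    returns their resources to the cluster.
    sc.stop()
  }
}
```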
RDD: (Resilient Distributed Datasets)
RDDs are essentially the building blocks of Spark: everything is built from them. Even Spark's higher-level APIs (DataFrames, Datasets) are composed of RDDs under the hood.
With Spark, we need to think in a distributed context, always
RDDs are immutable, meaning that once they are created they cannot be altered in any way; they can only be transformed into new RDDs.
Take advantage of data re-use wherever possible (e.g., by caching RDDs that are used more than once).
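A small sketch of immutability and re-use (the names and data are illustrative): transformations never modify an existing RDD, they return a new one, and caching lets several downstream RDDs share the same source data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ImmutabilitySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("immutability-sketch").setMaster("local[*]"))

    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
    numbers.cache()                      // re-use: both derived RDDs read the cached data

    val doubled = numbers.map(_ * 2)     // a brand-new RDD; `numbers` is untouched
    val squared = numbers.map(n => n * n)

    println(doubled.collect().mkString(","))   // 2,4,6,8,10
    println(squared.collect().mkString(","))   // 1,4,9,16,25
    println(numbers.collect().mkString(","))   // 1,2,3,4,5 -- the original is unchanged

    sc.stop()
  }
}
```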
DAG: (Directed Acyclic Graph)
Describes the entire sequence of operations that makes up the job; Spark builds this graph from the transformations and uses it to plan execution.
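A minimal sketch of the DAG idea (data is illustrative): transformations only record lineage, and the whole graph of operations is executed when an action runs. `toDebugString` prints the recorded lineage that Spark turns into stages.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DagSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dag-sketch").setMaster("local[*]"))

    // Nothing executes yet: these transformations just build up the lineage/DAG.
    val counts = sc.parallelize(Seq("a b a", "b c"))
      .flatMap(_.split(" "))
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    println(counts.toDebugString)        // shows the lineage of transformations

    // The action triggers execution of the whole DAG.
    counts.collect().foreach(println)

    sc.stop()
  }
}
```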