
High-Level Overview of Spark


Original article link

Spark driver:

mainly responsible for scheduling jobs

Spark executor:

mainly responsible for executing jobs, i.e. running the actual tasks

Workflow:

  1. Our standalone application is kicked off and initializes its SparkContext. Only after having a SparkContext can an app be referred to as a Driver
  2. Our Driver program asks the Cluster Manager for resources to launch its executors
  3. The Cluster Manager launches the executors
  4. Our Driver runs our actual Spark code
  5. Executors run tasks and send their results back to the driver
  6. SparkContext is stopped and all executors are shut down, returning resources back to the cluster
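A minimal sketch of that lifecycle in Scala (assuming a local Spark installation; the app name, master URL, and input path are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // 1. Initializing the SparkContext is what turns this app into a Driver
    val conf = new SparkConf()
      .setAppName("WordCountApp") // placeholder app name
      .setMaster("local[*]")      // placeholder master; e.g. "spark://host:7077" on a real cluster
    val sc = new SparkContext(conf)

    // 2-4. Behind the scenes, the Driver asks the cluster manager for executors,
    //      then schedules our Spark code onto them as tasks
    val counts = sc.textFile("input.txt") // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // 5. Executors run the tasks; take() ships the results back to the Driver
    counts.take(10).foreach(println)

    // 6. Stop the SparkContext: executors shut down and resources go back to the cluster
    sc.stop()
  }
}
```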

RDD: (Resilient Distributed Datasets)

RDDs are essentially the building blocks of Spark: everything is built from them. Even Spark's higher-level APIs (DataFrames, Datasets) are composed of RDDs under the hood.

With Spark, we always need to think in a distributed context.

RDDs are immutable: once they are created, they cannot be altered in any way; they can only be transformed into new RDDs.
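A small sketch of what that means in practice (assuming an already-created SparkContext named `sc`):

```scala
// Assumes an existing SparkContext `sc`
val numbers = sc.parallelize(1 to 10) // an immutable RDD

// Transformations never modify `numbers`; each one returns a *new* RDD
val doubled = numbers.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// Transformations are lazy: nothing runs until an action like collect()
println(evens.collect().mkString(", "))
```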

Take advantage of data re-use as much as possible.
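One common way to exploit re-use is caching: if an RDD feeds several computations, persist it so it is not rebuilt from the input each time. A sketch, again assuming an existing SparkContext `sc` (the file path is a placeholder):

```scala
// Assumes an existing SparkContext `sc`; "events.log" is a placeholder path
val parsed = sc.textFile("events.log")
  .map(_.toLowerCase)
  .cache() // keep the parsed RDD in memory for re-use

// Both actions re-use the cached RDD instead of re-reading and
// re-parsing the input file
val total  = parsed.count()
val errors = parsed.filter(_.contains("error")).count()
```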

DAG:

The entire chain of operations: Spark records every transformation as a directed acyclic graph (DAG) and uses this lineage to plan execution and to recompute lost partitions.
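You can inspect the DAG/lineage Spark has recorded with `toDebugString` (a sketch, assuming an existing SparkContext `sc`):

```scala
// Assumes an existing SparkContext `sc`
val rdd = sc.parallelize(1 to 100)
  .map(_ * 2)
  .filter(_ % 3 == 0)
  .map(n => (n % 10, n))
  .reduceByKey(_ + _)

// Prints the recorded lineage: the DAG Spark turns into stages and tasks
println(rdd.toDebugString)
```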

