DevilKing's blog


High Level overview of Spark

Original article link

Spark driver:

  • Runs on a node in our cluster, or on a client, and schedules the job execution with a cluster manager
  • Responds to user’s program or input
  • Analyzes, schedules, and distributes work across the executors
  • Stores metadata about the running application and conveniently exposes it in a web UI

In short: mainly responsible for scheduling the job.

Spark executor:

  • Executors perform all data processing of a Spark job
  • Store results in memory, persisting to disk only when specifically instructed by the driver program
  • Return results to the driver once they are complete
  • Each node can host anywhere from a single executor to one executor per core (see the sizing sketch below)

In short: mainly responsible for executing the job.
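As a rough sketch of that sizing decision: executor count and resources are set through Spark configuration. The property keys below are standard Spark settings (`spark.executor.instances` is honored on YARN/Kubernetes); the concrete values are made up for illustration.

```scala
import org.apache.spark.SparkConf

// Illustrative values only -- tune them for your own cluster and workload.
val conf = new SparkConf()
  .setAppName("executor-sizing-example")  // hypothetical app name
  .set("spark.executor.instances", "4")   // total executors (YARN/Kubernetes)
  .set("spark.executor.cores", "2")       // cores per executor
  .set("spark.executor.memory", "4g")     // heap memory per executor
```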

Workflow:

  1. Our Standalone Application is kicked off, and initializes its SparkContext. Only after having a SparkContext can an app be referred to as a Driver
  2. Our Driver program asks the Cluster Manager for resources to launch its executors
  3. The Cluster Manager launches the executors
  4. Our Driver runs our actual Spark code
  5. Executors run tasks and send their results back to the driver
  6. SparkContext is stopped and all executors are shut down, returning resources back to the cluster
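A minimal Scala sketch of that lifecycle; the app name and master URL are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MinimalApp {
  def main(args: Array[String]): Unit = {
    // Step 1: creating the SparkContext is what turns this app into a Driver.
    val conf = new SparkConf()
      .setAppName("minimal-app")          // placeholder app name
      .setMaster("spark://master:7077")   // placeholder cluster manager URL
    val sc = new SparkContext(conf)       // steps 2-3: resources requested, executors launched

    // Steps 4-5: the driver runs our Spark code; executors run the tasks
    // and send results back.
    val total = sc.parallelize(1 to 100).map(_ * 2).reduce(_ + _)
    println(s"total = $total")

    // Step 6: stop the SparkContext; executors shut down and resources
    // are returned to the cluster.
    sc.stop()
  }
}
```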

RDDs (Resilient Distributed Datasets):

RDDs are essentially the building blocks of Spark: everything is built from them. Even Spark's higher-level APIs (DataFrames, Datasets) are composed of RDDs under the hood.

With Spark, we need to think in a distributed context, always

RDDs are immutable: once created, they cannot be altered in any way; they can only be transformed into new RDDs

Take advantage of data re-use wherever possible.
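A short sketch illustrating both points, assuming an existing SparkContext `sc`: each transformation returns a new RDD rather than mutating the original, and `cache()` marks an RDD so its partitions are kept in memory for re-use.

```scala
// `sc` is an existing SparkContext; the numbers are made up.
val numbers = sc.parallelize(1 to 10)     // the original RDD -- never altered
val doubled = numbers.map(_ * 2)          // a transformation returns a NEW RDD
val evens   = doubled.filter(_ % 4 == 0)  // another new RDD; `numbers` is untouched

evens.cache()                             // mark for re-use: keep partitions in memory
println(evens.count())                    // first action computes and caches the RDD
println(evens.sum())                      // second action re-uses the cached data
```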

DAG:

The whole chain of operations, recorded as a directed acyclic graph of transformations.
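One way to see that graph: every RDD can print the lineage Spark has recorded for it via `toDebugString`. Continuing with the `evens` RDD from the sketch above:

```scala
// Prints the lineage Spark recorded for this RDD -- here the
// parallelize -> map -> filter chain from the sketch above.
println(evens.toDebugString)
```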