DevilKing's blog



High-level overview of Spark


Spark driver:

  • Runs on a node in our cluster, or on a client, and schedules the job execution with a cluster manager
  • Responds to the user's program or input
  • Analyzes, schedules, and distributes work across the executors
  • Stores metadata about the running application and exposes it in a web UI


Spark executor:

  • Performs all data processing of a Spark job
  • Stores results in memory, persisting to disk only when specifically instructed by the driver program
  • Returns results to the driver once they are complete
  • Each node can run anywhere from a single executor to one executor per core



  1. Our standalone application starts and initializes its SparkContext. Only once it has a SparkContext can the application be referred to as a driver
  2. Our driver program asks the cluster manager for resources to launch its executors
  3. The cluster manager launches the executors
  4. Our driver runs our actual Spark code
  5. Executors run tasks and send their results back to the driver
  6. The SparkContext is stopped and all executors are shut down, returning their resources to the cluster
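
The six steps above can be mimicked with a toy model in plain Python. This is only a sketch to make the flow concrete, not Spark code: a thread pool stands in for the cluster manager and its executors, and `run_job` is a hypothetical name chosen for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(data, num_executors=2):
    """Toy driver: split the data, farm out tasks, gather the results."""
    # Steps 1-2: the "driver" starts up and decides how many executors
    # it wants. Here it prepares one partition of the data per executor.
    partitions = [data[i::num_executors] for i in range(num_executors)]

    # Step 3: the pool stands in for the cluster manager launching executors.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        # Steps 4-5: each task sums the squares of one partition and
        # sends its partial result back to the driver.
        partials = list(
            executors.map(lambda part: sum(x * x for x in part), partitions)
        )
    # Step 6: leaving the `with` block shuts the executors down.

    # The driver combines the partial results into the final answer.
    return sum(partials)


total = run_job(list(range(1, 11)))
print(total)  # 385
```

In a real application the same roles are played by the SparkContext (the driver side), the cluster manager, and executor processes on the worker nodes. The point of the toy version is only the shape of the flow: resources are acquired, work is split into tasks, results come back to the driver, and resources are released.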

RDD (Resilient Distributed Dataset):

RDDs are essentially the building blocks of Spark: everything is built from them. Even Spark's higher-level APIs (DataFrames, Datasets) are composed of RDDs under the hood.

With Spark, we always need to think in a distributed context.

RDDs are immutable: once created, they cannot be altered in any way. They can only be transformed into new RDDs.
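
To see what "can only be transformed" means in practice, here is a minimal sketch in plain Python. `ToyRDD` is a hypothetical stand-in written for this note, not the Spark API. It borrows the method names `map`, `filter`, and `collect` from real RDDs, and it assumes a simple design in which each transformation returns a new object that records what should be done, leaving the original untouched.

```python
class ToyRDD:
    """Hypothetical, minimal stand-in for an RDD (not the Spark API)."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # frozen copy of the input data
        self._lineage = tuple(lineage)  # transformations recorded so far

    def map(self, fn):
        # Return a NEW ToyRDD; `self` is never modified.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Same idea: a new object with one more recorded step.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Replay the recorded steps only when results are requested.
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


numbers = ToyRDD([1, 2, 3, 4])
doubled = numbers.map(lambda x: x * 2)
big = doubled.filter(lambda x: x > 4)

print(numbers.collect())  # [1, 2, 3, 4]  (the original is unchanged)
print(doubled.collect())  # [2, 4, 6, 8]
print(big.collect())      # [6, 8]
```

Each call hands back a fresh object, so `numbers` still yields the original values no matter how many transformations are derived from it.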


