DevilKing's blog



data traffic control in apache airflow

Original article link

  • Scale our data pipelines to process workloads of up to several terabytes a day efficiently
  • Adapt to the platform’s technological and organisational shift to a micro-service architecture
  • Quickly detect failures and inconsistencies in the many data processes run each day/hour
  • Respond to internal or external faults without impacting the quality, conformity, and availability of actionable information to our business users

Airflow provides:

  • Retry mechanisms to ensure that each and every anomaly can be detected, and automatically or manually healed over time (with as little human intervention as possible)
  • Priority aware work queue management, ensuring that the most important tasks are run first and complete as soon as possible
  • Resource pooling system to ensure that, in a high concurrency environment, thresholds can be set to avoid overloading input or output systems
  • Backfill capabilities to identify “missing” past runs, and automatically re-create and run them
  • Full history of metrics and statistics to view the evolution of each task performance over time, and even assess data-delivery SLAs over time
  • A horizontally scalable set of alternatives to the way tasks are dispatched and run on a distributed infrastructure
  • A centralized, secure place to store and view logs and configuration parameters for all task runs
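Two items in the list above, priority-aware queueing and resource pooling, can be sketched together in plain Python (this is not Airflow code; the task names and pool sizes are hypothetical). Airflow's scheduler combines the same two ideas: each task has a priority weight, and each pool caps how many tasks may run against a given resource at once.

```python
import heapq

class PooledPriorityQueue:
    """Pop the highest-priority task whose pool still has a free slot."""

    def __init__(self, pool_slots):
        self._slots = dict(pool_slots)   # pool name -> free slots
        self._heap = []                  # (-priority, task_id, pool)

    def push(self, task_id, priority, pool):
        # heapq is a min-heap, so negate priority to pop the largest first
        heapq.heappush(self._heap, (-priority, task_id, pool))

    def pop_runnable(self):
        """Return the next runnable (task_id, pool), or None if every
        remaining task belongs to a pool with no free slots."""
        skipped = []
        task = None
        while self._heap:
            neg_prio, task_id, pool = heapq.heappop(self._heap)
            if self._slots.get(pool, 0) > 0:
                self._slots[pool] -= 1
                task = (task_id, pool)
                break
            skipped.append((neg_prio, task_id, pool))
        for item in skipped:             # put back tasks whose pool was full
            heapq.heappush(self._heap, item)
        return task

    def release(self, pool):
        """A task finished; free one slot in its pool."""
        self._slots[pool] += 1

q = PooledPriorityQueue({"db": 1, "api": 2})
q.push("low_db", priority=1, pool="db")
q.push("high_db", priority=10, pool="db")
q.push("mid_api", priority=5, pool="api")

print(q.pop_runnable())  # highest priority overall: ('high_db', 'db')
print(q.pop_runnable())  # the db pool is now full, so: ('mid_api', 'api')
```

With the `db` pool limited to one slot, the low-priority `db` task waits until `release("db")` frees the slot, which is exactly the overload protection described above.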

A central database that stores all stateful information

Airflow offers several executors out of the box, from the simplest to the most full-featured:

  • SequentialExecutor: a very basic, single-task-at-a-time executor that is also the default one. You do NOT want to use this one for anything but unit testing
  • LocalExecutor: also very basic, it runs the tasks on the same host as the scheduler, and is quite simple to set up. It’s the best candidate for small, non-distributed deployments and development environments, but won’t scale horizontally
  • CeleryExecutor: here we are beginning to scale out over a distributed cluster of Celery workers to cope with a large task set. Still quite easy to set up and use, it’s the recommended setup for production
  • MesosExecutor: if you’re one of the cool kids and have an existing Mesos infrastructure, surely you will want to leverage it as a destination for your task executions
  • KubernetesExecutor: if you’re an even cooler kid, support for Kubernetes has been added in version 1.10.0
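In Airflow 1.x, the executor is selected with the `executor` option in the `[core]` section of `airflow.cfg` (the distributed executors additionally need their own backend settings, e.g. a Celery broker, which are omitted here):

```ini
[core]
# One of: SequentialExecutor (default), LocalExecutor,
# CeleryExecutor, MesosExecutor, KubernetesExecutor
executor = CeleryExecutor
```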

Not supported:

No dynamic execution: the graph Airflow executes is built ahead of the actual execution. Airflow has a dynamic DAG generation system, which can rely on external parameters (configuration, or even Airflow Variables) to alter the workflow’s graph. We use this pattern a lot, but it’s not possible to alter the shape of the workflow at runtime (for instance, spawn a variable number of tasks depending on the output of an upstream task)
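The parse-time "dynamic DAG" pattern can be illustrated with the Airflow API stubbed out as plain Python, so the shape-building logic is visible (the table names are hypothetical). In a real DAG file, each task id below would become an operator and each edge a `>>` dependency:

```python
def build_graph(tables):
    """Build one extract -> load chain per configured table.

    The graph is fixed the moment this function runs, i.e. when the
    scheduler parses the DAG file. Nothing that executes at runtime
    can add or remove tasks, which is exactly the limitation above.
    """
    edges = []
    for table in tables:
        edges.append((f"extract_{table}", f"load_{table}"))
        edges.append((f"load_{table}", "publish_report"))
    return edges

# The parameter list could come from a config file or an Airflow
# Variable, but it is read at parse time, not during execution:
print(build_graph(["users", "orders"]))
```

Changing the configuration changes the graph on the next parse, which is why the pattern works well for "one pipeline per table" setups, but an upstream task's output can never feed back into the graph shape of the same run.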