- Efficiently scale our data pipelines to process workloads of up to several terabytes a day
- Adapt to the platform’s technological and organisational shift to a micro-service architecture
- Quickly detect failures and inconsistencies in the many data processes run each day/hour
- Respond to internal or external faults without impacting the quality, conformity, and availability of actionable information to our business users
It provides:
- Retry mechanisms to ensure that each and every anomaly can be detected and automatically or manually healed over time, with as little human intervention as possible (see the sketch after this list)
- Priority-aware work queue management, ensuring that the most important tasks are run first and complete as soon as possible
- A resource pooling system to ensure that, in a high-concurrency environment, thresholds can be set to avoid overloading input or output systems
- Backfill capabilities to identify “missing” past runs, and automatically re-create and run them
- A full history of metrics and statistics to track each task’s performance over time, and even assess data-delivery SLAs
- A horizontally scalable set of alternatives for how tasks are dispatched and run on a distributed infrastructure
- A centralized, secure place to store and view logs and configuration parameters for all task runs
- A central database that stores all stateful information
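As an illustration, here is a minimal sketch (using the Airflow 1.10-era Python API) that wires several of these features into a single DAG: automatic retries, a resource pool, a priority weight, catchup-based backfill, and a task-level SLA. The DAG id, pool name, owner, and callable are hypothetical, and the pool itself would be created separately via the UI or CLI:

```python
# A minimal sketch, not a production pipeline; names are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {
    "owner": "data-platform",             # hypothetical owner
    "retries": 3,                         # retry each failed task automatically
    "retry_delay": timedelta(minutes=5),  # wait between attempts
}

dag = DAG(
    dag_id="example_pipeline",            # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@hourly",
    catchup=True,                         # re-create and run "missing" past runs
)

extract = PythonOperator(
    task_id="extract",
    python_callable=lambda: print("extracting"),
    pool="source_db",        # hypothetical pool capping concurrent hits on the source
    priority_weight=10,      # higher weight is scheduled first under contention
    sla=timedelta(hours=1),  # flag runs that have not finished within the SLA
    dag=dag,
)
```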
Airflow offers several executors out of the box, from the simplest to the most full-featured:
SequentialExecutor
: a very basic, single-task-at-a-time executor that is also the default. You do NOT want to use this one for anything but unit testing
LocalExecutor
: also very basic, it runs the tasks on the same host as the scheduler, and is quite simple to set up. It’s the best candidate for small, non-distributed deployments and development environments, but won’t scale horizontally
CeleryExecutor
: here we begin to scale out over a distributed cluster of Celery workers to cope with a large task set. Still quite easy to set up and use, it’s the recommended setup for production
MesosExecutor
: if you’re one of the cool kids and have an existing Mesos infrastructure, surely you will want to leverage it as a destination for your task executions
KubernetesExecutor
: if you’re an even cooler kid, support for Kubernetes has been added in version 1.10.0
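For reference, the executor is selected through the executor setting in the [core] section of airflow.cfg; a minimal excerpt switching to Celery would look like this:

```ini
# airflow.cfg (relevant excerpt only)
[core]
executor = CeleryExecutor
```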
Not supported:
No dynamic execution: Airflow builds the workflow’s graph ahead of the actual execution. Airflow does have a dynamic DAG generation system, which can rely on external parameters (configuration, or even Airflow variables) to alter the workflow’s graph, and we use this pattern a lot; but it is not possible to alter the shape of the workflow at runtime (for instance, to spawn a variable number of tasks depending on the output of an upstream task).
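To make the distinction concrete, here is a minimal sketch of the parse-time pattern described above: the set of tasks is derived from an Airflow Variable when the scheduler parses the file, so the graph’s shape can change between parses, but never in the middle of a run. The Variable name, DAG id, and task ids are hypothetical:

```python
# A minimal sketch of parse-time dynamic DAG generation; names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

# External parameter read when the DAG file is parsed; changing it
# alters the graph on the next parse, not during a run
tables = Variable.get("tables_to_sync", default_var="users,orders").split(",")

dag = DAG(
    dag_id="dynamic_graph",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

for table in tables:
    PythonOperator(
        task_id="sync_{}".format(table),
        python_callable=lambda t=table: print("syncing", t),
        dag=dag,
    )
```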