[Original] Big Data Basics: Airflow (2): A Study of Deploying Airflow in Production

I. The official approach

Airflow's official distributed deployment architecture diagram

Airflow processes

webserver
scheduler
flower (optional)
worker

Airflow drawbacks

The scheduler is a single point of failure
Workflows are submitted by adding or changing DAG files in the scheduler's dags directory
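
To make the second point concrete, here is a minimal sketch of what "submitting a workflow" amounts to: copying a DAG file into the scheduler's dags directory. The function name `submit_dag` and the temporary-file-then-rename step are my own assumptions for illustration, not something Airflow provides; the rename is only there so that a half-copied file never appears under its final `.py` name.

```python
import os
import shutil
import tempfile


def submit_dag(dag_file, dags_dir):
    """Copy dag_file into dags_dir so that the scheduler can pick it up.

    Hypothetical helper: the file is first copied under a temporary name
    that does not end in ".py" and is then renamed into place.
    """
    os.makedirs(dags_dir, exist_ok=True)
    target = os.path.join(dags_dir, os.path.basename(dag_file))
    fd, tmp_path = tempfile.mkstemp(dir=dags_dir, suffix=".tmp")
    os.close(fd)
    try:
        shutil.copyfile(dag_file, tmp_path)
        # mkstemp creates the file as 0600; make it readable for the scheduler user.
        os.chmod(tmp_path, 0o644)
        # Rename within the same directory, so the final name appears all at once.
        os.replace(tmp_path, target)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return target
```

Because the scheduler only learns about a workflow by scanning this directory, whoever can write to the directory can deploy a workflow, and the directory has to be present on whichever node runs the scheduler.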

Official distributed deployment options

Multiple webservers
Multiple workers
- CeleryExecutor (depends on Redis or RabbitMQ)
- MesosExecutor (depends on Mesos)
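
As a rough illustration of the CeleryExecutor option, the sketch below builds the settings as environment variables. Airflow reads overrides of the form `AIRFLOW__<SECTION>__<KEY>`; the key names used here are the ones from Airflow 1.10's `[core]` and `[celery]` sections. The function name and the example URLs in the usage note are hypothetical.

```python
def celery_executor_env(broker_url, result_backend):
    """Return Airflow settings for CeleryExecutor as environment variables.

    Hypothetical helper; every webserver, scheduler and worker process
    would need to be started with the same values.
    """
    return {
        "AIRFLOW__CORE__EXECUTOR": "CeleryExecutor",
        # Message broker that carries task messages (Redis or RabbitMQ).
        "AIRFLOW__CELERY__BROKER_URL": broker_url,
        # Where Celery stores task state.
        "AIRFLOW__CELERY__RESULT_BACKEND": result_backend,
    }
```

For example, `celery_executor_env("redis://redis.example:6379/0", "db+mysql://user:password@mysql.example/airflow")` yields the three variables to export before starting each Airflow process; the host names and credentials are placeholders.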

Third-party open-source solution: ASFC

For the scheduler single-point-of-failure problem there is a third-party solution: https://github.com/teamclairvoyant/airflow-scheduler-failover-controller

The Airflow Scheduler Failover Controller (ASFC) is a mechanism that ensures that only one Scheduler instance is running in an Airflow Cluster at a time. This way you don't come across the issues we described in the "Motivation" section above.

You will first need to startup the ASFC on each of the instances you want the scheduler to be running on. When you start up multiple instances of the ASFC one of them takes on the Active state and the other takes on a Standby state. There is a heart beat mechanism setup to track if the Active ASFC is still active. If the Active ASFC misses multiple heart beats, the Standby ASFC becomes active.

The Active ASFC will poll every 10 seconds to see if the scheduler is running on the desired node. If it is not, the ASFC will try to restart the daemon. If the scheduler daemons still doesn't startup, the daemon is started on another node in the cluster.

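In short, the Airflow Scheduler Failover Controller (ASFC) works like this: of the multiple instances, only one is in the active state; the active instance checks every 10 seconds whether the scheduler process is alive and restarts it when necessary.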
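
The polling behaviour described above can be sketched roughly as follows. This is not ASFC's actual code, only my reading of the README: `is_running` and `start` are hypothetical callables standing in for whatever probe and start command a real controller would use, and the active/standby heartbeat between controllers is left out.

```python
import time


def check_scheduler(nodes, active_node, is_running, start):
    """One polling step; returns the node the scheduler runs on afterwards.

    nodes:       candidate scheduler nodes, in preference order
    active_node: node where the scheduler is expected to be running
    is_running:  callable(node) -> bool, hypothetical liveness probe
    start:       callable(node) -> None, hypothetical start command
    """
    if is_running(active_node):
        return active_node
    # First try to restart the daemon on the desired node.
    start(active_node)
    if is_running(active_node):
        return active_node
    # If it still is not up, start it on another node in the cluster.
    for node in nodes:
        if node == active_node:
            continue
        start(node)
        if is_running(node):
            return node
    raise RuntimeError("scheduler could not be started on any node")


def run_controller(nodes, is_running, start, poll_seconds=10, max_polls=None):
    """Poll every poll_seconds; max_polls=None means poll forever."""
    active = nodes[0]
    polls = 0
    while max_polls is None or polls < max_polls:
        active = check_scheduler(nodes, active, is_running, start)
        time.sleep(poll_seconds)
        polls += 1
    return active
```

The point of the design is that at most one scheduler is ever started deliberately: a second node is only tried after the restart on the current node has failed.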

The bad news is that this solution is not compatible with the newer Airflow 1.10 release

II. A study of a production Airflow deployment based on Mesos + HDFS

Common parts

Same as the official approach

Use a MySQL database as the metadata database

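Difference from the official approach, 1

All changes to the dags directory are synced to HDFS, which makes the dags directory highly available
The HDFS NFS Gateway is used to mount HDFS on every node that might run the scheduler, so the same dags directory is used no matter which node the scheduler is deployed on
nginx + marathon-lb expose the Airflow webserver externally, so that workflows can be operated and their execution status inspected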
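
As a sketch of the last point, this is roughly what a Marathon app definition for the webserver could look like when it is exposed through marathon-lb. The app id, resource figures, service port and virtual host are all made-up values, and the `cmd` assumes the `airflow` command is available wherever the task is launched. `HAPROXY_GROUP` and `HAPROXY_0_VHOST` are the labels marathon-lb looks for.

```python
import json


def webserver_app(service_port, vhost, instances=1):
    """Build a Marathon app definition (a plain dict) for the Airflow webserver.

    Hypothetical values throughout; adjust id, cpus and mem to the cluster.
    """
    return {
        "id": "/airflow/webserver",
        # Marathon assigns the host port and exposes it to the task as $PORT0.
        "cmd": "airflow webserver -p $PORT0",
        "cpus": 1,
        "mem": 2048,
        "instances": instances,
        # The service port is the stable port that marathon-lb listens on.
        "portDefinitions": [
            {"port": service_port, "protocol": "tcp", "name": "http"}
        ],
        "labels": {
            "HAPROXY_GROUP": "external",
            "HAPROXY_0_VHOST": vhost,
        },
    }


if __name__ == "__main__":
    print(json.dumps(webserver_app(10080, "airflow.example.com"), indent=2))
```

The resulting JSON is what would be submitted to Marathon; nginx then only needs to forward to marathon-lb, which tracks the node the webserver currently runs on.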

1 Airflow single-instance container deployment

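Difference from the official approach, 2

The webserver, scheduler, and worker run as Docker containers; only one instance of each is deployed across the nodes, with Marathon ensuring availability and marathon-lb handling service discovery
The worker uses LocalExecutor, i.e. all tasks are executed as child processes
- To let the LocalExecutor inside the container reach the external cluster, one workable approach is to mount the parent directory of the various components into the container (for example, if the component directories are /app/java, /app/hive, /app/spark, and /app/hdfs, mount /app into the container). Every task script then starts by sourcing a common script that initializes environment variables, setting the various Home and Path values, after which the component clients such as java, hive, spark, and hdfs can be used inside the container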
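
The common environment-initialization step can be sketched in Python as well, for tasks that launch the clients via a subprocess. The mapping from variable names to directories below is my assumption based on the example paths (in particular, pointing `HADOOP_HOME` at /app/hdfs is a guess); the text above only says that the various Home and Path variables are set.

```python
import os

# Hypothetical mapping: environment variable -> component directory under the mount root.
COMPONENT_HOMES = {
    "JAVA_HOME": "java",
    "HIVE_HOME": "hive",
    "SPARK_HOME": "spark",
    "HADOOP_HOME": "hdfs",
}


def component_env(mount_root="/app", base_env=None):
    """Return a copy of the environment with component homes and PATH set.

    Paths are built with "/" and ":" on purpose, since they refer to the
    Linux container's filesystem rather than the machine running this code.
    """
    env = dict(os.environ if base_env is None else base_env)
    bin_dirs = []
    for var, subdir in COMPONENT_HOMES.items():
        home = f"{mount_root}/{subdir}"
        env[var] = home
        bin_dirs.append(f"{home}/bin")
    old_path = env.get("PATH", "")
    # Put the component bin directories in front of the existing PATH.
    env["PATH"] = ":".join(bin_dirs + ([old_path] if old_path else []))
    return env
```

A task would then pass the result as the `env` argument of `subprocess.run` when calling a client such as hive or hdfs, so the container image itself does not need those clients installed.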

2 Airflow distributed container deployment