TL;DR

In this post we discuss monitoring Airflow DAGs with Prometheus and introduce our plugin: epoch8/airflow-exporter.

The Problem

At Epoch8 we’re using Prometheus for monitoring everything. So it’s no different when it comes to monitoring our ETL pipelines.

Our preferred ETL orchestration tool is Airflow. It has almost everything we need: programmatically defined execution graphs, Python as the development language and a visual UI for introspection.

There’s only one caveat: Airflow’s built-in monitoring tools are close to non-existent.

Our typical use case is to have lots of DAGs running on a schedule (hourly or daily) in Airflow. You don’t need much monitoring during active development: you have the UI open and can check for errors every now and then.

Things get more complex when the active development phase is over: you have deployed a bunch of DAGs to production and moved on to other work. At this point you want some sort of alerting mechanism set up so that you don’t need to check for errors manually. When something fails, you get an alert, preferably in your ops channel in Slack.

There were two ways to implement this behavior:

  1. Modify each DAG to include a task that sends an alert to Slack when another task fails (see the sketch below).
  2. Export DAG and task success status to an external monitoring system like Prometheus and set up your alerts there.
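For reference, option 1 usually boils down to an extra task with trigger_rule="one_failed" at the end of every DAG. Here is a minimal sketch; the DAG, task names and webhook URL are purely illustrative, not taken from our code:

import requests
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator

# Hypothetical webhook URL; in practice it would come from a connection or a secret store
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def notify_slack(**context):
    # Post a short message about the failed DAG run to Slack
    text = "DAG {} failed on {}".format(context["dag"].dag_id, context["ds"])
    requests.post(SLACK_WEBHOOK_URL, json={"text": text})

dag = DAG("example_dag", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

work = DummyOperator(task_id="do_work", dag=dag)

# Extra task that runs only when an upstream task has failed
alert = PythonOperator(
    task_id="alert_on_failure",
    python_callable=notify_slack,
    provide_context=True,
    trigger_rule="one_failed",
    dag=dag,
)

work >> alert

The downside is obvious: this boilerplate has to be repeated (or factored into a helper) in every single DAG.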

We preferred the second way: integrating with Prometheus. For us, the ideal solution would cover both monitoring for DAG and task failures and monitoring overall Airflow health and uptime.

Existing solutions

There was one project that did almost exactly what we needed: PBWebMedia/airflow-prometheus-exporter. It had only a couple of drawbacks: it was a component external to Airflow, and it only supported MySQL.

With our first attempt, we patched PBWebMedia/airflow-prometheus-exporter to support PostgreSQL and slightly altered the metrics schema. However, there were still several things we didn’t like about this approach.

So we decided to write our own plugin for Airflow.

Plugin

There are a number of upsides to using an integrated plugin: fewer parts to deploy, no need to copy-paste configuration (such as database access credentials), and the monitoring endpoint is down when Airflow is down (which is convenient, because Prometheus can be set up to send an alert when the target is down).

We implemented a simple plugin for Airflow: epoch8/airflow-exporter.

It is installed by a simple checkout into the plugins/ directory:

git clone https://github.com/epoch8/airflow-exporter plugins/prometheus_exporter

The plugin creates a new endpoint, /admin/metrics, and exposes two metrics: airflow_dag_status (the number of DagRuns in each state, labeled by dag_id and status) and airflow_task_status (the same per-state count for task instances, labeled by dag_id, task_id and status).

Both metrics report absolute values: the number of items that currently have a given property. The idea is that production should remain clean and that base statements about production should be simple. The statement “there should be no failed DagRuns in prod” is simple and easy to maintain.

Here’s an example of metrics from the airflow-exporter:

[Screenshot: example metrics exposed at /admin/metrics]
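In text form, the output of /admin/metrics looks roughly like this (the DAG names, task names and values below are made up for illustration):

airflow_dag_status{dag_id="daily_report",status="success"} 42.0
airflow_dag_status{dag_id="daily_report",status="failed"} 0.0
airflow_dag_status{dag_id="hourly_sync",status="running"} 1.0
airflow_task_status{dag_id="daily_report",task_id="load_to_dwh",status="success"} 42.0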

Set up Prometheus scraping and alerting

After you’ve installed the plugin, you can point Prometheus at Airflow to collect metrics like this (in the scrape_configs: section of prometheus.yml):

- job_name: 'airflow'
  metrics_path: /admin/metrics
  scheme: http
  static_configs:
  - targets:
    - 'analytics.fm.epoch8.co:8080'
    labels:
      env: fm-prod

Set up the alerting rules:

- name: fm-airflow.rules
  rules:
  - alert: AirflowDagAlert
    expr: airflow_dag_status{env="fm-prod", status="failed"} > 0
    for: 30m
    labels:
      team: fm
    annotations:
      description: "{{ $labels.instance }} - DAG {{ $labels.dag_id }} failed \n"

This config will trigger an alert once any DAG has been in a failed state for 30 minutes. That delay is enough to avoid false alerts while deploying new code or experimenting.
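These alerts then have to be routed somewhere by Alertmanager. A minimal sketch of a Slack receiver in alertmanager.yml might look like this; the channel name and webhook URL are placeholders, not our actual config:

route:
  receiver: 'ops-slack'

receivers:
- name: 'ops-slack'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
    channel: '#ops'
    text: '{{ .CommonAnnotations.description }}'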

Result

As a result, you’ll have a flexible setup. Prometheus supports a variety of endpoints for sending notifications to developers; we use Slack with a dedicated channel for operations. Here’s how it looks in real life:

[Screenshot: a DAG failure alert in our Slack ops channel]