OpenLineage & Airflow - data lineage has never been easier
May 2022
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Maciej Paweł
OpenLineage to build a healthy data ecosystem
Team A, Team B, Team C
Interesting questions:
● What is the data source?
● What is the schema?
● Who is the owner?
● How often is it updated?
● Where does it come from?
● Who is using it?
● What has changed?
Infer or observe?
You can try to infer the date and location of an image after the fact…
…or you can capture it when the image is originally created!
[Image annotations: rocks, haze, 26m until sunset]
OpenLineage mission
● To define an open standard for the collection of lineage metadata from pipelines as they are running.
[Diagram labels: Producers; OpenLineage; HTTP client, Kafka client, GraphDB client; Backend, Kafka topic, Graph DB; Consumers]
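To make the mission concrete, here is a minimal sketch of what lineage events emitted by a running pipeline could look like. The top-level field names (eventType, eventTime, run, job, inputs, outputs, producer) are modeled on the OpenLineage run event; the namespaces, job name, dataset names and producer URL are made-up placeholders, and the helper function is hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, run_id, job_name, inputs, outputs):
    """Build a lineage event as a plain dict (all values are illustrative)."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "example-namespace", "name": job_name},
        "inputs": [{"namespace": "example-db", "name": n} for n in inputs],
        "outputs": [{"namespace": "example-db", "name": n} for n in outputs],
        "producer": "https://example.com/hypothetical-producer",
    }


# One run produces several events that share the same runId.
run_id = str(uuid.uuid4())
start = build_run_event("START", run_id, "daily_orders", ["raw.orders"], [])
complete = build_run_event(
    "COMPLETE", run_id, "daily_orders", ["raw.orders"], ["clean.orders"]
)
print(json.dumps(complete, indent=2))
```

Because the events are observed while the job runs, a consumer can correlate START and COMPLETE through the shared run ID instead of inferring the relationship afterwards.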
The dark past of Airflow Integration
● We want real-time notifications about TaskInstance start, success, and failure.
● First way to do it? Subclassing DAG:
- from airflow import DAG
+ from openlineage.airflow import DAG
● We can overload DAG methods and get notifications this way.
● Drawbacks: you have to modify all the DAGs and set up openlineage-airflow locally.
● Stopped working in Airflow 2.
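The subclassing idea can be sketched in plain Python. Everything here is a toy: the `DAG` class is a stand-in rather than `airflow.DAG`, the `run_task` method does not exist in Airflow, and `LineageDAG` is a hypothetical name. The sketch only shows how overloading a method lets a drop-in subclass observe start, success, and failure.

```python
class DAG:
    """Hypothetical stand-in for a DAG class, reduced to a single method."""

    def __init__(self, dag_id):
        self.dag_id = dag_id

    def run_task(self, task_id, fn):
        # Toy method (not part of Airflow): just run the callable.
        return fn()


class LineageDAG(DAG):
    """Hypothetical drop-in subclass that records task state changes."""

    def __init__(self, dag_id):
        super().__init__(dag_id)
        self.events = []

    def run_task(self, task_id, fn):
        self.events.append((task_id, "START"))
        try:
            result = super().run_task(task_id, fn)
        except Exception:
            self.events.append((task_id, "FAIL"))
            raise
        self.events.append((task_id, "SUCCESS"))
        return result


dag = LineageDAG("example")
dag.run_task("extract", lambda: 42)
print(dag.events)
```

The cost is visible in the last lines: every pipeline has to construct the subclass instead of the original class, which is the "modify all the DAGs" drawback.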
The closer, slightly dark past
● LineageBackend - sounds like the right tool for the job?
● Both Airflow 1.10 and Airflow 2.1+ are supported.
● You can choose your LineageBackend in the Airflow config.
● It does not allow us to emit events on task start or failure.
● We need those to reliably report what happened!
● Let’s contribute!
● Turns out it’s not so simple.
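The limitation can be illustrated with a small stand-in, assuming (as the slide states) that the backend is invoked only once a task has finished successfully. `RecordingLineageBackend` and `run_task` are hypothetical; only the `send_lineage` method name is borrowed from Airflow's LineageBackend interface.

```python
class RecordingLineageBackend:
    """Stand-in backend that records every send_lineage call it receives."""

    def __init__(self):
        self.sent = []

    def send_lineage(self, operator, inlets=None, outlets=None, context=None):
        self.sent.append((operator, tuple(inlets or ()), tuple(outlets or ())))


def run_task(name, fn, backend, inlets, outlets):
    """Toy task runner (hypothetical): lineage is sent only after success."""
    fn()  # if this raises, the send_lineage call below is never reached
    backend.send_lineage(name, inlets=inlets, outlets=outlets)


def broken_task():
    raise RuntimeError("boom")


backend = RecordingLineageBackend()
run_task("load", lambda: None, backend, ["raw.orders"], ["clean.orders"])
try:
    run_task("broken", broken_task, backend, [], [])
except RuntimeError:
    pass

# Only the successful task is visible; the start of either task and the
# failure of the second one never reach the backend.
print(backend.sent)
```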
The great present
● Let’s add a new interface!
● We want our plugin to be notified when TaskInstanceState changes to RUNNING, SUCCESS, or FAILED.
The great present
● SQLAlchemy allows us to listen to existing database events.
● The AirflowPlugin mechanism allows us to automatically load plugin code from external Python packages.
● Pluggy allows us to call registered plugins without needing to know what they are.
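How these pieces fit together can be imitated with the standard library alone. In this sketch `ListenerManager` is a hypothetical, much-simplified substitute for a plugin manager such as Pluggy, the property setter plays the role that database event listening plays on the slide, and the hook names are modeled on the task-instance hooks of the Airflow listener interface. None of the classes below are Airflow's own.

```python
class ListenerManager:
    """Hypothetical, stdlib-only imitation of a plugin manager."""

    def __init__(self):
        self._listeners = []

    def register(self, listener):
        self._listeners.append(listener)

    def call(self, hook_name, **kwargs):
        # Call the hook on every registered listener that implements it,
        # without knowing anything else about the listener.
        for listener in self._listeners:
            hook = getattr(listener, hook_name, None)
            if hook is not None:
                hook(**kwargs)


class TaskInstance:
    """Toy task instance: assigning a new state notifies the listeners."""

    HOOKS = {
        "running": "on_task_instance_running",
        "success": "on_task_instance_success",
        "failed": "on_task_instance_failed",
    }

    def __init__(self, task_id, manager):
        self.task_id = task_id
        self._manager = manager
        self._state = None

    @property
    def state(self):
        return self._state

    @state.setter
    def state(self, new_state):
        previous, self._state = self._state, new_state
        hook = self.HOOKS.get(new_state)
        if hook is not None:
            self._manager.call(hook, previous_state=previous, task_instance=self)


class LineageListener:
    """Example listener that implements only two of the three hooks."""

    def __init__(self):
        self.seen = []

    def on_task_instance_running(self, previous_state, task_instance):
        self.seen.append((task_instance.task_id, "running"))

    def on_task_instance_success(self, previous_state, task_instance):
        self.seen.append((task_instance.task_id, "success"))


manager = ListenerManager()
listener = LineageListener()
manager.register(listener)

ti = TaskInstance("extract", manager)
ti.state = "running"
ti.state = "success"
ti.state = "failed"  # no matching hook on this listener, so it is skipped
print(listener.seen)
```

The code that changes the state never references the listener directly, which is what allows a lineage plugin to be added without touching any DAG.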
The great present
● Okay
● The new interface is present in Airflow 2.3!
Features
● Extractors
● Built-in extractors:
○ BigQueryExtractor
○ SnowflakeExtractor
○ PostgresExtractor
○ GreatExpectationsExtractor
○ …
● Possibility to create custom extractors
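The extractor idea, one class per operator type that knows how to turn the operator's configuration into inputs and outputs, might look roughly like this. All names below (`TaskMetadata`, `CopyTableOperator`, `CopyTableExtractor`, `extract_metadata`) are hypothetical stand-ins written for this sketch, not the openlineage-airflow classes.

```python
from dataclasses import dataclass, field


@dataclass
class TaskMetadata:
    """Hypothetical result type: what a task reads and what it writes."""

    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


class CopyTableOperator:
    """Toy operator (not an Airflow class) that copies one table to another."""

    def __init__(self, task_id, source_table, target_table):
        self.task_id = task_id
        self.source_table = source_table
        self.target_table = target_table


class CopyTableExtractor:
    """Custom extractor: knows how to read lineage from CopyTableOperator."""

    operator_classnames = ["CopyTableOperator"]

    def __init__(self, operator):
        self.operator = operator

    def extract(self):
        return TaskMetadata(
            name=self.operator.task_id,
            inputs=[self.operator.source_table],
            outputs=[self.operator.target_table],
        )


# Registry from operator class name to the extractor that handles it.
EXTRACTORS = {
    name: cls for cls in [CopyTableExtractor] for name in cls.operator_classnames
}


def extract_metadata(operator):
    """Pick an extractor by operator class name; unknown operators yield no lineage."""
    extractor_cls = EXTRACTORS.get(type(operator).__name__)
    if extractor_cls is None:
        return TaskMetadata(name=getattr(operator, "task_id", "unknown"))
    return extractor_cls(operator).extract()


meta = extract_metadata(CopyTableOperator("copy", "raw.orders", "clean.orders"))
print(meta)
```

Supporting a new operator in this scheme means adding one extractor class to the registry, without changing the dispatch code.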
Features
● Additional common library to help with writing extractors
● SQL parser
● Other integrations (dbt…) can use those features as well
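To show what a SQL parser contributes to lineage, here is a deliberately naive sketch that pulls input and output tables out of a statement with regular expressions. It is not the OpenLineage parser and it ignores subqueries, CTEs, quoting, and statements such as DELETE FROM; the function name and the sample query are made up for illustration.

```python
import re

# Tables that are read: anything following FROM or JOIN.
_INPUT = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)
# Tables that are written: targets of INSERT INTO or CREATE TABLE.
_OUTPUT = re.compile(r"\b(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)


def parse_tables(sql):
    """Return (inputs, outputs) as sorted lists of table names."""
    outputs = set(_OUTPUT.findall(sql))
    inputs = set(_INPUT.findall(sql)) - outputs
    return sorted(inputs), sorted(outputs)


sql = """
INSERT INTO analytics.daily_orders
SELECT o.id, c.name
FROM raw.orders AS o
JOIN raw.customers AS c ON o.customer_id = c.id
"""

inputs, outputs = parse_tables(sql)
print("inputs:", inputs)
print("outputs:", outputs)
```

An extractor for any SQL-based operator can reuse such a function, which is why it makes sense to keep it in a shared library.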
The shiny future
● Prevalence of PythonOperator
● Can we get data directly from Hooks?
● Hooks are very diverse.
● AIP-48 solves a lot of those problems.
Apache Spark Integration
Top 3 recent features:
● Support for Spark 3.2.1
● Extensibility API
○ Possibility to write custom plugins to enrich existing OpenLineage events.
● Column-level lineage
○ Which input columns were used to produce output column X?
Other:
● Spawning Spark from your Airflow DAG? We’ll keep track of that.
● Lifecycle state change - understand the meaning of DROP, DELETE, ALTER…
● Dataset versions for Iceberg & Delta
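The column-level lineage question can be answered with a simple lookup once the lineage is recorded per output column. The nesting used here (fields, then inputFields with namespace, name, and field) is modeled on the OpenLineage column lineage facet; the dataset names, column names, and the `input_columns` helper are made up for this sketch.

```python
# Column lineage for one output dataset: each output column lists the
# input dataset columns it was derived from (illustrative values only).
column_lineage = {
    "fields": {
        "full_name": {
            "inputFields": [
                {"namespace": "example-db", "name": "raw.customers", "field": "first_name"},
                {"namespace": "example-db", "name": "raw.customers", "field": "last_name"},
            ]
        },
        "order_total": {
            "inputFields": [
                {"namespace": "example-db", "name": "raw.orders", "field": "amount"},
            ]
        },
    }
}


def input_columns(lineage, output_column):
    """Return (dataset, column) pairs used to produce the given output column."""
    entry = lineage["fields"].get(output_column, {})
    return [(f["name"], f["field"]) for f in entry.get("inputFields", [])]


print(input_columns(column_lineage, "full_name"))
```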
Apache Flink integration
Status:
● Under construction.
● We’re already able to:
○ Identify Kafka topics used as sources & sinks,
○ Fetch datasets’ schemas for Avro,
○ Include checkpoint statistics in OpenLineage events,
○ Retrieve information on Iceberg sources & sinks,
○ …
● Looking forward to publishing the first experimental version.
You can contribute too!
● github.com/OpenLineage/OpenLineage
