The document provides an overview of Apache Airflow, an open-source platform for developing, scheduling, and monitoring workflows, with a focus on its application in building AI chatbots using large language models (LLMs). It discusses the characteristics of LLMs, the limitations of proprietary models, and the advantages of open-source alternatives. It then walks through practical examples of how Airflow can automate and manage machine learning model pipelines, including dynamic task mapping and dataset-aware scheduling.
inspect(ChatGPT)
● Artificial intelligence chatbot
● Developed by OpenAI
● Proprietary machine learning model
○ Uses an LLM (Large Language Model)
○ GPT == Generative Pre-Trained Transformer
○ Fine-tuned GPT-3.5 (text-davinci-003)
● Over 100 million users
● Dataset size: 570 GB; 175 billion parameters
● Estimated cost to run per month: $3 million
https://www.theguardian.com/technology/2023/feb/02/chatgpt-100-million-users-open-ai-fastest-growing-app
https://indianexpress.com/article/technology/tech-news-technology/chatgpt-interesting-things-to-know-8334991/
https://meetanshi.com/blog/chatgpt-statistics/
help(LLM)
A Large Language Model is a type of AI algorithm trained on huge amounts of text data that can understand and generate text.
help(LLM)
An LLM can be characterized by 4 parameters:
● Size of the training dataset
● Cost of training
● Size of the model
● Performance after training
about(h2oGPT)
● Open-source (Apache 2.0) generative AI
● Empowers users to create their own language models
● https://gpt.h2o.ai/
● https://github.com/h2oai/h2ogpt
● https://www.youtube.com/watch?v=Coj72EzmX20&t=757s
https://bdtechtalks.com/2023/04/17/open-source-chatgpt-alternatives/
airflow example DAG
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def _train_model():
    # placeholder for the actual training logic
    pass

with DAG(
    "train_models",
    start_date=datetime(2023, 7, 4),
    schedule="@daily") as dag:

    train_model = PythonOperator(
        task_id="train_model",
        python_callable=_train_model
    )
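Declared inside the with DAG(...) block, the PythonOperator is attached to "train_models" automatically; the "@daily" schedule runs the task once per day starting from start_date.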
airflow example DAG
from airflow import DAG
from airflow.operators.python import PythonOperator
from random import randint
from datetime import datetime

def _evaluate_model():
    # stand-in for a real evaluation: return a random accuracy score
    return randint(1, 10)

def _choose_best(ti):
    tasks = [
        "evaluate_model_a",
        "evaluate_model_b"
    ]
    # pull the accuracy each evaluation task pushed to XCom
    accuracies = [ti.xcom_pull(task_ids=task_id) for task_id in tasks]
    best_accuracy = max(accuracies)
    for model, model_accuracy in zip(tasks, accuracies):
        if model_accuracy == best_accuracy:
            return model

with DAG(
    "evaluate_models",
    start_date=datetime(2023, 7, 4),
    schedule="@daily") as dag:

    evaluate_model_a = PythonOperator(
        task_id="evaluate_model_a",
        python_callable=_evaluate_model
    )
    evaluate_model_b = PythonOperator(
        task_id="evaluate_model_b",
        python_callable=_evaluate_model
    )
    choose_best_model = PythonOperator(
        task_id="choose_best_model",
        python_callable=_choose_best
    )
    [evaluate_model_a, evaluate_model_b] >> choose_best_model
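The >> on the last line sets the dependency: choose_best_model runs only after both evaluation tasks succeed, and the accuracies they return travel between tasks through XCom, which is what _choose_best pulls.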
Airflow to build an LLM Chat Bot
● Open-source and cloud-agnostic: you are not locked in!
● Same orchestration tool for ELT/ETL and ML
● Automate the steps of a model pipeline (see the sketch after this list), using Airflow to:
○ Monitor the status and duration of tasks over time
○ Retry on failures
○ Send notifications (email, Slack, others) to the team
● Dynamically trigger tasks using different hyperparameters
● Dynamically select models based on their scores
● Trigger model pipelines based on dataset changes
● Smoothly run tasks in VMs, containers or Kubernetes
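As a sketch of the automation bullets above (retries, notifications), a DAG can set these through default_args; the DAG id and the _notify_team callback are hypothetical stand-ins for a real email/Slack integration:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def _notify_team(context):
    # on_failure_callback receives the task context; plug email/Slack in here
    task_id = context["task_instance"].task_id
    print(f"Task {task_id} failed; alerting the team")

def _train_model():
    pass

with DAG(
    "train_models_with_alerts",  # hypothetical DAG id
    start_date=datetime(2023, 7, 4),
    schedule="@daily",
    default_args={
        "retries": 3,                         # retry each failed task up to 3 times
        "retry_delay": timedelta(minutes=5),  # wait between attempts
        "on_failure_callback": _notify_team,  # fires once retries are exhausted
    },
) as dag:
    train_model = PythonOperator(
        task_id="train_model",
        python_callable=_train_model,
    )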
Use the KubernetesPodOperator
● Create tasks which run in Kubernetes pods
● Use node_affinity to allocate jobs to the node pool with the
desired memory/CPU/GPU
● Use k8s.V1VolumeMount to efficiently mount volumes (e.g.
NFS) to access large models from different Pods (evaluate,
serve); see the sketch below
https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html
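A minimal sketch, assuming hypothetical names (the DAG id, container image, NFS server, and node-pool label are not from the slides) and the 2023-era import path of the cncf-kubernetes provider, which varies by provider version:

from datetime import datetime
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)
from kubernetes.client import models as k8s

# Pin the pod to the node pool with the desired hardware (label/pool names assumed)
gpu_affinity = k8s.V1Affinity(
    node_affinity=k8s.V1NodeAffinity(
        required_during_scheduling_ignored_during_execution=k8s.V1NodeSelector(
            node_selector_terms=[
                k8s.V1NodeSelectorTerm(
                    match_expressions=[
                        k8s.V1NodeSelectorRequirement(
                            key="cloud.google.com/gke-nodepool",
                            operator="In",
                            values=["gpu-pool"],
                        )
                    ]
                )
            ]
        )
    )
)

# Shared NFS volume so evaluate/serve pods read the same large models
models_volume = k8s.V1Volume(
    name="models",
    nfs=k8s.V1NFSVolumeSource(server="nfs.internal", path="/models"),
)
models_mount = k8s.V1VolumeMount(name="models", mount_path="/models")

with DAG("k8s_model_tasks", start_date=datetime(2023, 7, 4), schedule=None) as dag:
    evaluate_model = KubernetesPodOperator(
        task_id="evaluate_model",
        name="evaluate-model",
        image="registry.internal/evaluate:latest",  # hypothetical image
        cmds=["python", "evaluate.py", "--model-dir", "/models"],
        affinity=gpu_affinity,
        volumes=[models_volume],
        volume_mounts=[models_mount],
    )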
Use Dataset-aware scheduling
● Schedule a DAG to run when tasks (from other DAGs) complete successfully and update its input Datasets
from airflow.datasets import Dataset
with DAG("ingest_dataset", ...):
    MyOperator(
        # this task updates source-data.parquet
        outlets=[Dataset("s3://dataset-bucket/source-data.parquet")],
        ...,
    )

with DAG("train_model",
    # this DAG runs when source-data.parquet is updated (by DAG "ingest_dataset")
    schedule=[Dataset("s3://dataset-bucket/source-data.parquet")],
    ...,
):
https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/datasets.html
Use Dynamic Task Mapping
● Create a variable number of tasks at runtime based upon the
data created by the previous task
● Can be useful in several situations, including choosing the most
suitable model
● Supports map/reduce
https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dynamic-task-mapping.html
Dynamic Task Mapping
from __future__ import annotations
from datetime import datetime
from airflow import DAG
from airflow.decorators import task

with DAG(
    dag_id="example_dynamic_task_mapping",
    start_date=datetime(2022, 3, 4)
) as dag:

    @task
    def evaluate_model(model_path):
        ...  # evaluate the model stored at model_path
        return evaluation_metrics

    @task
    def choose_model(metrics_by_model):
        ...  # reduce step: pick the best model from the mapped results
        return chosen_one

    # map step: one evaluate_model task instance per model path
    models_metrics = evaluate_model.expand(
        model_path=["/data/model1", "/data/model2", "/data/model3"]
    )
    choose_model(models_metrics)
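At runtime, expand creates one evaluate_model task instance per model path (the map step), and choose_model receives the collected return values of all mapped instances (the reduce step), which is the map/reduce support mentioned above.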