Creating your own ChatGPT with
Apache Airflow
@tati_alchueyr
Staff Software Engineer - Astronomer
13th July 2023 - AI Camp London Meetup
Turing test
https://marcabraham.com/2022/10/17/what-is-the-turing-test/
ChatGPT
https://chat.openai.com/
https://xkcd.com/329/
inspect(ChatGPT)
● Artificial intelligence chatbot
● Developed by OpenAI
● Proprietary machine learning model
○ Uses an LLM (Large Language Model)
○ GPT == Generative Pre-Trained Transformer
○ Fine-tuned GPT-3.5 (text-davinci-003)
● Over 100 million users
● Dataset size: 570 GB; 175 billion parameters
● Estimated cost to run per month: $3 million
https://www.theguardian.com/technology/2023/feb/02/chatgpt-100-million-users-open-ai-fastest-growing-app
https://indianexpress.com/article/technology/tech-news-technology/chatgpt-interesting-things-to-know-8334991/
https://meetanshi.com/blog/chatgpt-statistics/
help(LLM)
A Large Language Model is a type
of AI algorithm trained on huge
amounts of text data that can
understand and generate text
help(LLM)
An LLM can be characterized by 4 parameters:
● Size of the training dataset
● Cost of training
● Size of the model
● Performance after training
timeline(LLM)
https://samim.io/p/2023-04-30-evolutionary-tree-of-llms/
Proprietary LLM limitations
● Data Privacy and Security
● Dependency and Customisation
● Cost and Scalability
● Access and Availability
Open-source LLM alternatives
● LLaMA (Meta)
● Alpaca (Stanford)
● Vicuna (Berkeley, Carnegie Mellon, Stanford)
● Dolly (Databricks)
● Open Assistant (individuals)
● h2oGPT (H2O.ai)
https://bdtechtalks.com/2023/04/17/open-source-chatgpt-alternatives/
h2oGPT about
● Open-source (Apache 2.0) generative AI
● Empowers users to create their own language models
● https://gpt.h2o.ai/
● https://github.com/h2oai/h2ogpt
● https://www.youtube.com/watch?v=Coj72EzmX20&t=757s
https://bdtechtalks.com/2023/04/17/open-source-chatgpt-alternatives/
Apache Airflow
Apache Airflow is an open-source
platform for developing,
scheduling, and monitoring
batch-oriented workflows.
help(airflow)
usage(airflow)
https://github.com/apache/airflow
https://pypistats.org/packages/apache-airflow
airflow.__author__
example(workflow)
airflow.concepts
airflow providers packages
https://airflow.apache.org/docs/apache-airflow-providers/packages-ref.html
● apache-airflow-providers-airbyte
● apache-airflow-providers-alibaba
● apache-airflow-providers-amazon
● apache-airflow-providers-apache-beam
● apache-airflow-providers-apache-cassandra
● apache-airflow-providers-apache-drill
● apache-airflow-providers-apache-druid
● apache-airflow-providers-apache-flink
● apache-airflow-providers-apache-hdfs
● apache-airflow-providers-apache-hive
● apache-airflow-providers-apache-impala
● apache-airflow-providers-apache-kafka
● apache-airflow-providers-apache-kylin
● apache-airflow-providers-apache-livy
● apache-airflow-providers-apache-pig
● apache-airflow-providers-apache-pinot
● apache-airflow-providers-apache-spark
● apache-airflow-providers-apache-sqoop
● apache-airflow-providers-apprise
● apache-airflow-providers-arangodb
● apache-airflow-providers-asana
● apache-airflow-providers-atlassian-jira
● apache-airflow-providers-celery
● apache-airflow-providers-cloudant
● apache-airflow-providers-cncf-kubernetes
● apache-airflow-providers-common-sql
● apache-airflow-providers-databricks
● apache-airflow-providers-datadog
● apache-airflow-providers-dbt-cloud
● apache-airflow-providers-dingding
● apache-airflow-providers-discord
● apache-airflow-providers-docker
● apache-airflow-providers-elasticsearch
● apache-airflow-providers-exasol
● apache-airflow-providers-facebook
● apache-airflow-providers-ftp
● apache-airflow-providers-github
● apache-airflow-providers-google
● apache-airflow-providers-grpc
● apache-airflow-providers-hashicorp
airflow providers packages
https://airflow.apache.org/docs/apache-airflow-providers/packages-ref.html
● apache-airflow-providers-http
● apache-airflow-providers-imap
● apache-airflow-providers-influxdb
● apache-airflow-providers-jdbc
● apache-airflow-providers-jenkins
● apache-airflow-providers-microsoft-azure
● apache-airflow-providers-microsoft-mssql
● apache-airflow-providers-microsoft-psrp
● apache-airflow-providers-microsoft-winrm
● apache-airflow-providers-mongo
● apache-airflow-providers-mysql
● apache-airflow-providers-neo4j
● apache-airflow-providers-odbc
● apache-airflow-providers-openfaas
● apache-airflow-providers-openlineage
● apache-airflow-providers-opsgenie
● apache-airflow-providers-oracle
● apache-airflow-providers-pagerduty
● apache-airflow-providers-papermill
● apache-airflow-providers-plexus
● apache-airflow-providers-postgres
● apache-airflow-providers-presto
● apache-airflow-providers-qubole
● apache-airflow-providers-redis
● apache-airflow-providers-salesforce
● apache-airflow-providers-samba
● apache-airflow-providers-segment
● apache-airflow-providers-sendgrid
● apache-airflow-providers-sftp
● apache-airflow-providers-singularity
● apache-airflow-providers-slack
● apache-airflow-providers-smtp
● apache-airflow-providers-snowflake
● apache-airflow-providers-sqlite
● apache-airflow-providers-ssh
● apache-airflow-providers-tableau
● apache-airflow-providers-tabular
● apache-airflow-providers-telegram
● apache-airflow-providers-trino
● apache-airflow-providers-vertica
● apache-airflow-providers-zendesk
airflow example DAG
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _train_model():
    pass


with DAG(
    "train_models",
    start_date=datetime(2023, 7, 4),
    schedule="@daily") as dag:
    train_model = PythonOperator(
        task_id="train_model",
        python_callable=_train_model
    )
airflow example DAG
from datetime import datetime
from random import randint

from airflow import DAG
from airflow.operators.python import PythonOperator


def _evaluate_model():
    return randint(1, 10)


def _choose_best(ti):
    tasks = [
        "evaluate_model_a",
        "evaluate_model_b"
    ]
    accuracies = [ti.xcom_pull(task_ids=task_id) for task_id in tasks]
    best_accuracy = max(accuracies)
    for model, model_accuracy in zip(tasks, accuracies):
        if model_accuracy == best_accuracy:
            return model


with DAG(
    "evaluate_models",
    start_date=datetime(2023, 7, 4),
    schedule="@daily") as dag:
    evaluate_model_a = PythonOperator(
        task_id="evaluate_model_a",
        python_callable=_evaluate_model
    )
    evaluate_model_b = PythonOperator(
        task_id="evaluate_model_b",
        python_callable=_evaluate_model
    )
    choose_best_model = PythonOperator(
        task_id="choose_best_model",
        python_callable=_choose_best
    )
    [evaluate_model_a, evaluate_model_b] >> choose_best_model
airflow example DAG
airflow example of pipelines
Building an AI Chat Bot
with Airflow
Airflow to build a LLM Chat Bot
● Open-source and cloud-agnostic: you are not locked in!
● Same orchestration tool for ELT/ETL and ML
● Automate the steps of a model pipeline, using Airflow to:
○ Monitor the status and duration of tasks over time
○ Retry on failures
○ Send notifications (email, Slack, others) to the team
● Dynamically trigger tasks using different hyperparameters
● Dynamically select models based on their scores
● Trigger model pipelines based on dataset changes
● Smoothly run tasks in VMs, containers or Kubernetes
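The retry and notification bullets above can be wired up through a DAG's default_args. A minimal sketch, assuming the DAG name, training callable, and notification callback below are hypothetical placeholders (a real setup would send the alert via the smtp or Slack provider):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def _fine_tune():
    # hypothetical model-training step
    pass


def _notify_team(context):
    # hypothetical callback: plug in email/Slack delivery here
    print(f"Task {context['task_instance'].task_id} failed")


with DAG(
    "fine_tune_llm",  # hypothetical DAG name
    start_date=datetime(2023, 7, 4),
    schedule="@daily",
    default_args={
        "retries": 3,                           # retry on failures
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": _notify_team,    # alert the team
    },
) as dag:
    PythonOperator(task_id="fine_tune", python_callable=_fine_tune)
```

With this in place, Airflow retries each failing task three times and only then invokes the failure callback.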
Use the KubernetesPodOperator
● Create tasks that run in Kubernetes pods
● Use node_affinity to schedule the job on a node pool with the
desired memory/CPU/GPU
● Use k8s.V1VolumeMount to efficiently mount volumes (e.g.
NFS) so large models can be shared across Pods (evaluate,
serve)
https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html
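A minimal sketch of such a task, assuming the cncf-kubernetes provider is installed; the image name, node-pool label, NFS server, and paths are hypothetical placeholders:

```python
from kubernetes.client import models as k8s

from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# hypothetical NFS volume holding large model files, shared across Pods
model_volume = k8s.V1Volume(
    name="models",
    nfs=k8s.V1NFSVolumeSource(server="nfs.internal", path="/models"),
)
model_mount = k8s.V1VolumeMount(name="models", mount_path="/mnt/models")

evaluate_model = KubernetesPodOperator(
    task_id="evaluate_model",
    name="evaluate-model",
    image="my-registry/llm-eval:latest",  # hypothetical image
    cmds=["python", "evaluate.py"],
    # pin the Pod to a GPU node pool (label is cluster-specific);
    # the affinity parameter allows richer placement rules
    node_selector={"cloud.google.com/gke-nodepool": "gpu-pool"},
    container_resources=k8s.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1", "memory": "32Gi"},
    ),
    volumes=[model_volume],
    volume_mounts=[model_mount],
)
```

Mounting the shared volume in both the evaluate and serve Pods avoids copying multi-GB model files between tasks.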
Use Dataset-aware scheduling
● Schedule a DAG to run when tasks in other DAGs successfully update the datasets it consumes
from airflow.datasets import Dataset

with DAG("ingest_dataset", ...):
    MyOperator(
        # this task updates source-data.parquet
        outlets=[Dataset("s3://dataset-bucket/source-data.parquet")],
        ...,
    )

with DAG("train_model",
    # this DAG runs when source-data.parquet is updated (by the "ingest_dataset" DAG)
    schedule=[Dataset("s3://dataset-bucket/source-data.parquet")],
    ...,
):
https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/datasets.html
Use Dynamic Task Mapping
● Create a variable number of tasks at runtime based upon the
data created by the previous task
● Can be useful in several situations, including choosing the most
adequate model
● Supports map/reduce
https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dynamic-task-mapping.html
Dynamic Task Mapping
from __future__ import annotations

from datetime import datetime

from airflow import DAG
from airflow.decorators import task

with DAG(
    dag_id="example_dynamic_task_mapping",
    start_date=datetime(2022, 3, 4)
) as dag:

    @task
    def evaluate_model(model_path):
        (...)
        return evaluation_metrics

    @task
    def choose_model(metrics_by_model):
        (...)
        return chosen_one

    models_metrics = evaluate_model.expand(
        model_path=["/data/model1", "/data/model2", "/data/model3"]
    )
    choose_model(models_metrics)
Apache Airflow
Community
Apache Airflow Community
https://airflow.apache.org/community/
https://github.com/apache/airflow
https://www.meetup.com/london-apache-airflow-meetup/
https://www.astronomer.io/
@tati_alchueyr
tatiana.alchueyr@astronomer.io
Thank you!
