© Copyright. All rights reserved. Not to be reproduced without prior written consent.
1. Why do we need yet another (open-source) Copilot?
2. How can we build one?
3. Architecture and evaluation
4. DEMO
…how to turn best practices into an AI coding assistant
(Data) Context is king!
● Explicit and precise data context of your whole data platform
● Data transformation codebase
● Data models with comments and table relationships
● Other user queries
● Lineage and human-curated dataset descriptions from data catalogs
Data Assistants landscape
● Open-source tools, such as WrenAI, Vanna.AI and Dataherald, focus on Text-to-SQL embedded in web interfaces (i.e. chatbots or their own SQL editors) meant for non-technical users.
● Closed-source AI-powered assistants built into the BigQuery (SQL + Dataform), Snowflake (SQL) and Databricks (SQL + Python) web interfaces are more like black boxes, not meant for customization.
● Missing: an Analytics Engineer Copilot with dbt/SQL support
Customized and specialized models are the future.
● Many other small (7-34B) models licensed for commercial use, e.g.:
✓ starcoder2
✓ dolphincoder
✓ deepseek-coder
✓ OpenCodeInterpreter
✓ Llama3
sqlcoder-7b and others
May 9th updates
How to turn your best practices into Copilots?
● Vector database as a knowledge base - what?
● Prompts as instructions following best practices - how?
● LLM to combine both…
Retrieval-Augmented Generation (RAG)
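A minimal sketch of how the three pieces fit together: a knowledge base supplies the "what", a prompt template supplies the "how", and the combined text is what an LLM would receive. The snippets, file names and the word-overlap retriever are illustrative assumptions only; a real setup would retrieve from a vector database rather than an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    source: str  # hypothetical file the snippet came from
    text: str    # curated description or best practice


# Hypothetical knowledge base standing in for a vector database.
KNOWLEDGE_BASE = [
    Snippet("models/customers.sql", "customers: one row per customer, key customer_id"),
    Snippet("models/orders.sql", "orders: one row per order, joins to customers on customer_id"),
    Snippet("docs/style.md", "Use CTEs instead of nested subqueries"),
]


def retrieve(question, k=2):
    """Toy retriever: rank snippets by the number of words shared with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda s: len(q_words & set(s.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, k=2):
    """Combine retrieved context (what) with fixed instructions (how)."""
    context = "\n".join(f"-- {s.source}: {s.text}" for s in retrieve(question, k))
    return (
        "You are an analytics engineer. Follow the team's best practices.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer with SQL only."
    )


if __name__ == "__main__":
    print(build_prompt("total orders per customer"))
```

The resulting string is what would be sent to the model; swapping the retriever or the instructions does not change the overall shape.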
RAG for Text-to-SQL
Hybrid search
• combination of keyword and vector search
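One common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The slide does not say which fusion method is used, so RRF here is an assumption chosen for illustration, and the table names in the two result lists are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; k=60 is the conventional RRF constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results of a keyword search and a vector search.
keyword_hits = ["orders", "payments", "customers"]
vector_hits = ["customers", "orders", "stg_orders"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])

if __name__ == "__main__":
    print(fused)
```

Items that appear near the top of both lists end up first, which is the behaviour hybrid search is after.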
Vector search
● a technique used to search for similar items based on their vector representations, called embeddings
● Approximate Nearest Neighbours algorithms
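To make the idea concrete, here is an exact (brute-force) nearest-neighbour search using cosine similarity. Approximate Nearest Neighbours algorithms trade a little accuracy for speed by avoiding this full scan. The three-dimensional embeddings are invented for the example; real embeddings come from an embedding model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, embeddings, k=2):
    """Return the k names whose vectors are most similar to the query."""
    ranked = sorted(
        embeddings,
        key=lambda name: cosine_similarity(query, embeddings[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings for three table descriptions.
EMBEDDINGS = {
    "orders": [0.9, 0.1, 0.0],
    "customers": [0.1, 0.9, 0.0],
    "payments": [0.8, 0.2, 0.1],
}

if __name__ == "__main__":
    print(nearest([1.0, 0.0, 0.0], EMBEDDINGS))
```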
Data Copilot RAG architecture
Data programming is more about repeatable tasks!
GID Data Copilot (GDC)
● An extensible AI programming assistant for SQL and dbt code
● Powered by:
○ Large Language Models (SOTA LLMs)
○ Robust Retrieval-Augmented Generation (RAG) architecture
○ Hybrid search techniques
○ Fast Vector Database
○ Curated Prompts
○ Built-in Data commands
Continue - an open-source copilot
● support for both tasks and tab autocompletion
● highly extensible
○ use any LLM you wish, including multiple specialized models for different languages or tasks
○ support for many model providers, such as Ollama, vLLM, LM Studio
○ custom context providers for more control over LLM augmentation
○ custom slash commands that can combine your own prompts, contexts and models for specialized, reusable tasks
● support for VS Code and JetBrains
● secure (i.e. can be run locally, on-premise or in a cloud VPC)
● translate your best practices into "slash data commands"
Continue - a custom context provider
dbt/SQL task = custom(context + prompt + model)
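The equation above can be read as a small data structure: a task bundles a context provider, a prompt template and a model name. The sketch below is plain Python and is not Continue's actual extension API; the class, the context function and the command name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DataTask:
    name: str                                # hypothetical slash-command name
    context_provider: Callable[[str], str]   # returns context for the user input
    prompt_template: str                     # uses {context} and {input} placeholders
    model: str                               # model the task is routed to

    def render(self, user_input):
        """Build the request this task would send to its model."""
        context = self.context_provider(user_input)
        prompt = self.prompt_template.format(context=context, input=user_input)
        return {"model": self.model, "prompt": prompt}


def schema_context(user_input):
    """Hypothetical context provider returning a fixed schema description."""
    return "orders(order_id, customer_id, amount)"


dbt_model_task = DataTask(
    name="/dbt-model",
    context_provider=schema_context,
    prompt_template="Schema:\n{context}\n\nWrite a dbt model for: {input}",
    model="sqlcoder-7b-2",
)

if __name__ == "__main__":
    print(dbt_model_task.render("daily revenue"))
```

Keeping the three parts separate means a task can swap its model or its context source without touching the prompt.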
Ollama
● fast and easy self-hosting of LLMs almost everywhere
● hybrid CPU+GPU inference
● powered by llama.cpp
● rich library of existing LLMs in different flavours
● GGUF - fast and memory-efficient quantization for GPU
● Serve model with one-liner:
ollama run starcoder2:7b
● vLLM for production deployments
(Our video tutorial)
Ollama - custom model in 4 steps
1. Download a model in the GGUF format
2. Create a Modelfile, e.g.:
FROM ./sqlcoder-7b-q5_k_m.gguf
TEMPLATE """{{ .Prompt }}"""
PARAMETER stop "<|endoftext|>"
3. Create a model with Ollama
ollama create sqlcoder-7b-2 -f Modelfile
4. Serve it
ollama run sqlcoder-7b-2
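Once the model is created, it can also be queried programmatically through Ollama's local REST API. The sketch below assumes an Ollama server listening on its default address (localhost, port 11434) and uses the model name from the steps above; the prompt is just an example.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt):
    """Prepare a non-streaming generate request for the given model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    # Requires a running Ollama server with the model created beforehand.
    print(generate("sqlcoder-7b-2", "Write a SQL query that counts rows in orders."))
```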
LanceDB
● fast (Rust), serverless and embeddable - DuckDB for ML
● powered by the Lance file format for ML (versioning, zero-copy)
● multi-modal
● support for hybrid (semantic + keyword) search
● LlamaIndex integration
● Python API and fast data exchange with Polars and Arrow
Technical architecture
Question representation [1]
[1] Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation
LLMs evaluation 1/2
● Not meant to be yet another benchmark for just SQL generation, such as Spider, sql-eval or BIRD-SQL
● Jaffle Shop example - simple but not trivial
● Zero-shot – Agentic Workflow with Reflection TBD
● 4 typical data tasks
○ Data model exploration/discovery
○ SQL: an easy one (single table) and a more complex one (joins with sorting and aggregations)
○ dbt model generation
○ dbt tests generation based on rules
LLMs evaluation 2/2
+ - perfect or almost perfect
+/- - not optimal or some minor tweaks needed
-/+ - not very helpful, serious hallucinations
- - totally useless
gpt4 vs dbrx vs sqlcoder-7b-2 vs llama-3-sqlcoder-8b
fine-tuning impact: llama3-8b vs llama-3-sqlcoder-8b
Quantization effects: dbrx 8/4/2 bit
A handful of conclusions…with a grain of salt
● NL-to-SQL and dbt code generation are challenging
● commercial models are in most cases still better… but
● there are very promising open-source 7-30B alternatives
● smaller models perform better than larger ones after quantization
● SQL-dedicated and fine-tuned models can turn out a bit disappointing (beam search?), e.g.:
○ unnecessary joins elimination
○ wrong data type inference
○ occasional hallucinations
Future directions
● Implementation of in-context learning, such as Query Similarity Selection (few-shot strategy) and Agentic RAG with Reflection Strategy
● Model fine-tuning (dbt)
● Data Modeling (DV 2.0)
● Various SQL dialects and platform migrations
● Prompt optimizations, e.g. with DSPy
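As an illustration of the few-shot direction listed above, query similarity selection picks previously answered questions that resemble the new one and places them in the prompt as examples. The sketch below uses Jaccard word overlap as a stand-in similarity measure, which is an assumption for brevity (an embedding-based similarity would be the more likely choice), and the example question/SQL pairs are invented.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two questions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Hypothetical pool of previously answered questions with their SQL.
EXAMPLES = [
    ("total revenue per customer",
     "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"),
    ("list all customers",
     "SELECT * FROM customers"),
    ("number of orders per day",
     "SELECT order_date, COUNT(*) FROM orders GROUP BY order_date"),
]


def select_few_shot(question, k=1):
    """Return the k examples whose question is most similar to the new one."""
    ranked = sorted(EXAMPLES, key=lambda ex: jaccard(question, ex[0]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    for q, sql in select_few_shot("total orders per customer", k=2):
        print(f"Q: {q}\nSQL: {sql}\n")
```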
DEMO

Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
