1 | © Copyright 8/16/23 Zilliz
1 | © Copyright 8/16/23 Zilliz
Stephen Batifol | Zilliz
Webinar
Multimodal RAG using vLLM
and Pixtral
2 | © Copyright 8/16/23 Zilliz
2 | © Copyright 8/16/23 Zilliz
Stephen Batifol
Developer Advocate, Zilliz / Milvus
About Me
stephen.batifol@zilliz.com
linkedin.com/in/stephen-batifol/
@stephenbtl
3 | © Copyright 8/16/23 Zilliz
3 | © Copyright 8/16/23 Zilliz
| © Copyright 8/16/23 Zilliz
3
Milvus is an Open-Source Vector Database to
store, index, manage, and use the massive
number of embedding vectors generated by
deep neural networks and LLMs.
contributors
283
stars docker pulls
68M
forks
3.0K
+
33K
Milvus: The most widely-adopted vector database
4 | © Copyright 8/16/23 Zilliz
4 | © Copyright 8/16/23 Zilliz
● pip install on your laptop
● Plug into your favorite AI dev tools
● Push to production with a single line of code
Easy to start
5 | © Copyright 8/16/23 Zilliz
5 | © Copyright 8/16/23 Zilliz
Bulk Import GPU, Intel & ARM
CPU support
Disk Based
Index
Tiered Storage
Million+ level
tenant support
Hybrid Search
Dense & Sparse
RBAC, TLS,
Encryption
Float, Binary, &
Sparse Vector
Tag+Vector
Optimized Filtering
Dynamic Schema
Feature Rich
6 | © Copyright 8/16/23 Zilliz
6 | © Copyright 8/16/23 Zilliz
Milvus Lite Milvus Standalone Milvus Distributed
● Ideal for prototyping,
small scale
experiments.
● Easy to set up and
use, pip install
pymilvus
● Scale to ≈1M vectors
● Run on K8s
● Load balancer and
Multi-Node
Management
● Scaling of each
component
independently
● Scale to 100B
vectors
● Single-Node
Deployment
● Bundled in a single
Docker Image
● Supports Primary/
Secondary
● Scale up to 100M
vectors
Ready to scale 🚀
Write your code once, and run it everywhere, at scale!
● API and SDK are the same
7 | © Copyright 8/16/23 Zilliz
7 | © Copyright 8/16/23 Zilliz
Retrieval Augmented
Generation RAG
Expand LLMs' knowledge by
incorporating external data sources
into LLMs and your AI applications.
Match user behavior or content
features with other similar ones to
make effective recommendations.
Recommender System
Search for semantically similar
texts across vast amounts of
natural language documents.
Text/ Semantic Search
Image Similarity Search
Identify and search for visually
similar images or objects from a
vast collection of image libraries.
Video Similarity Search
Search for similar videos, scenes,
or objects from extensive
collections of video libraries.
Audio Similarity Search
Find similar audios in large datasets
for tasks like genre classification or
speech recognition
Molecular Similarity Search
Search for similar substructures,
superstructures, and other
structures for a specific molecule.
Anomaly Detection
Detect data points, events, and
observations that deviate
significantly from the usual pattern
Multimodal Similarity Search
Search over multiple types of data
simultaneously, e.g. text and
images
Common AI Use Cases
8 | © Copyright 8/16/23 Zilliz
8 | © Copyright 8/16/23 Zilliz
Use Case: Drug Discovery
Vectors: 12 Billion
Reqʼts: High Recall
Index: BIN_FLAT
Use Case: Data Search
Vectors: 2 Billion
Reqʼts: 200 ms, Cost mgmt
Index: DiskANN for cost savings
Use Case: Image Search
Vectors: 20 Billion
Reqʼts: High Insertion, Cost
Index: Disk Based Index
Use Case: Recommender System
Vectors: 20 Billion
Reqʼts: 5,000 QPS
Index: HNSW & CAGRA
Industry leaders already use vector search in their
apps
9 | © Copyright 8/16/23 Zilliz
9 | © Copyright 8/16/23 Zilliz
Well-connected in the AI infrastructure
Framework
Hardware
Infrastructure
Embedding Models LLMs
Software Infrastructure
Vector Database
10 | © Copyright 8/16/23 Zilliz
10 | © Copyright 8/16/23 Zilliz
10 | © Copyright 8/16/23 Zilliz
10 | © Copyright 8/16/23 Zilliz
Introduction to Vector DB
and Vector Search
11 | © Copyright 8/16/23 Zilliz
11 | © Copyright 8/16/23 Zilliz
Vectors unlock Unstructured Data
12 | © Copyright 8/16/23 Zilliz
12 | © Copyright 8/16/23 Zilliz
Vector Space
13 | © Copyright 8/16/23 Zilliz
13 | © Copyright 8/16/23 Zilliz
Vectors are for more than just text and images
14 | © Copyright 8/16/23 Zilliz
14 | © Copyright 8/16/23 Zilliz
How Similarity Search Works
Vn, 1
…
…
…
1
2
3
4
5
Transform into
Vectors
Unstructured Data
Images
User Generated
Content
Video
Documents
Audio
Vector Embeddings
Perform Approximate
Nearest Neighbor
Similarity Search
Perform Query
Get Results
Store in Vector Database
15 | © Copyright 8/16/23 Zilliz
15 | © Copyright 8/16/23 Zilliz
15 | © Copyright 8/16/23 Zilliz
15 | © Copyright 8/16/23 Zilliz
Embedding Models
16 | © Copyright 8/16/23 Zilliz
16 | © Copyright 8/16/23 Zilliz
Embeddings models workhorses of AI apps
17 | © Copyright 8/16/23 Zilliz
17 | © Copyright 8/16/23 Zilliz
| © Copyright 8/16/23 Zilliz
17
Please! 🙏
Use Embedding Models
trained on Similar Data!
18 | © Copyright 8/16/23 Zilliz
18 | © Copyright 8/16/23 Zilliz
18 | © Copyright 8/16/23 Zilliz
18 | © Copyright 8/16/23 Zilliz
Multimodal Embeddings
19 | © Copyright 8/16/23 Zilliz
19 | © Copyright 8/16/23 Zilliz
Visual + language embeddings CLIP-like)
20 | © Copyright 8/16/23 Zilliz
20 | © Copyright 8/16/23 Zilliz
One embedding space, six modalities ImageBind)
Source: Girdhar, et al.
21 | © Copyright 8/16/23 Zilliz
21 | © Copyright 8/16/23 Zilliz
LLMs are becoming natively multimodal…
22 | © Copyright 8/16/23 Zilliz
22 | © Copyright 8/16/23 Zilliz
… and the best embedding models are too
23 | © Copyright 8/16/23 Zilliz
23 | © Copyright 8/16/23 Zilliz
| © Copyright 8/16/23 Zilliz
23
RAG
Retrieval Augmented Generation)
24 | © Copyright 8/16/23 Zilliz
24 | © Copyright 8/16/23 Zilliz
Basic Idea
Use RAG to force the LLM to work with your data
by injecting it via a vector database like Milvus
25 | © Copyright 8/16/23 Zilliz
25 | © Copyright 8/16/23 Zilliz
Basic RAG Architecture
26 | © Copyright 8/16/23 Zilliz
26 | © Copyright 8/16/23 Zilliz
Question + Context
Question
Vanilla RAG is no longer enough…
Gen AI Model
Reliable Answers
Your
Documents
Embedding Model
Milvus
Search
What is the default
AUTOINDEX distance
metric in Milvus
Client?
The default
AUTOINDEX distance
metric in Milvus
Client is L2.
27 | © Copyright 8/16/23 Zilliz
27 | © Copyright 8/16/23 Zilliz
Question + Context
Question
… we need multimodal RAG
Pixtral
Reliable Answers
Multimodal Embeddings
Milvus
Search
What kind of music
did they play in the
pre-show?
The musician played
improvised electronic
music.
28 | © Copyright 8/16/23 Zilliz
28 | © Copyright 8/16/23 Zilliz
| © Copyright 8/16/23 Zilliz
28
Building a Self-Hosted Multimodal
RAG System
Using Milvus and vLLM
29 | © Copyright 8/16/23 Zilliz
29 | © Copyright 8/16/23 Zilliz
● "We are deprecating your model!"
● Escape the Algorithm Garden - Customization, and
freedom to choose the best model for your specific
needs
● Own your AI destiny: Iterate without external
dependencies.
Why Self Host?
30 | © Copyright 8/16/23 Zilliz
30 | © Copyright 8/16/23 Zilliz
| © Copyright 8/16/23 Zilliz
30
31 | © Copyright 8/16/23 Zilliz
31 | © Copyright 8/16/23 Zilliz
Self-Hosted Multimodal RAG
● Processes multiple data types (text, images, audio, video)
● Runs completely under your control
● Uses open-source
● Scales efficiently
32 | © Copyright 8/16/23 Zilliz
32 | © Copyright 8/16/23 Zilliz
● Milvus: Vector DB
● vLLM: Inference and serving
● Koyeb: Infrastructure Layer
● Pixtral: Multimodal model 400M vision
encoder + 12B decoder)
Tech Stack
33 | © Copyright 8/16/23 Zilliz
33 | © Copyright 8/16/23 Zilliz
Why vLLM?
Wide range of model support
● 40+ model architectures including
vision language models
● Collaborating with model vendors
Diverse hardware support
● NVIDIA, AMD, Intel GPUs
● Intel/AMD CPU
● Inferentia, TPU, Gaudi
End-to-end inference optimizations
● Paged Attention
● Speculative decoding
● Quantization GPTQ, AWQ, FP8
● Automatic prefix caching
34 | © Copyright 8/16/23 Zilliz
34 | © Copyright 8/16/23 Zilliz
Computing Attention without Cache
35 | © Copyright 8/16/23 Zilliz
35 | © Copyright 8/16/23 Zilliz
Computing Attention with KV Cache
36 | © Copyright 8/16/23 Zilliz
36 | © Copyright 8/16/23 Zilliz
● Autoscaling 🚀
● Scale to Zero 💲
● Build and Deploy almost everything 🛠
● Distributed Globally 🌍
Koyeb - Serverless AI Infrastructure
37 | © Copyright 8/16/23 Zilliz
37 | © Copyright 8/16/23 Zilliz
● Natively multimodal
● Strong performance on multimodal tasks, excels
in instruction following
● Architecture:
○ 400M parameter vision encoder trained from scratch
○ 12B parameter multimodal decoder based on Mistral
Nemo
○ Supports variable image sizes and aspect ratios
Pixtral from Mistral AI
38 | © Copyright 8/16/23 Zilliz
38 | © Copyright 8/16/23 Zilliz
Pixtral Architecture
39 | © Copyright 8/16/23 Zilliz
39 | © Copyright 8/16/23 Zilliz
40 | © Copyright 8/16/23 Zilliz
40 | © Copyright 8/16/23 Zilliz
Multimodal Architecture
41 | © Copyright 8/16/23 Zilliz
41 | © Copyright 8/16/23 Zilliz
Storage:
● Milvus for different modalities
● Efficient indexing and retrieval
Query Processing:
● Context retrieval from vector store
● Multimodal understanding with
Pixtral
What is it doing?
Video Processing:
● Frame extraction 0.2 FPS
● Audio transcription Whisper)
● Metadata extraction
Embeddings:
● Images: OpenAI CLIP
● Text: Mistral Embedding model
42 | © Copyright 8/16/23 Zilliz
42 | © Copyright 8/16/23 Zilliz
Complete Control
● No unexpected API changes
● Full visibility into the system
● Customizable components
Privacy & Security
● Data stays in your infrastructure
● No external API dependencies
Scalability
● Horizontal scaling with Milvus
● Efficient resource use with vLLM
● Flexible deployment options
Benefits
43 | © Copyright 8/16/23 Zilliz
43 | © Copyright 8/16/23 Zilliz
| © Copyright 8/16/23 Zilliz
43
Demo!
44 | © Copyright 8/16/23 Zilliz
44 | © Copyright 8/16/23 Zilliz
milvus.io
github.com/milvus-io/
@milvusio
@stephenbtl
/in/stephen-batifol
Thank you
45 | © Copyright 8/16/23 Zilliz
45 | © Copyright 8/16/23 Zilliz

Deploying a Multimodal RAG System Using Open Source Milvus, LlamaIndex, and vLLM

  • 1.
    1 | ©Copyright 8/16/23 Zilliz 1 | © Copyright 8/16/23 Zilliz Stephen Batifol | Zilliz Webinar Multimodal RAG using vLLM and Pixtral
  • 2.
    2 | ©Copyright 8/16/23 Zilliz 2 | © Copyright 8/16/23 Zilliz Stephen Batifol Developer Advocate, Zilliz / Milvus About Me stephen.batifol@zilliz.com linkedin.com/in/stephen-batifol/ @stephenbtl
  • 3.
    3 | ©Copyright 8/16/23 Zilliz 3 | © Copyright 8/16/23 Zilliz | © Copyright 8/16/23 Zilliz 3 Milvus is an Open-Source Vector Database to store, index, manage, and use the massive number of embedding vectors generated by deep neural networks and LLMs. contributors 283 stars docker pulls 68M forks 3.0K + 33K Milvus: The most widely-adopted vector database
  • 4.
    4 | ©Copyright 8/16/23 Zilliz 4 | © Copyright 8/16/23 Zilliz ● pip install on your laptop ● Plug into your favorite AI dev tools ● Push to production with a single line of code Easy to start
  • 5.
    5 | ©Copyright 8/16/23 Zilliz 5 | © Copyright 8/16/23 Zilliz Bulk Import GPU, Intel & ARM CPU support Disk Based Index Tiered Storage Million+ level tenant support Hybrid Search Dense & Sparse RBAC, TLS, Encryption Float, Binary, & Sparse Vector Tag+Vector Optimized Filtering Dynamic Schema Feature Rich
  • 6.
    6 | ©Copyright 8/16/23 Zilliz 6 | © Copyright 8/16/23 Zilliz Milvus Lite Milvus Standalone Milvus Distributed ● Ideal for prototyping, small scale experiments. ● Easy to set up and use, pip install pymilvus ● Scale to ≈1M vectors ● Run on K8s ● Load balancer and Multi-Node Management ● Scaling of each component independently ● Scale to 100B vectors ● Single-Node Deployment ● Bundled in a single Docker Image ● Supports Primary/ Secondary ● Scale up to 100M vectors Ready to scale 🚀 Write your code once, and run it everywhere, at scale! ● API and SDK are the same
  • 7.
    7 | ©Copyright 8/16/23 Zilliz 7 | © Copyright 8/16/23 Zilliz Retrieval Augmented Generation RAG Expand LLMs' knowledge by incorporating external data sources into LLMs and your AI applications. Match user behavior or content features with other similar ones to make effective recommendations. Recommender System Search for semantically similar texts across vast amounts of natural language documents. Text/ Semantic Search Image Similarity Search Identify and search for visually similar images or objects from a vast collection of image libraries. Video Similarity Search Search for similar videos, scenes, or objects from extensive collections of video libraries. Audio Similarity Search Find similar audios in large datasets for tasks like genre classification or speech recognition Molecular Similarity Search Search for similar substructures, superstructures, and other structures for a specific molecule. Anomaly Detection Detect data points, events, and observations that deviate significantly from the usual pattern Multimodal Similarity Search Search over multiple types of data simultaneously, e.g. text and images Common AI Use Cases
  • 8.
    8 | ©Copyright 8/16/23 Zilliz 8 | © Copyright 8/16/23 Zilliz Use Case: Drug Discovery Vectors: 12 Billion Reqʼts: High Recall Index: BIN_FLAT Use Case: Data Search Vectors: 2 Billion Reqʼts: 200 ms, Cost mgmt Index: DiskANN for cost savings Use Case: Image Search Vectors: 20 Billion Reqʼts: High Insertion, Cost Index: Disk Based Index Use Case: Recommender System Vectors: 20 Billion Reqʼts: 5,000 QPS Index: HNSW & CAGRA Industry leaders already use vector search in their apps
  • 9.
    9 | ©Copyright 8/16/23 Zilliz 9 | © Copyright 8/16/23 Zilliz Well-connected in the AI infrastructure Framework Hardware Infrastructure Embedding Models LLMs Software Infrastructure Vector Database
  • 10.
    10 | ©Copyright 8/16/23 Zilliz 10 | © Copyright 8/16/23 Zilliz 10 | © Copyright 8/16/23 Zilliz 10 | © Copyright 8/16/23 Zilliz Introduction to Vector DB and Vector Search
  • 11.
    11 | ©Copyright 8/16/23 Zilliz 11 | © Copyright 8/16/23 Zilliz Vectors unlock Unstructured Data
  • 12.
    12 | ©Copyright 8/16/23 Zilliz 12 | © Copyright 8/16/23 Zilliz Vector Space
  • 13.
    13 | ©Copyright 8/16/23 Zilliz 13 | © Copyright 8/16/23 Zilliz Vectors are for more than just text and images
  • 14.
    14 | ©Copyright 8/16/23 Zilliz 14 | © Copyright 8/16/23 Zilliz How Similarity Search Works Vn, 1 … … … 1 2 3 4 5 Transform into Vectors Unstructured Data Images User Generated Content Video Documents Audio Vector Embeddings Perform Approximate Nearest Neighbor Similarity Search Perform Query Get Results Store in Vector Database
  • 15.
    15 | ©Copyright 8/16/23 Zilliz 15 | © Copyright 8/16/23 Zilliz 15 | © Copyright 8/16/23 Zilliz 15 | © Copyright 8/16/23 Zilliz Embedding Models
  • 16.
    16 | ©Copyright 8/16/23 Zilliz 16 | © Copyright 8/16/23 Zilliz Embeddings models workhorses of AI apps
  • 17.
    17 | ©Copyright 8/16/23 Zilliz 17 | © Copyright 8/16/23 Zilliz | © Copyright 8/16/23 Zilliz 17 Please! 🙏 Use Embedding Models trained on Similar Data!
  • 18.
    18 | ©Copyright 8/16/23 Zilliz 18 | © Copyright 8/16/23 Zilliz 18 | © Copyright 8/16/23 Zilliz 18 | © Copyright 8/16/23 Zilliz Multimodal Embeddings
  • 19.
    19 | ©Copyright 8/16/23 Zilliz 19 | © Copyright 8/16/23 Zilliz Visual + language embeddings CLIP-like)
  • 20.
    20 | ©Copyright 8/16/23 Zilliz 20 | © Copyright 8/16/23 Zilliz One embedding space, six modalities ImageBind) Source: Girdhar, et al.
  • 21.
    21 | ©Copyright 8/16/23 Zilliz 21 | © Copyright 8/16/23 Zilliz LLMs are becoming natively multimodal…
  • 22.
    22 | ©Copyright 8/16/23 Zilliz 22 | © Copyright 8/16/23 Zilliz … and the best embedding models are too
  • 23.
    23 | ©Copyright 8/16/23 Zilliz 23 | © Copyright 8/16/23 Zilliz | © Copyright 8/16/23 Zilliz 23 RAG Retrieval Augmented Generation)
  • 24.
    24 | ©Copyright 8/16/23 Zilliz 24 | © Copyright 8/16/23 Zilliz Basic Idea Use RAG to force the LLM to work with your data by injecting it via a vector database like Milvus
  • 25.
    25 | ©Copyright 8/16/23 Zilliz 25 | © Copyright 8/16/23 Zilliz Basic RAG Architecture
  • 26.
    26 | ©Copyright 8/16/23 Zilliz 26 | © Copyright 8/16/23 Zilliz Question + Context Question Vanilla RAG is no longer enough… Gen AI Model Reliable Answers Your Documents Embedding Model Milvus Search What is the default AUTOINDEX distance metric in Milvus Client? The default AUTOINDEX distance metric in Milvus Client is L2.
  • 27.
    27 | ©Copyright 8/16/23 Zilliz 27 | © Copyright 8/16/23 Zilliz Question + Context Question … we need multimodal RAG Pixtral Reliable Answers Multimodal Embeddings Milvus Search What kind of music did they play in the pre-show? The musician played improvised electronic music.
  • 28.
    28 | ©Copyright 8/16/23 Zilliz 28 | © Copyright 8/16/23 Zilliz | © Copyright 8/16/23 Zilliz 28 Building a Self-Hosted Multimodal RAG System Using Milvus and vLLM
  • 29.
    29 | ©Copyright 8/16/23 Zilliz 29 | © Copyright 8/16/23 Zilliz ● "We are deprecating your model!" ● Escape the Algorithm Garden - Customization, and freedom to choose the best model for your specific needs ● Own your AI destiny: Iterate without external dependencies. Why Self Host?
  • 30.
    30 | ©Copyright 8/16/23 Zilliz 30 | © Copyright 8/16/23 Zilliz | © Copyright 8/16/23 Zilliz 30
  • 31.
    31 | ©Copyright 8/16/23 Zilliz 31 | © Copyright 8/16/23 Zilliz Self-Hosted Multimodal RAG ● Processes multiple data types (text, images, audio, video) ● Runs completely under your control ● Uses open-source ● Scales efficiently
  • 32.
    32 | ©Copyright 8/16/23 Zilliz 32 | © Copyright 8/16/23 Zilliz ● Milvus: Vector DB ● vLLM: Inference and serving ● Koyeb: Infrastructure Layer ● Pixtral: Multimodal model 400M vision encoder + 12B decoder) Tech Stack
  • 33.
    33 | ©Copyright 8/16/23 Zilliz 33 | © Copyright 8/16/23 Zilliz Why vLLM? Wide range of model support ● 40+ model architectures including vision language models ● Collaborating with model vendors Diverse hardware support ● NVIDIA, AMD, Intel GPUs ● Intel/AMD CPU ● Inferentia, TPU, Gaudi End-to-end inference optimizations ● Paged Attention ● Speculative decoding ● Quantization GPTQ, AWQ, FP8 ● Automatic prefix caching
  • 34.
    34 | ©Copyright 8/16/23 Zilliz 34 | © Copyright 8/16/23 Zilliz Computing Attention without Cache
  • 35.
    35 | ©Copyright 8/16/23 Zilliz 35 | © Copyright 8/16/23 Zilliz Computing Attention with KV Cache
  • 36.
    36 | ©Copyright 8/16/23 Zilliz 36 | © Copyright 8/16/23 Zilliz ● Autoscaling 🚀 ● Scale to Zero 💲 ● Build and Deploy almost everything 🛠 ● Distributed Globally 🌍 Koyeb - Serverless AI Infrastructure
  • 37.
    37 | ©Copyright 8/16/23 Zilliz 37 | © Copyright 8/16/23 Zilliz ● Natively multimodal ● Strong performance on multimodal tasks, excels in instruction following ● Architecture: ○ 400M parameter vision encoder trained from scratch ○ 12B parameter multimodal decoder based on Mistral Nemo ○ Supports variable image sizes and aspect ratios Pixtral from Mistral AI
  • 38.
    38 | ©Copyright 8/16/23 Zilliz 38 | © Copyright 8/16/23 Zilliz Pixtral Architecture
  • 39.
    39 | ©Copyright 8/16/23 Zilliz 39 | © Copyright 8/16/23 Zilliz
  • 40.
    40 | ©Copyright 8/16/23 Zilliz 40 | © Copyright 8/16/23 Zilliz Multimodal Architecture
  • 41.
    41 | ©Copyright 8/16/23 Zilliz 41 | © Copyright 8/16/23 Zilliz Storage: ● Milvus for different modalities ● Efficient indexing and retrieval Query Processing: ● Context retrieval from vector store ● Multimodal understanding with Pixtral What is it doing? Video Processing: ● Frame extraction 0.2 FPS ● Audio transcription Whisper) ● Metadata extraction Embeddings: ● Images: OpenAI CLIP ● Text: Mistral Embedding model
  • 42.
    42 | ©Copyright 8/16/23 Zilliz 42 | © Copyright 8/16/23 Zilliz Complete Control ● No unexpected API changes ● Full visibility into the system ● Customizable components Privacy & Security ● Data stays in your infrastructure ● No external API dependencies Scalability ● Horizontal scaling with Milvus ● Efficient resource use with vLLM ● Flexible deployment options Benefits
  • 43.
    43 | ©Copyright 8/16/23 Zilliz 43 | © Copyright 8/16/23 Zilliz | © Copyright 8/16/23 Zilliz 43 Demo!
  • 44.
    44 | ©Copyright 8/16/23 Zilliz 44 | © Copyright 8/16/23 Zilliz milvus.io github.com/milvus-io/ @milvusio @stephenbtl /in/stephen-batifol Thank you
  • 45.
    45 | ©Copyright 8/16/23 Zilliz 45 | © Copyright 8/16/23 Zilliz