Deploying a Multimodal RAG System Using Open Source Milvus, LlamaIndex, and vLLM

1 | © Copyright 8/16/23 Zilliz
Stephen Batifol | Zilliz
Webinar
Multimodal RAG using vLLM
and Pixtral

Stephen Batifol
Developer Advocate, Zilliz / Milvus
About Me
stephen.batifol@zilliz.com
linkedin.com/in/stephen-batifol/
@stephenbtl

| © Copyright 8/16/23 Zilliz
3
Milvus is an Open-Source Vector Database to
store, index, manage, and use the massive
number of embedding vectors generated by
deep neural networks and LLMs.
contributors
283
stars docker pulls
68M
forks
3.0K
+
33K
Milvus: The most widely-adopted vector database

● pip install on your laptop
● Plug into your favorite AI dev tools
● Push to production with a single line of code
Easy to start

Bulk Import GPU, Intel & ARM
CPU support
Disk Based
Index
Tiered Storage
Million+ level
tenant support
Hybrid Search
Dense & Sparse
RBAC, TLS,
Encryption
Float, Binary, &
Sparse Vector
Tag+Vector
Optimized Filtering
Dynamic Schema
Feature Rich

Milvus Lite Milvus Standalone Milvus Distributed
● Ideal for prototyping,
small scale
experiments.
● Easy to set up and
use, pip install
pymilvus
● Scale to ≈1M vectors
● Run on K8s
● Load balancer and
Multi-Node
Management
● Scaling of each
component
independently
● Scale to 100B
vectors
● Single-Node
Deployment
● Bundled in a single
Docker Image
● Supports Primary/
Secondary
● Scale up to 100M
vectors
Ready to scale 🚀
Write your code once, and run it everywhere, at scale!
● API and SDK are the same

Retrieval Augmented
Generation RAG
Expand LLMs' knowledge by
incorporating external data sources
into LLMs and your AI applications.
Match user behavior or content
features with other similar ones to
make effective recommendations.
Recommender System
Search for semantically similar
texts across vast amounts of
natural language documents.
Text/ Semantic Search
Image Similarity Search
Identify and search for visually
similar images or objects from a
vast collection of image libraries.
Video Similarity Search
Search for similar videos, scenes,
or objects from extensive
collections of video libraries.
Audio Similarity Search
Find similar audios in large datasets
for tasks like genre classification or
speech recognition
Molecular Similarity Search
Search for similar substructures,
superstructures, and other
structures for a specific molecule.
Anomaly Detection
Detect data points, events, and
observations that deviate
significantly from the usual pattern
Multimodal Similarity Search
Search over multiple types of data
simultaneously, e.g. text and
images
Common AI Use Cases

Use Case: Drug Discovery
Vectors: 12 Billion
Reqʼts: High Recall
Index: BIN_FLAT
Use Case: Data Search
Vectors: 2 Billion
Reqʼts: 200 ms, Cost mgmt
Index: DiskANN for cost savings
Use Case: Image Search
Vectors: 20 Billion
Reqʼts: High Insertion, Cost
Index: Disk Based Index
Use Case: Recommender System
Vectors: 20 Billion
Reqʼts: 5,000 QPS
Index: HNSW & CAGRA
Industry leaders already use vector search in their
apps

Well-connected in the AI infrastructure
Framework
Hardware
Infrastructure
Embedding Models LLMs
Software Infrastructure
Vector Database

Introduction to Vector DB
and Vector Search

Vectors unlock Unstructured Data

Vector Space

Vectors are for more than just text and images

How Similarity Search Works
Vn, 1
…
…
…
1
2
3
4
5
Transform into
Vectors
Unstructured Data
Images
User Generated
Content
Video
Documents
Audio
Vector Embeddings
Perform Approximate
Nearest Neighbor
Similarity Search
Perform Query
Get Results
Store in Vector Database

Embedding Models

Embeddings models workhorses of AI apps

17
Please! 🙏
Use Embedding Models
trained on Similar Data!

Multimodal Embeddings

Visual + language embeddings CLIP-like)

One embedding space, six modalities ImageBind)
Source: Girdhar, et al.

LLMs are becoming natively multimodal…

… and the best embedding models are too

23
RAG
Retrieval Augmented Generation)

Basic Idea
Use RAG to force the LLM to work with your data
by injecting it via a vector database like Milvus

Basic RAG Architecture

Question + Context
Question
Vanilla RAG is no longer enough…
Gen AI Model
Reliable Answers
Your
Documents
Embedding Model
Milvus
Search
What is the default
AUTOINDEX distance
metric in Milvus
Client?
The default
AUTOINDEX distance
metric in Milvus
Client is L2.

Question + Context
Question
… we need multimodal RAG
Pixtral
Reliable Answers
Multimodal Embeddings
Milvus
Search
What kind of music
did they play in the
pre-show?
The musician played
improvised electronic
music.

28
Building a Self-Hosted Multimodal
RAG System
Using Milvus and vLLM

● "We are deprecating your model!"
● Escape the Algorithm Garden - Customization, and
freedom to choose the best model for your specific
needs
● Own your AI destiny: Iterate without external
dependencies.
Why Self Host?

30

Self-Hosted Multimodal RAG
● Processes multiple data types (text, images, audio, video)
● Runs completely under your control
● Uses open-source
● Scales efficiently

● Milvus: Vector DB
● vLLM: Inference and serving
● Koyeb: Infrastructure Layer
● Pixtral: Multimodal model 400M vision
encoder + 12B decoder)
Tech Stack

Why vLLM?
Wide range of model support
● 40+ model architectures including
vision language models
● Collaborating with model vendors
Diverse hardware support
● NVIDIA, AMD, Intel GPUs
● Intel/AMD CPU
● Inferentia, TPU, Gaudi
End-to-end inference optimizations
● Paged Attention
● Speculative decoding
● Quantization GPTQ, AWQ, FP8
● Automatic prefix caching

Computing Attention without Cache

Computing Attention with KV Cache

● Autoscaling 🚀
● Scale to Zero 💲
● Build and Deploy almost everything 🛠
● Distributed Globally 🌍
Koyeb - Serverless AI Infrastructure

● Natively multimodal
● Strong performance on multimodal tasks, excels
in instruction following
● Architecture:
○ 400M parameter vision encoder trained from scratch
○ 12B parameter multimodal decoder based on Mistral
Nemo
○ Supports variable image sizes and aspect ratios
Pixtral from Mistral AI

Pixtral Architecture

Multimodal Architecture

Storage:
● Milvus for different modalities
● Efficient indexing and retrieval
Query Processing:
● Context retrieval from vector store
● Multimodal understanding with
Pixtral
What is it doing?
Video Processing:
● Frame extraction 0.2 FPS
● Audio transcription Whisper)
● Metadata extraction
Embeddings:
● Images: OpenAI CLIP
● Text: Mistral Embedding model

Complete Control
● No unexpected API changes
● Full visibility into the system
● Customizable components
Privacy & Security
● Data stays in your infrastructure
● No external API dependencies
Scalability
● Horizontal scaling with Milvus
● Efficient resource use with vLLM
● Flexible deployment options
Benefits

43
Demo!

milvus.io
github.com/milvus-io/
@milvusio
@stephenbtl
/in/stephen-batifol
Thank you

Deploying a Multimodal RAG System Using Open Source Milvus, LlamaIndex, and vLLM

More Related Content

What's hot

Similar to Deploying a Multimodal RAG System Using Open Source Milvus, LlamaIndex, and vLLM

More from Zilliz

Recently uploaded

Deploying a Multimodal RAG System Using Open Source Milvus, LlamaIndex, and vLLM