1 | © Copyright 2024 Zilliz
Unstructured Data and LLM: What, Why and How
Tim Spann @ Zilliz
Who has heard of Vector Databases?
Who has used a vector database?
Who has heard of Milvus?
01 Introduction
Tim Spann
Principal Developer
Advocate, Zilliz
tim.spann@zilliz.com
https://www.linkedin.com/in/timothyspann/
https://x.com/PaaSDev
The challenge of Unstructured Data
● Problem: Unstructured data comes in many forms, with no easy
way to interact with all of it
● Solution: Vector embeddings
● How: Neural networks, e.g. embedding models
Vector
Databases
Show Me
02 What?
Unstructured Data is Everywhere
Unstructured data is any data that does not conform
to a predefined data model.
Currently, 90% of unstructured data is never
analyzed.
Text, Images, Videos and more!
90% of newly generated data in 2025 will be unstructured data (10% other)
Data Source: The Digitization of the World by IDC
Why is Semantic Search so important?
Semantic Search vs Lexical Search
[Figure: example queries "Apple", "Rising dough", "Change car tire"; "Rising Dough" compared with "Proofing Bread", marked ✔ / ❌]
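The "Rising dough" / "Proofing Bread" pairing from the slide can be sketched in a few lines. This is a minimal illustration, not output from a real embedding model: the three-dimensional vectors in `TOY_VECTORS` are invented by hand (axes loosely meaning baking, fruit, cars), and `lexical_score` is a hypothetical word-overlap stand-in for keyword search.

```python
import math

# Documents taken from the slide's examples.
DOCS = ["Proofing Bread", "Apple", "Change car tire"]

# Hand-made 3-d "embeddings" (baking, fruit, cars). These numbers are
# invented for illustration; a real system would use an embedding model.
TOY_VECTORS = {
    "Rising dough": [0.9, 0.1, 0.0],
    "Proofing Bread": [0.8, 0.2, 0.0],
    "Apple": [0.1, 0.9, 0.0],
    "Change car tire": [0.0, 0.0, 1.0],
}


def lexical_score(query, doc):
    """Count the words shared by query and document (case-insensitive)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def semantic_best(query):
    """Return the document whose toy vector is closest to the query's."""
    return max(DOCS, key=lambda d: cosine(TOY_VECTORS[query], TOY_VECTORS[d]))


query = "Rising dough"
lexical_hits = [d for d in DOCS if lexical_score(query, d) > 0]
best = semantic_best(query)
print("lexical hits:", lexical_hits)
print("semantic best match:", best)
```

Keyword overlap finds nothing because the query and "Proofing Bread" share no words, while the vector comparison ranks it first because the two toy vectors point in nearly the same direction.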
How Similarity Search Works
[Figure: Unstructured Data (Images, User Generated Content, Video, Documents, Audio) → Transform into Vectors → Vector Embeddings → Store in Vector Database → Perform Query → Perform Approximate Nearest Neighbor Similarity Search → Get Results]
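To make "approximate nearest neighbor" concrete, here is a small sketch of one well-known ANN idea, random-hyperplane hashing: vectors are bucketed by which side of a few random hyperplanes they fall on, and a query only scans its own bucket instead of every vector. The class name `HyperplaneIndex`, the random data, and the parameters are assumptions for illustration; production indexes (HNSW, IVF and so on) are considerably more sophisticated.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class HyperplaneIndex:
    """Toy ANN index: vectors on the same side of every random hyperplane share a bucket."""

    def __init__(self, dim, n_planes=4, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets = {}

    def _signature(self, vec):
        # One boolean per hyperplane: which side of the plane the vector lies on.
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in self.planes)

    def add(self, key, vec):
        self.buckets.setdefault(self._signature(vec), []).append((key, vec))

    def search(self, query, k=3):
        # Only the query's bucket is scanned, so a true neighbor in another
        # bucket can be missed. That is the "approximate" part.
        candidates = self.buckets.get(self._signature(query), [])
        ranked = sorted(candidates, key=lambda item: cosine(query, item[1]), reverse=True)
        return [key for key, _ in ranked[:k]]


# Random stand-ins for embeddings; a real pipeline would embed images, text, audio...
rng = random.Random(42)
data = {f"item-{i}": [rng.gauss(0, 1) for _ in range(8)] for i in range(200)}

index = HyperplaneIndex(dim=8)
for key, vec in data.items():
    index.add(key, vec)

results = index.search(data["item-17"], k=3)
print(results)
```

The trade-off shown here is the general one: fewer candidates are compared per query in exchange for the possibility of missing some neighbors.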
Similarity Search
Solution: Deep Learning
Embeddings and Vector Spaces
03 Why?
• 50M100M vectors
• PostgreSQL, ElasticSearch, Big
Query, MongoDB, etc with
ANNS plug-ins
Existing Solutions Vector Databases
• Purpose-built for vectors top
support the requirements and
lifecycle of vectors
• Billion+ scale
• CRUD, real-time search,
top-k/range/hybrid search,
multi-modal, mulit-vector query,
distributed
• Semantic Search is core to your
business
ANN Libraries
• FAISS, ANNOY, HNSW
• Supports 1M vectors
• Good for prototyping
Vector Databases are purpose-built to
handle indexing, storing, and
querying
vector data.
Do you really need a Vector Database?
Why?
Retrieval-Augmented Generation (RAG)
A technique that combines the
strengths of retrieval-based and
generative models:
● Improve accuracy and relevance
● Reduce hallucination
● Provide domain-specific
knowledge
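The retrieve-then-generate flow can be sketched as below. Everything here is a simplified assumption: `embed` is a toy bag-of-words stand-in for an embedding model, `KNOWLEDGE_BASE` is a hypothetical three-sentence corpus, and `generate` is a placeholder where a real LLM call would go.

```python
import math
from collections import Counter

# Hypothetical private knowledge base; in practice this lives in a vector database.
KNOWLEDGE_BASE = [
    "Milvus is an open-source vector database.",
    "Growing segments hold the latest data in memory.",
    "Sealed segments are read from object storage.",
]


def embed(text):
    """Toy bag-of-words 'embedding'; stands in for a real embedding model."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, k=1):
    """Retrieval step: return the k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Augmentation step: put the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {question}"


def generate(prompt):
    """Generation step: placeholder only; a real system would send `prompt` to an LLM."""
    return "[model output would go here]"


question = "What is Milvus?"
passages = retrieve(question)
prompt = build_prompt(question, passages)
print(prompt)
print(generate(prompt))
```

The design point is that the model only sees domain knowledge that was retrieved for this specific question, which is where the accuracy and domain-specificity benefits listed above come from.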
RAG: an Economic Perspective
A business model that bridges public
data and private data
● Data sovereignty
● You can't and shouldn't give your
private data to others
Fast & Cost effective
3X faster, 3X
Cheaper
Pluggable Vector Search Lib
Tiered Storage
Scalable & Reliable
Cloud Native,
K8s Native
Scale from 1 to 10B
Storage / compute disaggregation
UNCOMPROMISING DATA
SECURITY
Enterprise Ready
Platform
Battle-Tested: Delivering Reliable
Performance and Enterprise-Grade
Security
AI Powered
Vector Native
Rich functionality for AI
Born for vector data processing
That's why we build Milvus. And it's open source
under the Apache license!
04 How?
Milvus Vector
Database
Milvus is an open-source vector database
for GenAI projects. pip install on your
laptop, plug into popular AI dev tools, and
push to production with a single line of
code.
30K
GitHub Stars
66M
Docker Pulls
400
Contributors
2.7K
Forks
Easy Setup
Pip-install to start coding in a notebook within seconds
Integration
Plug into OpenAI, LangChain, LlamaIndex, and many more
Reusable Code
Write once, and deploy with one line of code into the production
environment
Feature-rich
Dense & sparse embeddings, filtering, reranking and beyond
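As a rough sketch of the "pip install and start coding" path, the snippet below uses the `MilvusClient` interface from the `pymilvus` package with a local file, which selects the embedded Milvus Lite mode. The collection name, the sample texts, and `fake_embedding` (a deterministic random vector, not a real embedding model) are assumptions made up for this example; the import sits inside `run_demo` so the helper functions work even where `pymilvus` is not installed.

```python
import random

DIM = 8  # toy dimensionality; real embedding models typically produce hundreds of dims
TEXTS = ["Proofing bread", "Apple", "Change car tire"]


def fake_embedding(text, dim=DIM):
    """Deterministic pseudo-random vector per text. NOT a real embedding model."""
    rng = random.Random(text)
    return [rng.random() for _ in range(dim)]


def make_rows(texts):
    """Build rows with the default field names: 'id' and 'vector', plus a 'text' field."""
    return [{"id": i, "vector": fake_embedding(t), "text": t} for i, t in enumerate(texts)]


def run_demo(db_path="milvus_demo.db"):
    """Create a collection, insert the rows, and run one vector search."""
    # Imported here so the helpers above work without pymilvus installed.
    from pymilvus import MilvusClient

    client = MilvusClient(db_path)  # a local file path uses Milvus Lite
    client.create_collection(collection_name="demo_collection", dimension=DIM)
    client.insert(collection_name="demo_collection", data=make_rows(TEXTS))
    return client.search(
        collection_name="demo_collection",
        data=[fake_embedding("Proofing bread")],
        limit=2,
        output_fields=["text"],
    )


rows = make_rows(TEXTS)
print(rows[0]["id"], rows[0]["text"], len(rows[0]["vector"]))
```

After `pip install pymilvus`, calling `run_demo()` returns one list of hits per query vector. Pointing `MilvusClient` at a server URI instead of a file path is the usual way to move the same code to a deployed instance.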
New Challenge: Search in Vector Spaces
How to Index and
Search?
● High-dimensional
● > 1000 dims
How to Scale?
● 10-100 million vectors?
● Billions?
● Trillions?
● Billions of users?
Multiple Data Types?
● Text
● Images
● Audio
● Graphs
● …
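A quick back-of-the-envelope calculation shows why scale is a challenge. The sketch assumes 1024-dimensional vectors stored as 32-bit floats (4 bytes per value) and counts raw vector data only; index structures, replicas, and metadata would add to these figures.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats


def raw_vector_bytes(num_vectors, dims, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for the vectors alone (no index, replicas, or metadata)."""
    return num_vectors * dims * bytes_per_value


def to_gib(num_bytes):
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


SCALES = [
    ("10 million", 10_000_000),
    ("100 million", 100_000_000),
    ("1 billion", 1_000_000_000),
]

for label, count in SCALES:
    gib = to_gib(raw_vector_bytes(count, dims=1024))
    print(f"{label:>12} vectors x 1024 dims = {gib:,.1f} GiB")
```

At a billion vectors the raw data alone runs to several terabytes, which is why storage tiering and distributed search come up as soon as the numbers on this slide are taken seriously.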
Milvus 🤝 Open-Source
MinIO
● Stores vectors and indexes
● Enables Milvus' stateless architecture
Kafka / Pulsar
● Handles the data insertion stream
● Internal component communications
● Real-time updates to Milvus
Prometheus / Grafana
● Collects metrics from Milvus
● Provides real-time monitoring dashboards
Kubernetes
● Milvus Operator
● CRDs
Milvus Architecture
Stateless Architecture
Stateless Components
All Milvus components are deployed stateless.
Object Storage
Milvus relies on Object Storage (MinIO, S3, etc) for data
persistence.
Vectors are stored in Object Storage, Metadata is in etcd.
Scaling and Failover
Scaling and failover don't involve traditional data rebalancing.
When new pods are added or existing ones fail, they can
immediately start handling requests by accessing data from the
shared object storage.
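The failover behavior described above can be illustrated with a conceptual sketch. This is not Milvus code: `ObjectStore` is an in-memory dictionary standing in for MinIO or S3, and `StatelessWorker` is a hypothetical component that holds no data of its own.

```python
class ObjectStore:
    """Stand-in for shared object storage such as MinIO or S3 (an in-memory dict here)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]


class StatelessWorker:
    """Keeps no data of its own; every request reads from the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, key):
        return f"{self.name} served {self.store.get(key)}"


store = ObjectStore()
store.put("segment-1", [0.1, 0.2, 0.3])

workers = [StatelessWorker("pod-a", store)]
before = workers[0].handle("segment-1")

# Simulate failover: pod-a disappears and pod-b starts with nothing copied to it.
workers = [StatelessWorker("pod-b", store)]
after = workers[0].handle("segment-1")

print(before)
print(after)
```

Because the replacement worker reads from the same shared store, no data has to be moved or rebalanced before it can serve requests.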
Distributed
Architecture
● Subscribe to the log broker for
real-time querying
● Convert new data into Growing
Segments - temporary in-memory
structures for the latest information.
● Access Sealed Segments from
object storage for comprehensive
searches.
● Perform hybrid searches
combining vector and scalar data for
accurate retrieval.
Query Node: Serving Search Requests
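The query-node responsibilities listed above can be sketched conceptually as follows. This is an illustration under simplifying assumptions, not Milvus's actual implementation: segments are plain Python lists, distance is brute-force Euclidean, and the scalar field `year` is invented for the example.

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_segment(rows, query, k, predicate):
    """Top-k (distance, id) pairs from one segment, after scalar filtering."""
    scored = [(l2(query, r["vector"]), r["id"]) for r in rows if predicate(r)]
    return heapq.nsmallest(k, scored)


def search(growing, sealed_segments, query, k, predicate=lambda r: True):
    """Search the in-memory growing segment and every sealed segment, then merge."""
    partials = search_segment(growing, query, k, predicate)
    for segment in sealed_segments:
        partials += search_segment(segment, query, k, predicate)
    return [row_id for _, row_id in heapq.nsmallest(k, partials)]


# Sealed segments: older data, loaded from object storage in a real deployment.
sealed = [[
    {"id": 1, "vector": [0.0, 0.0], "year": 2023},
    {"id": 2, "vector": [1.0, 1.0], "year": 2024},
]]
# Growing segment: the newest rows, still only in memory.
growing = [{"id": 3, "vector": [0.9, 0.9], "year": 2024}]

unfiltered = search(growing, sealed, query=[0.0, 0.0], k=1)
filtered = search(growing, sealed, query=[0.0, 0.0], k=1,
                  predicate=lambda r: r["year"] == 2024)
print(unfiltered, filtered)
```

The filtered call shows the hybrid idea: the scalar condition removes the closest vector (id 1), so the best remaining match comes from the growing segment.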
Image from Nvidia
Vector Search Overview
Easy Open RAG Stack Highlighted
Framework
Hardware
Infrastructure
Embedding Models LLMs
Software Infrastructure
Vector Database
DATA AND LLM
06 Q & A
RESOURCES
Vector Database Resources
Give Milvus a Star! Chat with me on Discord!
https://github.com/milvus-io/milvus
Unstructured Data Meetup
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics
such as vector databases, LLMs, and managing data at scale. The intended audience of this group
includes roles like machine learning engineers, data scientists, data engineers, software engineers, and
PMs.
This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
https://zilliz.com/learn/generative-ai
This week in Milvus, Towhee, Attu, GPTCache, Gen AI, LLM, Apache NiFi, Apache
Flink, Apache Kafka, ML, AI, Apache Spark,
Apache Iceberg, Python, Java, Vector DB
and Open Source friends.
https://bit.ly/32dAJft
https://github.com/milvus-io/milvus
AIM Weekly by Tim Spann
milvus.io
github.com/milvus-io/
@milvusio
@paasDev
/in/timothyspann
Connect with me!
Thank you!
The Forrester Wave™ Vector
Database Providers, Q3 2024
Zilliz is the right partner for
your Vector Database
needs.
T H A N K Y O U

10-25-2024_BITS_NYC_Unstructured Data and LLM_ What, Why and How
