How to Optimize Your Embedding Model: Selection and Development Through TDA
January 2025
Who we are
Gunnar Carlsson, PhD
Co-Founder and CTO
Pioneer in Topological Data Analysis
(TDA) and a Professor of Mathematics
(emeritus) at Stanford University.
gunnar.carlsson@bluelightai.com
Gabriel Alon
Senior Data Scientist
Developed BluelightAI’s clustering
method for evaluating, comparing,
and improving models for retrieval
tasks.
gabriel.alon@bluelightai.com
We believe that by giving
teams more visibility and
control over what drives
their models, we can all
achieve better outcomes.
Our core technology is
TDA, spawned from
groundbreaking research
by members of our team.
Agenda
● The Problem: How to evaluate and improve your AI using vector embeddings
● The Solution: Navigable clustering with Topological Data Analysis
● Case Study: ML Lifecycle and Case Studies in E-commerce
The Problem
Which embedding model is best for you?
Source: Zilliz: “What is a Vector Database” Article
● Public Embedding Model Leaderboards (MTEB)?
● Hugging Face Downloads Count?
● Excitement on Twitter/X or LinkedIn around a model?
● Latest exciting research paper?
What could go wrong choosing
an embedding model?
● A machine learning model, usually an embedding model, transforms many different
kinds of unstructured data into vector embeddings.
● Vector embeddings are stored in Zilliz Cloud.
● Zilliz provides the key capabilities for operating on the embeddings efficiently.
Crucial Choice in Every Vector
Database Deployment: Selecting
an Embedding Model
Source: Zilliz: “What is a Vector Database” Article
Why Evaluate Embedding Models
on Your Own Data?
● Train-test mismatch is a universal problem in machine learning
● No guarantees an outside model will work on your custom data set
● Overfitting is a universal problem on public benchmarks, since the
data is public and teams compete for leaderboard positions
Benefits of Doing an Evaluation
● Sanity check the performance of your model
● Prevent costly mistakes (you can lose users)
● Confidently choose or develop the best performing model for your use case
Current Evaluation Approaches
Performance Based Embedding Model Selection:
● Select the model that is best on average on your whole dataset
● It’s advised to look at individual queries to sanity-check behavior
Performance Evaluation
Status Quo
The average performance
for each metric doesn’t tell you
much about specific queries!
Evaluating A Single Query
● Query Example: “Television Stands”
● Ground Truth Retrieved “Files”:
[Mobile TV Cart, Universal TV Stand, TV Stand (2 feet)]
● Predicted Retrieved “Files”:
Model A: [Mobile TV Cart, Universal TV Stand, Black Television]
Model B: [Mobile TV Cart, Universal TV Stand, TV Stand (2 feet)]
An evaluation score like recall will be higher for Model B
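As a concrete illustration, here is a minimal plain-Python sketch of the recall comparison above (item names taken from the slide; not the case-study code):

```python
# Minimal sketch: recall for a single query, using the example above.
ground_truth = {"Mobile TV Cart", "Universal TV Stand", "TV Stand (2 feet)"}

predictions = {
    "Model A": ["Mobile TV Cart", "Universal TV Stand", "Black Television"],
    "Model B": ["Mobile TV Cart", "Universal TV Stand", "TV Stand (2 feet)"],
}

def recall(retrieved, relevant):
    """Fraction of relevant items that appear in the retrieved list."""
    return len(set(retrieved) & relevant) / len(relevant)

for model, retrieved in predictions.items():
    print(model, recall(retrieved, ground_truth))
# Model A ~0.67, Model B 1.0 -- Model B scores higher, as claimed.
```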
Limitations of Current
Evaluation Approaches:
● Risky
Taking an average over your whole dataset doesn’t tell you
where your model is performing poorly.
● Not Scalable
Looking at individual queries one at a time to understand
and improve model performance isn’t scalable.
Why Navigable TDA Clustering?
● Understand the performance breakdown of any embedding model on your own data set
● Compare models on the clusters from your own data
● Improve or deploy your models with more precision
Our Solution
Clustering in Vector Databases
● Vector databases encode information about many things, including documents and
customer behaviors.
● In analyzing model behavior and failures, it is very useful to cluster the points in the
embedding space
● Clusters correspond to classes of queries or groups of customers
● Clustering permits the identification of systematic groups of failures, depending either on the
type of query or on the customer.
● Without clustering, one evaluates point by point, or averages over the whole data set
Clustering
● Divides data into groups
● There are many different methods, none of which is best in every case:
● K-means, single linkage, DBSCAN, spectral clustering, UMAP and t-SNE
● We need navigable clustering: an easy way to move between
choices of clustering hyperparameters (see the sketch below).
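To illustrate why navigating hyperparameters matters, here is a rough scikit-learn sketch (ordinary K-means, not BlueLightAI's method) showing how the same embeddings group differently as k varies:

```python
# Sketch: the same embeddings cluster very differently as hyperparameters vary.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 32))  # stand-in for real embedding vectors

for k in (2, 4, 8, 16):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    print(k, silhouette_score(embeddings, labels))
# Navigable clustering aims to make moving between such views easy,
# rather than committing to a single k up front.
```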
Clustering on Fine Food Reviews
https://github.com/openai/openai-cookbook/blob/main/examples/Get_embeddings_from_dataset.ipynb
https://cookbook.openai.com/examples/clustering
(Figure: OpenAI cookbook clustering, K-means + t-SNE with K=4, alongside BluelightAI Cobalt.)
Navigable TDA preserves more structure in the data
Clustering on Fine Food Reviews
(Figure: t-SNE and UMAP projections of the same data.)
Clustering
● Suppose our data consists of locations within the United States, parametrized by lat/long
● The natural clustering would divide the U.S. into four groups: points in the mainland, points in
Alaska, points in Hawaii, and points in Puerto Rico
Clustering
Resolution
● This isn’t quite right: the Aleutians or Guam might also form clusters, and Hawaii consists of
several islands
Clustering
How important are particular clusters?
● There are other questions beyond resolution that come up.
● For example, deciding what groupings are large or significant enough to include is important
● For example, do we include Bird Rock off the Northern California Coast?
Clustering
Using properties other than lat/long
● The points in the U.S. are still the underlying set, but suppose that we want
to understand them only from the point of view of political preferences
● Then the clustering would contain two clusters, one red, one blue
Clustering
Using properties other than lat/long
● We might cluster not just by the two groups, but by regions with affinity to each other:
● Hawaii, the West Coast, blue Mountain states, the industrial upper Midwest, the upper East Coast
Clustering
Optimization questions
● Often when we are doing optimization, we find that there is more than one local maximum or minimum.
● We often want to understand that landscape, not just the absolute optimum.
● Local optima can be very important.
● Imagine we have a map of the US again, this time with information about occurrences of a
particular disease, together with their locations.
● Local optima are “hot spots”, and each one is important to understand.
(Figure: disease-rate hot spots in Michigan, Mississippi, Maine/Vermont, and PA/WV.)
Clustering
Optimization questions
● The hot spots are in this case Michigan, Vermont, Maine, Pennsylvania/West Virginia,
and Mississippi
● There are likely different reasons for these various “hot spots.” For example, the
presence of heavy industry in Michigan might explain its high value, but there would
likely be other explanations in Maine and Mississippi.
● To take action, we need to know this. We don’t simply want to find the one state
with the highest rate; we need to understand all of these hot spots.
Navigable Clustering
What do we need to create this kind of clustering?
● We need some kind of map describing the underlying data, analogous to the lat/long map for the US.
● We then use that map to produce heat maps for values of interest, or clusterings based on it.
● This is what we do at BlueLightAI, for all kinds of data. We produce an appropriate map, even for
unstructured data, and let you construct heat maps or clusterings.
● We call it navigable clustering because you can vary the map in various ways, including resolution.
This allows you to adapt your clustering to the problem at hand (see the sketch below).
● Navigable clustering is part of topological data analysis (TDA), developed with DARPA and National
Science Foundation support at Stanford.
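BlueLightAI's actual graph construction is not shown here; as a loose, Mapper-flavored sketch of the "map plus resolution knob" idea, one can build a k-nearest-neighbor graph over the embeddings and read off connected components at different values of k:

```python
# Loose sketch of resolution-adjustable clustering via a kNN graph.
# This is NOT BlueLightAI's algorithm; it only illustrates the "map + resolution" idea.
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 32))  # stand-in for real embedding vectors

for k in (3, 5, 10):  # k plays the role of a resolution knob
    graph = kneighbors_graph(embeddings, n_neighbors=k, include_self=False)
    n_clusters, labels = connected_components(graph, directed=False)
    print(f"k={k}: {n_clusters} clusters")
```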
TDA Graph Neighbors as a Table
Case Study
Case Study Data
A 100k subset of the Marqo-GS-10M dataset of 10 million queries and products from Google Shopping.
● Query Example: “Television Stands”
● Ground Truth are “Products” we want to retrieve from the Vector Database
[Mobile TV Cart, Universal TV Stand, TV Stand (2 feet)]
● The performance metrics reward predicting both the presence and ranking of correct
products
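Since NDCG rewards both presence and ranking, a tiny worked example (toy binary relevance labels, not case-study data) may help:

```python
# Sketch: DCG/NDCG by hand for one query (binary relevance, toy ranking).
import math

def dcg(relevances):
    """Discounted cumulative gain over a ranked list of relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

predicted = [1, 0, 1]   # relevant, irrelevant, relevant (in predicted rank order)
ideal     = [1, 1, 0]   # the best possible ordering of the same labels

print(dcg(predicted) / dcg(ideal))  # NDCG < 1: correct items, imperfect ranking
```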
Evaluating A Single Query
● Query Example: “Television Stands”
● Ground Truth Retrieved “Files”:
[Mobile TV Cart, Universal TV Stand, TV Stand (2 feet)]
● Predicted Retrieved “Files”:
Model A: [Mobile TV Cart, Universal TV Stand, Black Television]
Model B: [Mobile TV Cart, Universal TV Stand, TV Stand (2 feet)]
An evaluation score like recall will be higher for Model B
Why Evaluate with Navigable
Clustering?
Navigable clustering helps you identify clusters with
performance rates well below the average of 0.34!
(Callout: extremely low scores, on a 0-1 scale!)
Navigable Clustering intelligently
illuminates problems in the model!
a. Queries to the vector database are clustered & summarized with
keywords in the “name” column
b. The keywords come from your own data and are selected for distinctness
Navigable Clustering Output: pandas DataFrame
a. Navigation by sorting or filtering on a column is easy
b. Here we sorted the table to find groups of queries well below
the average performance of 0.34 for this E5 model (NDCG score), as in the sketch below
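A minimal pandas sketch of that navigation step; the column names and numbers here are hypothetical stand-ins, not actual Cobalt output:

```python
# Sketch: find clusters of queries scoring well below the dataset average.
import pandas as pd

clusters = pd.DataFrame({
    "name": ["tv stands", "espresso machine", "running shoes", "desk lamps"],
    "size": [120, 45, 210, 80],          # number of queries in the cluster
    "ndcg": [0.31, 0.04, 0.52, 0.11],    # average NDCG per cluster
})

dataset_average = 0.34
problem_clusters = (
    clusters[clusters["ndcg"] < 0.5 * dataset_average]
    .sort_values("ndcg")
)
print(problem_clusters)  # the clusters most worth investigating first
```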
Question: What if you trusted a popular model
and didn’t do an evaluation on your dataset?
Many types of e-commerce queries can perform poorly even on models like E5!
Performance is on a scale of 0 to 1 for this NDCG metric.
Average performance of this model on the dataset was 0.34.
Navigating to smaller clusters revealed performance rates close to zero!
Performance is on a scale of 0 to 1 for this NDCG metric. Average performance of this model on the dataset was 0.34.
Navigable Clustering
Machine Learning Lifecycle
Model Comparisons
Model Comparisons
● E5 was better on average than SBERT (0.34 vs. 0.26 NDCG)
● Yet many clusters of queries performed better with SBERT! (see the sketch below)
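A hedged pandas sketch of such a per-cluster comparison (cluster names and scores are illustrative, not the case-study results):

```python
# Sketch: per-cluster comparison of two models (hypothetical numbers).
import pandas as pd

scores = pd.DataFrame({
    "cluster": ["tv stands", "espresso machine", "running shoes"],
    "e5_ndcg": [0.31, 0.04, 0.52],
    "sbert_ndcg": [0.22, 0.19, 0.41],
})

scores["delta"] = scores["e5_ndcg"] - scores["sbert_ndcg"]
# Negative delta: SBERT wins on this cluster even though E5 wins on average.
print(scores.sort_values("delta"))
```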
Model Comparisons
● Imagine evaluating a switch from E5 large to E5 small to save storage costs
● What clusters of queries represent the greatest performance sacrifices for your application?
(Illustrative figure: per-cluster scores for E5 large vs. E5 small.)
Machine Learning Lifecycle
Fine Tuning
Monitoring Fine-Tuning
The E5 model averaged a score of 0.35 (NDCG); after fine-tuning, it averaged 0.45.
But surprise! Many groups of queries perform worse on the fine-tuned E5 model!
(Illustrative figure: cluster scores at epoch 1 vs. epoch 14.)
Time for a Fine-Tuning Intervention?
Machine Learning Lifecycle
Post Deployment
Post-Deployment: E-commerce Case Study
● The E5 model is bad at espresso-machine-related queries and other clusters:
- Weigh the risk of heavily promoting these products through marketing
- Consider using a simpler alternative to an embedding for queries in this cluster
- If it’s a chatbot application, consider routing to a human in the loop (see the sketch below)
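A minimal sketch of what such routing might look like; the cluster names and the assignment rule are hypothetical placeholders, not production logic:

```python
# Sketch: route queries from known-bad clusters away from the embedding model.
# Cluster names and the assignment rule below are hypothetical placeholders.
low_performing_clusters = {"espresso machine", "desk lamps"}

def assign_cluster(query: str) -> str:
    # Stand-in: real code would assign the nearest cluster from the TDA output.
    return "espresso machine" if "espresso" in query.lower() else "other"

def handle_query(query: str) -> str:
    if assign_cluster(query) in low_performing_clusters:
        return f"fallback (keyword search / human review): {query}"
    return f"default vector-database retrieval: {query}"

print(handle_query("best espresso machine under $200"))
print(handle_query("television stands"))
```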
Broader Workflows
and Preparing Data
Workflow to Choose the Best Embedding Model
1. Public Embedding Model Leaderboards:
MTEB helps with identifying relevant models that meet your constraints,
e.g. size, speed, cost, performance on public datasets
2. Evaluate Average Performance on Your Own Data
3. Navigable TDA Clustering!
● Identify high and low model performance scenarios on your own dataset
● Easily compare models, fine-tune, or deploy your models with more precision
How does Navigable TDA Clustering work with
Milvus/Zilliz?
Do you have model evaluation data computed?
Yes: simply pass it to Cobalt as a pandas DataFrame
Not yet:
Packages like BEIR and pytrec_eval can help with evaluating models on your data (see the
sketch below); we have example notebooks available
Without ground-truth data, you can still do TDA clustering
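For instance, a minimal pytrec_eval sketch (toy qrels and run, not our case-study data) that yields one score per query:

```python
# Sketch: one NDCG and MRR score per query with pytrec_eval (toy data).
import pytrec_eval

qrels = {  # ground-truth relevance labels per query
    "television stands": {
        "mobile_tv_cart": 1, "universal_tv_stand": 1, "black_television": 0,
    },
}
run = {    # model retrieval scores per query
    "television stands": {
        "mobile_tv_cart": 0.9, "black_television": 0.8, "universal_tv_stand": 0.3,
    },
}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg", "recip_rank"})
print(evaluator.evaluate(run))
# -> {"television stands": {"ndcg": ..., "recip_rank": ...}}
```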
Example Supported Performance Metrics
● Annotated ground truth: Precision, Recall, NDCG, MRR
● Live user behavior data: click-through rate, purchase rate
For each vector database query, it is natural to compute a performance metric
using predictions and ground truth!
What Data is Needed to Evaluate a Retrieval Model?
● Standard evaluations produce at least one performance score/metric for each query
● A pandas DataFrame containing queries and performance scores (see below) is
ready for Navigable TDA
(Figure: example metrics.)
Input to BluelightAI Cobalt API (single model): at least one score per query is needed.
Output: each row is a cluster of queries, with the average score per cluster
(calculated using the per-query score table above).
Input to BluelightAI Cobalt API, repeated for each model.
Output: each row is a cluster of queries, with the average scores per model!
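Under the hood, this input/output shape is just a per-query score table rolled up to per-cluster averages; a pandas sketch with hypothetical columns (not the Cobalt API itself):

```python
# Sketch: roll per-query scores up to per-cluster averages for each model.
import pandas as pd

per_query = pd.DataFrame({
    "query":   ["tv stand", "tv mount", "espresso pods", "espresso cup"],
    "cluster": ["tv", "tv", "espresso", "espresso"],
    "e5":      [0.8, 0.6, 0.1, 0.0],
    "sbert":   [0.5, 0.5, 0.3, 0.2],
})

per_cluster = per_query.groupby("cluster")[["e5", "sbert"]].mean()
print(per_cluster)  # one row per cluster, average score per model
```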
What if you don’t have evaluation scores?
Clustering can still help illuminate patterns of queries in your dataset
- Public Benchmarks like MTEB have evaluation scores
(though the data won’t be your own dataset, there are some clues in those clusters!)
Average performance was 0.34 on this dataset
Navigable Clustering revealed critical and actionable performance problems!
Concluding Thoughts
Why Navigable TDA Clustering with
BluelightAI Cobalt?
● Understand the performance breakdown of any embedding model on
your own dataset
● Compare models on the clusters from your data
● Improve your models or deploy your models with more precision
Resources
● GitHub: https://github.com/BlueLightAI
● Slack: https://bluelightai.com.slack.com/ssb/
● Documentation and Example Notebooks:
docs.cobalt.bluelightai.com/examples.html
Thank you!
gunnar.carlsson@bluelightai.com
gabriel.alon@bluelightai.com
