How to Optimize Your Embedding Model: Selection and Development Through TDA
January 2025
Who we are
Gunnar Carlsson, PhD
Co-Founder and CTO
Pioneer in Topological Data Analysis
(TDA) and a Professor of Mathematics
(emeritus) at Stanford University.
gunnar.carlsson@bluelightai.com
Gabriel Alon
Senior Data Scientist
Developed BluelightAI’s clustering
method for evaluating, comparing,
and improving models for retrieval
tasks.
gabriel.alon@bluelightai.com
We believe that by giving
teams more visibility and
control over what drives
their models, we can all
achieve better outcomes.
Our core technology is
TDA, spawned from
groundbreaking research
by members of our team.
Agenda
● The Problem: How to evaluate and improve your AI using vector embeddings
● The Solution: Navigable clustering with Topological Data Analysis
● Case Study: ML Lifecycle and Case Studies in E-commerce
The Problem
Which embedding model is best for you?
Source: Zilliz: “What is a Vector Database” Article
● Public Embedding Model Leaderboards (MTEB)?
● Hugging Face Downloads Count?
● Excitement on Twitter/X or LinkedIn around a model?
● Latest exciting research paper?
What could go wrong choosing
an embedding model?
● A machine learning model, usually an embedding model, transforms many different
kinds of unstructured data into vector embeddings.
● Vector embeddings are stored in Zilliz Cloud.
● Zilliz provides the key capabilities for operating on the embeddings efficiently.
Crucial Choice in Every Vector
Database Deployment: Selecting
an Embedding Model
Source: Zilliz: “What is a Vector Database” Article
Why Evaluate Embedding Models
on Your Own Data?
● Train-test mismatch is a universal problem in machine learning
● No guarantees an outside model will work on your custom data set
● Overfitting is a universal problem on public benchmarks, since the
data is public and teams compete for leaderboard positions
Benefits of Doing an Evaluation
● Sanity check the performance of your model
● Prevent costly mistakes (you can lose users)
● Confidently choose or develop the best performing model for your use case
Current Evaluation Approaches
Performance Based Embedding Model Selection:
● Select the model that is best on average on your whole dataset
● It’s advised to look at individual queries to sanity-check behavior
Performance Evaluation
Status Quo
The average performance
for each metric doesn’t tell you
much about specific queries!
Evaluating A Single Query
● Query Example: “Television Stands”
● Ground Truth Retrieved “Files”:
[Mobile TV Cart, Universal TV Stand, TV Stand (2 feet)]
● Predicted Retrieved “Files”:
Model A: [Mobile TV Cart, Universal TV Stand, Black Television]
Model B: [Mobile TV Cart, Universal TV Stand, TV Stand (2 feet)]
An evaluation score like recall will be higher for Model B
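As a concrete illustration, here is a minimal plain-Python sketch of the recall comparison above (item names taken from the slide; not the case-study code):

```python
# Minimal sketch: recall for a single query, using the example above.
ground_truth = {"Mobile TV Cart", "Universal TV Stand", "TV Stand (2 feet)"}

predictions = {
    "Model A": ["Mobile TV Cart", "Universal TV Stand", "Black Television"],
    "Model B": ["Mobile TV Cart", "Universal TV Stand", "TV Stand (2 feet)"],
}

def recall(retrieved, relevant):
    """Fraction of relevant items that appear in the retrieved list."""
    return len(set(retrieved) & relevant) / len(relevant)

for model, retrieved in predictions.items():
    print(model, recall(retrieved, ground_truth))
# Model A ~0.67, Model B 1.0 -- Model B scores higher, as claimed.
```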
Limitations of Current
Evaluation Approaches:
● Risky
Taking an average over your whole dataset doesn’t tell you
where your model is performing poorly.
● Not Scalable
Looking at individual queries one at a time to understand
and improve model performance isn’t scalable.
Why Navigable TDA Clustering?
● Understand the performance breakdown of any embedding model on your own data set
● Compare models on the clusters from your own data
● Improve or deploy your models with more precision
Our Solution
Clustering in Vector Databases
● Vector databases encode information about many things, including documents and
customer behaviors.
● In analyzing model behavior and failures, it is very useful to cluster the points in the
embedding space
● Clusters correspond to classes of queries or groups of customers
● Clustering permits the identification of systematic groups of failures, depending either on the
type of query or on the customer.
● Without clustering, one evaluates point by point, or averages over the whole data set
Clustering
● Divides data into groups
● There are many different methods, none of which is best in every case:
● K-means, single linkage, DBSCAN, spectral clustering, UMAP and t-SNE
● We need navigable clustering: an easy way to move between
choices of clustering hyperparameters (see the sketch below).
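To illustrate why navigating hyperparameters matters, here is a rough scikit-learn sketch (ordinary K-means, not BlueLightAI's method) showing how the same embeddings group differently as k varies:

```python
# Sketch: the same embeddings cluster very differently as hyperparameters vary.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 32))  # stand-in for real embedding vectors

for k in (2, 4, 8, 16):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    print(k, silhouette_score(embeddings, labels))
# Navigable clustering aims to make moving between such views easy,
# rather than committing to a single k up front.
```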
Clustering on Fine Food Reviews
https://github.com/openai/openai-cookbook/blob/main/examples/Get_embeddings_from_dataset.ipynb
https://cookbook.openai.com/examples/clustering
(Figure: OpenAI cookbook clustering, K-means + t-SNE with K=4, alongside BluelightAI Cobalt.)
Navigable TDA preserves more structure in the data
Clustering on Fine Food Reviews
(Figure: t-SNE and UMAP projections of the same data.)
Clustering
● Suppose our data consists of locations within the United States, parametrized by lat/long
● The natural clustering would divide the U.S. into four groups: points in the mainland, points in
Alaska, points in Hawaii, and points in Puerto Rico
Clustering
Resolution
● This isn’t quite right: the Aleutians or Guam might also form clusters, and Hawaii consists of
several islands
Clustering
How important are particular clusters?
● There are other questions beyond resolution that come up.
● For example, deciding what groupings are large or significant enough to include is important
● For example, do we include Bird Rock off the Northern California Coast?
Clustering
Using properties other than lat/long
● The points in the U.S. are still the underlying set, but suppose that we want
to understand them only from the point of view of political preferences
● Then the clustering would contain two clusters, one red, one blue
Clustering
Using properties other than lat/long
● We might cluster not just by the two groups, but by regions with affinity to each other:
● Hawaii, the West Coast, blue Mountain states, the industrial upper Midwest, the upper East Coast
Clustering
Optimization questions
● Often when we are doing optimization, we find that there is more than one local maximum or minimum.
● We often want to understand that landscape, not just the absolute optimum.
● Local optima can be very important.
● Imagine we have a map of the US again, this time with information about occurrences of a
particular disease, together with their locations.
● Local optima are “hot spots”, and each one is important to understand.
(Figure: disease-rate hot spots in Michigan, Mississippi, Maine/Vermont, and PA/WV.)
Clustering
Optimization questions
● The hot spots are in this case Michigan, Vermont, Maine, Pennsylvania/West Virginia,
and Mississippi
● There are likely different reasons for these various “hot spots.” For example, the
presence of heavy industry in Michigan might explain its high value, but there would
likely be other explanations in Maine and Mississippi.
● To take action, we need to know this. We don’t simply want to find the one state
with the highest rate; we need to understand all of these hot spots.
Navigable Clustering
What do we need to create this kind of clustering?
● We need some kind of map describing the underlying data, analogous to the lat/long map for the US.
● We then use that map to produce heat maps for values of interest, or clusterings based on it.
● This is what we do at BlueLightAI, for all kinds of data. We produce an appropriate map, even for
unstructured data, and let you construct heat maps or clusterings.
● We call it navigable clustering because you can vary the map in various ways, including resolution.
This allows you to adapt your clustering to the problem at hand (see the sketch below).
● Navigable clustering is part of topological data analysis (TDA), developed with DARPA and National
Science Foundation support at Stanford.
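BlueLightAI's actual graph construction is not shown here; as a loose, Mapper-flavored sketch of the "map plus resolution knob" idea, one can build a k-nearest-neighbor graph over the embeddings and read off connected components at different values of k:

```python
# Loose sketch of resolution-adjustable clustering via a kNN graph.
# This is NOT BlueLightAI's algorithm; it only illustrates the "map + resolution" idea.
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 32))  # stand-in for real embedding vectors

for k in (3, 5, 10):  # k plays the role of a resolution knob
    graph = kneighbors_graph(embeddings, n_neighbors=k, include_self=False)
    n_clusters, labels = connected_components(graph, directed=False)
    print(f"k={k}: {n_clusters} clusters")
```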
TDA Graph Neighbors as a Table
Case Study
Case Study Data
A 100k subset of the Marqo-GS-10M dataset of 10 million queries and products from Google Shopping.
● Query Example: “Television Stands”
● Ground Truth are “Products” we want to retrieve from the Vector Database
[Mobile TV Cart, Universal TV Stand, TV Stand (2 feet)]
● The performance metrics reward predicting both the presence and ranking of correct
products
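Since NDCG rewards both presence and ranking, a tiny worked example (toy binary relevance labels, not case-study data) may help:

```python
# Sketch: DCG/NDCG by hand for one query (binary relevance, toy ranking).
import math

def dcg(relevances):
    """Discounted cumulative gain over a ranked list of relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

predicted = [1, 0, 1]   # relevant, irrelevant, relevant (in predicted rank order)
ideal     = [1, 1, 0]   # the best possible ordering of the same labels

print(dcg(predicted) / dcg(ideal))  # NDCG < 1: correct items, imperfect ranking
```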
Evaluating A Single Query
● Query Example: “Television Stands”
● Ground Truth Retrieved “Files”:
[Mobile TV Cart, Universal TV Stand, TV Stand (2 feet)]
● Predicted Retrieved “Files”:
Model A: [Mobile TV Cart, Universal TV Stand, Black Television]
Model B: [Mobile TV Cart, Universal TV Stand, TV Stand (2 feet)]
An evaluation score like recall will be higher for Model B
Why Evaluate with Navigable
Clustering?
Navigable clustering helps you identify clusters with
performance rates well below the average of 0.34!
(Callout: extremely low scores, on a 0-1 scale!)
Navigable Clustering intelligently
illuminates problems in the model!
a. Queries to the vector database are clustered & summarized with
keywords in the “name” column
b. The keywords come from your own data and are selected for distinctness
Navigable Clustering Output: pandas DataFrame
a. Navigation by sorting or filtering on a column is easy
b. Here we sorted the table to find groups of queries well below
the average performance of 0.34 for this E5 model (NDCG score), as in the sketch below
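A minimal pandas sketch of that navigation step; the column names and numbers here are hypothetical stand-ins, not actual Cobalt output:

```python
# Sketch: find clusters of queries scoring well below the dataset average.
import pandas as pd

clusters = pd.DataFrame({
    "name": ["tv stands", "espresso machine", "running shoes", "desk lamps"],
    "size": [120, 45, 210, 80],          # number of queries in the cluster
    "ndcg": [0.31, 0.04, 0.52, 0.11],    # average NDCG per cluster
})

dataset_average = 0.34
problem_clusters = (
    clusters[clusters["ndcg"] < 0.5 * dataset_average]
    .sort_values("ndcg")
)
print(problem_clusters)  # the clusters most worth investigating first
```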
Question: What if you trusted a popular model
and didn’t do an evaluation on your dataset?
Many types of e-commerce queries can perform poorly even on models like E5!
Performance is on a scale of 0 to 1 for this NDCG metric.
Average performance of this model on the dataset was 0.34.
Navigating to smaller clusters revealed performance rates close to zero!
Performance is on a scale of 0 to 1 for this NDCG metric. Average performance of this model on the dataset was 0.34.
Navigable Clustering
Machine Learning Lifecycle
Model Comparisons
Model Comparisons
● E5 was better on average than SBERT (0.34 vs. 0.26 NDCG)
● Yet many clusters of queries performed better with SBERT! (see the sketch below)
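A hedged pandas sketch of such a per-cluster comparison (cluster names and scores are illustrative, not the case-study results):

```python
# Sketch: per-cluster comparison of two models (hypothetical numbers).
import pandas as pd

scores = pd.DataFrame({
    "cluster": ["tv stands", "espresso machine", "running shoes"],
    "e5_ndcg": [0.31, 0.04, 0.52],
    "sbert_ndcg": [0.22, 0.19, 0.41],
})

scores["delta"] = scores["e5_ndcg"] - scores["sbert_ndcg"]
# Negative delta: SBERT wins on this cluster even though E5 wins on average.
print(scores.sort_values("delta"))
```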
Model Comparisons
● Imagine evaluating a switch from E5 large to E5 small to save storage costs
● What clusters of queries represent the greatest performance sacrifices for your application?
(Illustrative figure: per-cluster scores for E5 large vs. E5 small.)
Machine Learning Lifecycle
Fine Tuning
Monitoring Fine-Tuning
The E5 model averaged a score of 0.35 (NDCG); after fine-tuning, it averaged 0.45.
But surprise! Many groups of queries perform worse on the fine-tuned E5 model!
(Illustrative figure: cluster scores at epoch 1 vs. epoch 14.)
Time for a Fine-Tuning Intervention?
Machine Learning Lifecycle
Post Deployment
Post-Deployment: E-commerce Case Study
● The E5 model is bad at espresso-machine-related queries and other clusters:
- Weigh the risk of heavily promoting these products through marketing
- Consider using a simpler alternative to an embedding for queries in this cluster
- If it’s a chatbot application, consider routing to a human in the loop (see the sketch below)
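A minimal sketch of what such routing might look like; the cluster names and the assignment rule are hypothetical placeholders, not production logic:

```python
# Sketch: route queries from known-bad clusters away from the embedding model.
# Cluster names and the assignment rule below are hypothetical placeholders.
low_performing_clusters = {"espresso machine", "desk lamps"}

def assign_cluster(query: str) -> str:
    # Stand-in: real code would assign the nearest cluster from the TDA output.
    return "espresso machine" if "espresso" in query.lower() else "other"

def handle_query(query: str) -> str:
    if assign_cluster(query) in low_performing_clusters:
        return f"fallback (keyword search / human review): {query}"
    return f"default vector-database retrieval: {query}"

print(handle_query("best espresso machine under $200"))
print(handle_query("television stands"))
```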
Broader Workflows
and Preparing Data
Workflow to Choose the Best Embedding Model
1. Public Embedding Model Leaderboards:
MTEB helps with identifying relevant models that meet your constraints,
e.g. size, speed, cost, performance on public datasets
2. Evaluate Average Performance on Your Own Data
3. Navigable TDA Clustering!
● Identify high and low model performance scenarios on your own dataset
● Easily compare models, fine-tune, or deploy your models with more precision
How does Navigable TDA Clustering work with
Milvus/Zilliz?
Do you have model evaluation data computed?
Yes: simply pass it to Cobalt as a pandas DataFrame
Not yet:
Packages like BEIR and pytrec_eval can help with evaluating models on your data (see the
sketch below); we have example notebooks available
Without ground-truth data, you can still do TDA clustering
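For instance, a minimal pytrec_eval sketch (toy qrels and run, not our case-study data) that yields one score per query:

```python
# Sketch: one NDCG and MRR score per query with pytrec_eval (toy data).
import pytrec_eval

qrels = {  # ground-truth relevance labels per query
    "television stands": {
        "mobile_tv_cart": 1, "universal_tv_stand": 1, "black_television": 0,
    },
}
run = {    # model retrieval scores per query
    "television stands": {
        "mobile_tv_cart": 0.9, "black_television": 0.8, "universal_tv_stand": 0.3,
    },
}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg", "recip_rank"})
print(evaluator.evaluate(run))
# -> {"television stands": {"ndcg": ..., "recip_rank": ...}}
```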
Example Supported Performance Metrics
● Annotated ground truth: Precision, Recall, NDCG, MRR
● Live user behavior data: click-through rate, purchase rate
For each vector database query, it is natural to compute a performance metric
using predictions and ground truth!
What Data is Needed to Evaluate a Retrieval Model?
● Standard evaluations produce at least one performance score/metric for each query
● A pandas DataFrame containing queries and performance scores (see below) is
ready for Navigable TDA
(Figure: example metrics.)
Input to BluelightAI Cobalt API (single model): at least one score per query is needed.
Output: each row is a cluster of queries, with the average score per cluster
(calculated using the per-query score table above).
Input to BluelightAI Cobalt API, repeated for each model.
Output: each row is a cluster of queries, with the average scores per model!
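Under the hood, this input/output shape is just a per-query score table rolled up to per-cluster averages; a pandas sketch with hypothetical columns (not the Cobalt API itself):

```python
# Sketch: roll per-query scores up to per-cluster averages for each model.
import pandas as pd

per_query = pd.DataFrame({
    "query":   ["tv stand", "tv mount", "espresso pods", "espresso cup"],
    "cluster": ["tv", "tv", "espresso", "espresso"],
    "e5":      [0.8, 0.6, 0.1, 0.0],
    "sbert":   [0.5, 0.5, 0.3, 0.2],
})

per_cluster = per_query.groupby("cluster")[["e5", "sbert"]].mean()
print(per_cluster)  # one row per cluster, average score per model
```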
What if you don’t have evaluation scores?
Clustering can still help illuminate patterns of queries in your dataset
- Public Benchmarks like MTEB have evaluation scores
(though the data won’t be your own dataset, there are some clues in those clusters!)
Average performance was 0.34 on this dataset
Navigable Clustering revealed critical and actionable performance problems!
Concluding Thoughts
Why Navigable TDA Clustering with
BluelightAI Cobalt?
● Understand the performance breakdown of any embedding model on
your own dataset
● Compare models on the clusters from your data
● Improve your models or deploy your models with more precision
Resources
● GitHub: https://github.com/BlueLightAI
● Slack: https://bluelightai.com.slack.com/ssb/
● Documentation and Example Notebooks:
docs.cobalt.bluelightai.com/examples.html
Thank you!
gunnar.carlsson@bluelightai.com
gabriel.alon@bluelightai.com
