Architecture of a search engine

Paris Tech Talks #7 - April ’14
@sylvainutard - @algolia

Search
• Today, search means Google
• Search is a daily activity
• Search is complex
• Databases are (probably) not built to handle text queries
• Speed and relevance are key
• Fuzzy matching: typos!

Why Search engines?
• Databases
  • Optimized for INSERT/UPDATE/DELETE/SELECT (that's a lot)
  • Strong query syntax (mostly SQL)
  • Some operations scan all your documents (missing index?)

Why Search engines?
• Search engines
  • HIGHLY optimized for “SELECT” (only)
  • Full-text queries: they understand what a word is
  • Query execution time is driven by the number of matching documents
  • And obviously, “LIKE '%foo bar%'” is not full-text search
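To make the last point concrete, here is a minimal sketch contrasting a raw substring test (roughly what `LIKE '%foo bar%'` does) with word-based matching. The tokenizer, the helper names and the sample documents are assumptions made up for this illustration, not part of the talk.

```python
import re

def like_match(text: str, pattern: str) -> bool:
    """Rough stand-in for SQL LIKE '%pattern%': a raw substring test."""
    return pattern.lower() in text.lower()

def tokenize(text: str) -> list[str]:
    """Split text into lowercase words (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())

def word_match(text: str, query: str) -> bool:
    """True if every query word appears as a whole word in the text."""
    words = set(tokenize(text))
    return all(q in words for q in tokenize(query))

# Hypothetical documents chosen to show where the two approaches disagree.
docs = ["bar foo", "foo bar baz", "kungfoo bart"]

like_hits = [d for d in docs if like_match(d, "foo bar")]
word_hits = [d for d in docs if word_match(d, "foo bar")]

print(like_hits)  # ['foo bar baz', 'kungfoo bart']
print(word_hits)  # ['bar foo', 'foo bar baz']
```

The substring test misses "bar foo" (same words, different order) and accepts "kungfoo bart" (no matching word at all), which is why it does not qualify as full-text search.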

Why Search engines?
[Diagram: the application sends its searches to the search engine, which handles full-text search; the primary storage (DB, files, ...) pushes data to the search engine periodically or in real time.]

How it works
• Input = documents
  • Composed of multiple attributes (textual, numerical, geo)
• Output = documents
  • Full-text query and/or numerical filters
  • Understandable results: match score (ranking) + highlighting
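As a rough illustration of that input/output contract, the sketch below takes documents with textual, numerical and geo attributes and returns matches with a score and a highlighted snippet. The attribute names, the `<em>` highlighting tag and the occurrence-count score are all assumptions for this example; it also scans every document, which a real engine avoids by using an index.

```python
import re

# Hypothetical documents mixing textual, numerical and geo attributes.
documents = [
    {"id": 1, "name": "Foo bar", "price": 10, "geo": (48.85, 2.35)},
    {"id": 2, "name": "Bar baz", "price": 25, "geo": (45.76, 4.84)},
]

def highlight(text, word):
    """Wrap whole-word, case-insensitive occurrences of `word` in <em> tags."""
    pattern = rf"\b({re.escape(word)})\b"
    return re.sub(pattern, r"<em>\1</em>", text, flags=re.IGNORECASE)

def search(query, max_price=None):
    """Full-text match on `name`, optional numerical filter on `price`."""
    pattern = rf"\b{re.escape(query)}\b"
    results = []
    for doc in documents:
        occurrences = len(re.findall(pattern, doc["name"], flags=re.IGNORECASE))
        if occurrences == 0:
            continue  # no textual match
        if max_price is not None and doc["price"] > max_price:
            continue  # rejected by the numerical filter
        results.append({
            "id": doc["id"],
            "score": occurrences,  # naive score: number of occurrences
            "highlighted": highlight(doc["name"], query),
        })
    # Highest score first; ties keep their original order.
    return sorted(results, key=lambda r: r["score"], reverse=True)

print(search("bar"))
print(search("bar", max_price=15))
```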

Implementation
• 2 distinct processes
  • Indexing: storing documents in a highly optimized way to answer queries
  • Query
    • Matching documents
    • Ranking matched documents

Implementation: Indexing process
• Indexing means building an “index” or “inverted lists”
• A dedicated data structure optimized for search
• Input = a set of documents containing words
• Output = a set of words associated with documents

Implementation: Indexing process
Documents:
Doc 1: foo bar baz
Doc 2: bar foo
Doc 3: baz baz qux
Indexing produces the index (inverted lists):
foo: Doc 1, Doc 2
bar: Doc 1, Doc 2
baz: Doc 1, Doc 3
qux: Doc 3
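The indexing step for the three example documents can be sketched as follows. This is a minimal in-memory version that splits on whitespace; the function name and the choice of sorted Python lists for the inverted lists are assumptions made for the example.

```python
from collections import defaultdict

# The three example documents, keyed by document id.
documents = {1: "foo bar baz", 2: "bar foo", 3: "baz baz qux"}

def build_index(docs):
    """Map each word to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():  # naive whitespace tokenization
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

index = build_index(documents)
for word, doc_ids in sorted(index.items()):
    print(word, doc_ids)
# bar [1, 2]
# baz [1, 3]
# foo [1, 2]
# qux [3]
```

Keeping each inverted list sorted by document id is a common choice because it makes intersecting lists at query time cheap.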

Implementation: Query process
• Queries
  • Goal = retrieve all documents matching a user query
  • Order results from the highest ranked to the lowest

• 1-word query = inverted list lookup
User query "baz" → look up its inverted list in the index → sort matching documents → pagination
Index (inverted lists):
foo: Doc 1, Doc 2
bar: Doc 1, Doc 2
baz: Doc 1, Doc 3
qux: Doc 3
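A one-word query over that index could look like the sketch below: fetch the word's inverted list, sort the matches, then cut out one page. The `popularity` values used as the sort key and the paging parameters are hypothetical; the talk does not say which criterion is used here.

```python
# Inverted lists for the example documents.
index = {"foo": [1, 2], "bar": [1, 2], "baz": [1, 3], "qux": [3]}

# Hypothetical per-document popularity, used only as a sort key.
popularity = {1: 5, 2: 9, 3: 7}

def search_one_word(word, page=0, per_page=10):
    """Look up one inverted list, sort the matches, return one page of ids."""
    matches = index.get(word, [])  # unknown word: no match
    ranked = sorted(matches, key=lambda doc_id: popularity[doc_id], reverse=True)
    start = page * per_page
    return ranked[start:start + per_page]

print(search_one_word("baz"))                      # [3, 1]
print(search_one_word("baz", page=1, per_page=1))  # [1]
```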

• N-word query = inverted lists intersection
User query "baz qux" → intersect the inverted lists of "baz" and "qux" → sort matching documents → pagination
Index (inverted lists):
foo: Doc 1, Doc 2
bar: Doc 1, Doc 2
baz: Doc 1, Doc 3
qux: Doc 3
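The intersection step can be sketched with a two-pointer merge over sorted inverted lists. Starting from the shortest list is a common optimization and an assumption here, not something the slides state.

```python
# Inverted lists for the example documents, each sorted by document id.
index = {"foo": [1, 2], "bar": [1, 2], "baz": [1, 3], "qux": [3]}

def intersect(a, b):
    """Intersect two sorted lists of document ids in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search(query):
    """Return the ids of the documents containing every word of the query."""
    lists = [index.get(word, []) for word in query.split()]
    if not lists:
        return []
    lists.sort(key=len)  # start with the shortest list to keep work small
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result

print(search("baz qux"))  # [3]
print(search("foo bar"))  # [1, 2]
```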

• But how do you handle typing mistakes?
  • Edit-distance algorithms (e.g. Levenshtein)
    • levenshtein(bar, baz) = 1 (substitution)
    • levenshtein(bar, br) = 1 (deletion)
    • levenshtein(bar, foobar) = 3 (addition)
  • Comparing a word with all known words would be too costly
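For reference, the Levenshtein distance used in these examples can be computed with the classic dynamic-programming recurrence. This is a plain textbook sketch that keeps only two rows of the table; it is not claimed to be the implementation used by any particular engine.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character substitutions, deletions and
    insertions needed to turn `a` into `b`."""
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]

print(levenshtein("bar", "baz"))     # 1
print(levenshtein("bar", "br"))      # 1
print(levenshtein("bar", "foobar"))  # 3
```

Each call costs O(len(a) × len(b)), so running it against every word of a large dictionary for every query is what makes the naive approach too expensive.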

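• The word dictionary is stored in a trie to enable Levenshtein-based lookups (recursive traversal)
[Figure: a trie with character nodes (b, c, a, o, r, z, o, f); the nodes that end a word point to index entries listing documents and word positions: Doc 1 (pos=1, 3), Doc 2 (pos=3); Doc 1 (pos=2), Doc 3 (pos=1); Doc 1 (pos=4), Doc 3 (pos=2).]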
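A dictionary of this shape can be sketched as a trie whose word-ending nodes carry positional postings. The class names and the reuse of the three earlier example documents are assumptions for illustration; the words and positions in the original figure differ.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.postings = {}  # doc id -> word positions; filled only at word ends

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word, doc_id, position):
        """Record that `word` occurs in `doc_id` at `position`."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.postings.setdefault(doc_id, []).append(position)

    def lookup(self, word):
        """Exact lookup: postings of `word`, or {} if it was never indexed."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return {}
        return node.postings

# Index the earlier example documents, with 1-based word positions.
documents = {1: "foo bar baz", 2: "bar foo", 3: "baz baz qux"}
trie = Trie()
for doc_id, text in documents.items():
    for position, word in enumerate(text.split(), start=1):
        trie.add(word, doc_id, position)

print(trie.lookup("baz"))  # {1: [3], 3: [1, 2]}
print(trie.lookup("bar"))  # {1: [2], 2: [1]}
```

Words sharing a prefix share nodes ("bar" and "baz" share "b" and "a"), which is what lets a fuzzy lookup reuse work across similar words.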

Example: faz
[Figure: the same trie traversed for the query "faz"; each visited node is annotated with an edit distance between 0 and 3.]
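One well-known way to do such a traversal is to carry one row of the Levenshtein table down each branch of the trie and to abandon a branch as soon as every entry of the row exceeds the allowed distance. The sketch below assumes a nested-dict trie with a `"$"` end-of-word marker (so indexed words must not contain `"$"`) and a small made-up word list; the talk does not detail the actual algorithm.

```python
def build_trie(words):
    """Nested-dict trie; the "$" key marks the end of a word and stores it."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root

def fuzzy_search(trie, query, max_distance):
    """Return {word: distance} for indexed words within max_distance of query."""
    results = {}

    def walk(node, ch, previous_row):
        # current_row[j] = distance between the trie prefix ending with `ch`
        # and the first j characters of the query.
        current_row = [previous_row[0] + 1]
        for j in range(1, len(query) + 1):
            insert_cost = current_row[j - 1] + 1
            delete_cost = previous_row[j] + 1
            replace_cost = previous_row[j - 1] + (query[j - 1] != ch)
            current_row.append(min(insert_cost, delete_cost, replace_cost))
        if "$" in node and current_row[-1] <= max_distance:
            results[node["$"]] = current_row[-1]
        # Go deeper only if some extension can still be within max_distance.
        if min(current_row) <= max_distance:
            for next_ch, child in node.items():
                if next_ch != "$":
                    walk(child, next_ch, current_row)

    first_row = list(range(len(query) + 1))
    for ch, child in trie.items():
        if ch != "$":
            walk(child, ch, first_row)
    return results

trie = build_trie(["bar", "baz", "foo", "qux"])  # hypothetical dictionary
print(fuzzy_search(trie, "faz", 1))  # {'baz': 1}
print(fuzzy_search(trie, "faz", 2))  # {'bar': 2, 'baz': 1, 'foo': 2}
```

Because "bar" and "baz" share the prefix "ba", the rows for "b" and "a" are computed once and reused for both words.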

• How are the matching documents ranked?
  • Number of match occurrences? TF-IDF?
  • Numerical value reflecting popularity?
  • Number of typing mistakes?
  • Proximity between matched words?
  • …
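Several of these criteria can be combined by sorting on a tuple, each criterion breaking the ties left by the previous one. The order chosen below (fewest typos, then closest words, then most popular) and the `Match` fields are assumptions for this sketch; the slide only lists candidate criteria.

```python
from dataclasses import dataclass

@dataclass
class Match:
    doc_id: int
    typos: int       # number of typing mistakes tolerated to match
    proximity: int   # distance between matched words (lower is better)
    popularity: int  # business metric (higher is better)

def rank(matches):
    """Sort by typos, then proximity, then popularity (descending)."""
    return sorted(matches, key=lambda m: (m.typos, m.proximity, -m.popularity))

# Hypothetical matches for one query.
matches = [
    Match(doc_id=1, typos=1, proximity=1, popularity=50),
    Match(doc_id=2, typos=0, proximity=3, popularity=10),
    Match(doc_id=3, typos=0, proximity=1, popularity=5),
    Match(doc_id=4, typos=0, proximity=1, popularity=20),
]

ranked_ids = [m.doc_id for m in rank(matches)]
print(ranked_ids)  # [4, 3, 2, 1]
```

Document 1 is the most popular but ranks last because it needed a typo to match; a score-based scheme such as TF-IDF would weigh the criteria differently.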

Missing subjects
• What I didn't cover:
  • Numerical/geo queries (including operators)
  • Advanced query syntax (boolean operators, proximity operators)
  • Faceting & aggregations (categorization)
  • Sharding (horizontal scalability)
  • Incremental indexing (generational data structures)
  • … (see you next time)

Q/A
Now, or later: sylvain@algolia.com
