Architecture of a Search Engine
Paris Tech Talks #7 - April ’14
@sylvainutard - @algolia
Search
• Today, search means Google
• Search is a daily activity
• Search is complex
• Databases are (probably) not built to handle text queries
• Speed and relevance are key
• Fuzzy matching: typos!
Why Search engines?
• Databases
  • Optimized for INSERT/UPDATE/DELETE/SELECT (that's a lot)
  • Strong query syntax (mostly SQL)
  • Some operations scan all your documents (missing index?)
Why Search engines?
• Search engines
  • HIGHLY optimized for "SELECT" (only)
  • Full-text queries: they understand what a word is
  • Query execution time is driven by the number of matching documents
  • And obviously, "LIKE '%foo bar%'" is not full-text search
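To illustrate the last point, here is a minimal Python sketch contrasting a substring test (roughly what LIKE '%foo bar%' does) with a word-based match. The tokenizer and the function names are assumptions made for this example; real engines use far more elaborate text analysis.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def substring_match(text, query):
    """Rough equivalent of SQL LIKE '%query%': a case-sensitive substring test."""
    return query in text


def word_match(text, query):
    """True if every query word appears as a whole word in the text, in any order."""
    words = set(tokenize(text))
    return all(q in words for q in tokenize(query))


doc = "Bar stools and foo fighters"
print(substring_match(doc, "foo bar"))  # False: the exact substring is absent
print(word_match(doc, "foo bar"))       # True: both words are present
```

The substring test also has the opposite problem: it matches "foo" inside "food", which a word-based match does not.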
Why Search engines?
[Diagram: Application, Primary storage (DB, files, ...) and Search engine. Data is pushed to the search engine periodically or in realtime; full-text search queries go to the search engine.]
How it works
• Input = documents
  • Composed of multiple attributes (textual, numerical, geo)
• Output = documents
  • Full-text query and/or numerical filters
  • Understandable results: match score (ranking) + highlighting
Implementation
• 2 distinct processes
  • Indexing: storing documents in a way that is highly optimized for answering queries
  • Query
    • Matching documents
    • Ranking matched documents
Implementation: Indexing process
• Indexing means building an "index", also called "inverted lists"
  • A dedicated data structure optimized for search
  • Input = a set of documents containing words
  • Output = a set of words, each associated with the documents that contain it
Implementation: Indexing process
Documents:
  Doc 1: foo bar baz
  Doc 2: bar foo
  Doc 3: baz baz qux
Index (inverted lists), produced by indexing:
  foo: Doc 1, Doc 2
  bar: Doc 1, Doc 2
  baz: Doc 1, Doc 3
  qux: Doc 3
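The indexing step on this slide can be sketched in a few lines of Python. This is an illustration only: splitting on whitespace and the names `build_index` and `docs` are assumptions, not the way any particular engine does it.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the sorted list of ids of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # set() so that a word repeated in one document is listed only once
        for word in set(documents[doc_id].split()):
            index[word].append(doc_id)
    return dict(index)


docs = {1: "foo bar baz", 2: "bar foo", 3: "baz baz qux"}
index = build_index(docs)
for word in sorted(index):
    print(word, index[word])
```

Keeping each inverted list sorted by document id is a common choice because it makes later intersections cheap.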
Implementation: Query process
• Queries
  • Goal = retrieve all documents matching a user query
  • Order results from the highest ranked to the lowest
Implementation: Query process
• 1-word query = inverted list lookup
Index (inverted lists):
  foo: Doc 1, Doc 2
  bar: Doc 1, Doc 2
  baz: Doc 1, Doc 3
  qux: Doc 3
User query "baz": look up its inverted list, sort matching documents, paginate
Implementation: Query process
• N-words query = inverted lists intersection
Index (inverted lists):
  foo: Doc 1, Doc 2
  bar: Doc 1, Doc 2
  baz: Doc 1, Doc 3
  qux: Doc 3
User query "baz qux": intersect inverted lists, sort matching documents, paginate
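A minimal Python sketch of the query steps above: look up, intersect, sort, paginate. The function name `search`, its parameters, and the placeholder ordering by document id are assumptions for illustration; a real engine would plug its ranking in at the sort step.

```python
def search(index, query, page=0, hits_per_page=10):
    """Return one page of ids of the documents containing every query word."""
    lists = [set(index.get(word, ())) for word in query.split()]
    if not lists:
        return []
    # A 1-word query is a single lookup; N words intersect N inverted lists.
    matches = set.intersection(*lists)
    ranked = sorted(matches)  # placeholder: order by document id
    start = page * hits_per_page
    return ranked[start:start + hits_per_page]


index = {"foo": [1, 2], "bar": [1, 2], "baz": [1, 3], "qux": [3]}
print(search(index, "baz"))      # [1, 3]
print(search(index, "baz qux"))  # [3]
```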
Implementation: Query process
• But how do you handle typing mistakes?
  • Edit-distance algorithms (e.g. Levenshtein)
    • levenshtein(bar, baz) = 1 (substitution)
    • levenshtein(bar, br) = 1 (deletion)
    • levenshtein(bar, foobar) = 3 (insertions)
  • Comparing a word with all known words would be too costly
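For reference, the three distances above can be checked with the classic dynamic-programming formulation of Levenshtein distance. This is a textbook sketch, not necessarily the variant a given engine uses.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, deletions and insertions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("bar", "baz"))     # 1
print(levenshtein("bar", "br"))      # 1
print(levenshtein("bar", "foobar"))  # 3
```

Each call costs time proportional to the product of the two word lengths, which is why running it against every word of the dictionary does not scale.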
Implementation: Query process
• The word dictionary is stored in a trie to enable Levenshtein-based lookups (recursive traversal)
[Figure: a trie with nodes b, c, a, o, r, z, o, f; its words point to index entries holding documents and positions: Doc 1 (pos=1, 3), Doc 2 (pos=3); Doc 1 (pos=2), Doc 3 (pos=1); Doc 1 (pos=4), Doc 3 (pos=2)]
Implementation: Query process
Example: faz
[Figure: the same trie and index entries, with each of the eight nodes annotated with an edit distance for "faz": 1, 0, 1, 1, 2, 1, 2, 3]
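The traversal can be sketched in Python as follows. This is one common way to combine a trie with Levenshtein: each node extends the dynamic-programming row of its parent, and a branch is abandoned once no cell of the row is within the allowed distance. The class and function names are hypothetical, and the word list reuses foo, bar, baz, qux from the earlier slides rather than the exact trie of the figure.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on nodes that end a dictionary word


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word


def fuzzy_lookup(root, query, max_distance):
    """Return sorted (distance, word) pairs for words within max_distance of query."""
    results = []
    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, max_distance, results)
    return sorted(results)


def _walk(node, ch, query, previous_row, max_distance, results):
    # Extend the parent's Levenshtein row with the character of this node.
    current_row = [previous_row[0] + 1]
    for j in range(1, len(query) + 1):
        cost = 0 if query[j - 1] == ch else 1
        current_row.append(min(
            current_row[j - 1] + 1,
            previous_row[j] + 1,
            previous_row[j - 1] + cost,
        ))
    if node.word is not None and current_row[-1] <= max_distance:
        results.append((current_row[-1], node.word))
    # Prune: if every cell exceeds the budget, no descendant can match.
    if min(current_row) <= max_distance:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, current_row, max_distance, results)


root = TrieNode()
for w in ["foo", "bar", "baz", "qux"]:
    insert(root, w)

print(fuzzy_lookup(root, "faz", 1))  # [(1, 'baz')]
```

Words sharing a prefix share the rows computed for that prefix, which is what makes this cheaper than comparing the query with every word separately.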
Implementation: Query process
• How are the matching documents ranked?
  • Number of match occurrences? TF-IDF?
  • Numerical value reflecting popularity?
  • Number of typing mistakes?
  • Proximity between matched words?
  • …
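One simple way to combine several of these signals is to sort on a tuple of criteria. The sketch below is purely illustrative: the field names, the sample hits, and the order of the criteria (fewer typos first, then closer words, then higher popularity) are all assumptions, not a ranking formula stated in the talk.

```python
def rank(hits):
    """Order hits by fewest typos, then smallest proximity, then highest popularity."""
    return sorted(
        hits,
        key=lambda h: (h["typos"], h["proximity"], -h["popularity"]),
    )


# Hypothetical hits, as a matching step might produce them.
hits = [
    {"id": "A", "typos": 1, "proximity": 1, "popularity": 100},
    {"id": "B", "typos": 0, "proximity": 3, "popularity": 5},
    {"id": "C", "typos": 0, "proximity": 1, "popularity": 2},
    {"id": "D", "typos": 0, "proximity": 1, "popularity": 50},
]
print([h["id"] for h in rank(hits)])  # ['D', 'C', 'B', 'A']
```

A tuple sort is a tie-breaking scheme: a later criterion only matters when all earlier ones are equal. Score-based approaches such as TF-IDF instead blend signals into a single number.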
Several implementations
Missing subjects
• What I didn’t cover:
  • Numerical/geo queries (including operators)
  • Advanced query syntax (boolean operators, proximity operators)
  • Faceting & aggregations (categorization)
  • Sharding (horizontal scalability)
  • Incremental indexing (generational data structures)
  • … (see you next time)
Q/A
Now or later sylvain@algolia.com
