xavier@trovit.com

SPELLCHECKING IN TROVIT: IMPLEMENTING A
CONTEXTUAL MULTI-LANGUAGE
SPELLCHECKER FOR CLASSIFIED ADS
Xavier Sanchez Loro
R&D Engineer
Outline
•  Introduction
•  Our approach: Contextual Spellchecking
•  Nature and characteristics of our document corpus
•  Spellcheckers in Solr
•  White-listing and purging: controlling dictionary data
•  Spellchecker configuration
•  Customizing Solr’s SpellcheckComponent
•  Conclusions and Future Work
Supporting text for this speech
•  Trovit Engineering Blog post on spellchecking:
   http://tech.trovit.com/index.php/spellchecking-in-trovit/
INTRODUCTION
Introduction
•  Trovit: a search engine for classified ads
Introduction: spellchecking in Trovit
•  Multi-language spellchecking system using Solr and Lucene
•  Objectives:
   –  help our users better find the desired ads
   –  avoid the dreaded 0 results as much as possible
   –  not only pure orthographic correction, but also suggesting correct searches for a given site.
OUR APPROACH: CONTEXTUAL
SPELLCHECKING
Contextual Spellchecking: approach
•  The key element in the spellchecking process is choosing the right dictionary
   –  one with a relevant vocabulary
      •  according to the type of information included in each site.
•  Approach
   –  Specializing the dictionaries based on the user’s search context.
•  Search contexts are composed of:
   –  country (with a default language)
   –  vertical (determining the type of ads and vocabulary).
Contextual Spellchecking: vocabularies
•  Each site’s document corpus has a limited vocabulary
   –  reduced to the type of information, language and terms included in each site’s ads.
•  A more generalized approach is not suitable for our needs
   –  One vocabulary per language is less precise than a specialized vocabulary for each site.
   –  Drastic differences in
      •  the type of terms
      •  the semantics of each vertical.
   –  Terms that are relevant in one context are meaningless in another.
•  Different vocabularies for each site, even when supporting the same language.
   –  Vocabulary is tailored to the context of searches.
NATURE AND CHARACTERISTICS
OF OUR DOCUMENT CORPUS
Challenges: Inconsistencies in our corpus
•  The document corpus is fed by different third-party sources
   –  providing the ads for the different sites.
•  We can detect incorrect documents and reconcile certain inconsistencies
   –  But we cannot control or modify the content of the ads themselves.
•  Inconsistencies
   –  hinder any language detection process
   –  pose challenges to the development of the spellchecking system
Inconsistencies example
•  Spanish homes vertical
   –  not fully written in Spanish
   –  Ads in several languages.
      •  native languages: Spanish, Catalan, Basque and Galician.
      •  foreign languages: English, German, French, Italian, Russian… even Asian languages like Chinese!
      •  Multi-language ads
   –  badly written and misspelled words
      •  Spanish words badly translated from regional languages
      •  overtly misspelled words
         –  e.g. “picina” appears in 1,197 docs vs. 1,048,434 for “piscina” (≈0.1%)
   –  “noisy” content
      •  numbers, postal codes, references, etc.
Characteristics of our ads
•  Summarizing
   –  Segmented corpus in different indexes, one per country plus vertical (site)
   –  3rd-party generated
   –  Ads in the national language + other languages (regional and foreign)
   –  Multi-language content in ads
   –  Noisy content (numbers, references, postal codes, etc.)
   –  Small texts (around 3000 characters long)
   –  Misspellings and incorrect words
•  The corpus is unreliable as the knowledge base for building any spellchecking dictionary.
What/Where search segmentation
•  Geolocation data not mixed with vertical data (only vertical data, no geodata)
   –  Narrower dictionary, fewer collisions, more controllable
•  Geolocation data interleaved with vertical data (covers all geodata)
   –  Wider dictionary, more collisions, less controllable
SPELLCHECKERS IN SOLR
IndexBasedSpellchecker
•  Creates a parallel index for the spelling dictionary, based on an existing Lucene index.
   –  Depends on index data correctness (misspellings)
   –  Creates an additional index from the current index (small, MBs)
   –  Supports term frequency parameters
   –  Must (re)build
•  Even though this component behaves as expected
   –  it was of no use for Trovit’s use case.
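The slides do not show the configuration itself; a minimal solrconfig.xml sketch along the lines of the Solr Reference Guide would look as follows (the source field name and index directory are illustrative):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <!-- Field of the main index used as the source of the spelling dictionary -->
    <str name="field">content</str>
    <!-- Location of the parallel spelling index this component builds -->
    <str name="spellcheckIndexDir">./spellchecker</str>
    <!-- Rebuild on every commit to keep the parallel index in sync -->
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
```

The `buildOnCommit` setting illustrates the synchronicity cost discussed next: without continuous rebuilds, the spelling index drifts away from the main index.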
IndexBasedSpellchecker
•  It depends on index data
   –  not an accurate and reliable source for the spellchecking dictionary.
•  Continuous builds are needed to keep
   –  synchronicity between index data and spelling index data.
   –  If not:
      •  frequency information and hit counting are neither reliable nor accurate.
      •  false positives/negatives
      •  suggestions of words with a different number of hits, even 0.
•  We cannot risk suffering this situation.
FileBasedSpellChecker
•  Uses a flat file to generate a spelling dictionary in the form of a Lucene spellchecking index.
   –  Requires a dictionary file
   –  Creates an additional index from the dictionary file (small, MBs)
   –  Does not depend on index data (controlled data)
   –  Build once
      •  rebuild only if the dictionary is updated
   –  No frequency information used when calculating spelling suggestions
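A minimal configuration sketch for this component, following the Solr Reference Guide (file names and paths are illustrative; `spellings.txt` holds one dictionary term per line):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">file</str>
    <str name="classname">solr.FileBasedSpellChecker</str>
    <!-- The flat file is the entire knowledge base: fully controlled data -->
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">./spellcheckerFile</str>
  </lst>
</searchComponent>
```

Because the dictionary lives in a file we control, this is the configuration that gives the highest degree of control over dictionary contents, at the cost of manual maintenance.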
FileBasedSpellChecker
•  Requires rebuilds as well
   –  albeit less frequently
•  No frequency-related data
   –  Pure orthographic correction is not our main goal
   –  We cannot risk suggesting corrections without results.
•  But it provided
   –  insight on how to approach the final solution we are implementing.
   –  the highest degree of control over dictionary contents
      •  an essential feature for spelling dictionaries.
DirectSpellChecker
•  Experimental spellchecker that uses the main Solr index directly
   –  Build/rebuild is not required.
   –  Depends on index data correctness (misspellings)
   –  Uses the existing index
      •  field: source of the spelling dictionary.
   –  Supports term frequency parameters.
•  Several promising features
   –  No build + continuously in sync with index data.
   –  Provides accurate frequency information.
DirectSpellChecker
•  The real drawback
   –  lack of control over the index data sourcing the spelling dictionary.
•  If we can overcome it, this type would make an ideal candidate for our use case.
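A configuration sketch for DirectSolrSpellChecker, based on the Solr Reference Guide defaults (the `spell` field name and threshold values are illustrative, not taken from the slides):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">direct</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <!-- Read suggestions straight from this field of the main index: no build -->
    <str name="field">spell</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="minQueryLength">4</int>
    <!-- Term frequency thresholds: skip terms already common in the index -->
    <float name="maxQueryFrequency">0.01</float>
  </lst>
</searchComponent>
```

Since suggestions come straight from the live index, frequency data is always accurate, which is exactly the promising feature noted above; the `field` parameter is also the hook the white-listing approach later exploits.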
WordBreakSpellChecker
•  Generates suggestions by combining adjacent words and/or breaking words into multiple ones.
   –  Can be configured alongside a traditional checker (e.g. DirectSolrSpellChecker).
   –  The results are combined, and collations can contain a mix of corrections from both spellcheckers.
   –  Uses the existing index. No build.
WordBreakSpellChecker
•  Good complement to the other spellcheckers
•  It works really well with well-written concatenated words
   –  it is able to break them up with great accuracy.
•  Combining split words is not as accurate.
•  Drawback: it is based on index data.
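A sketch of the combined setup the slides describe, pairing WordBreakSolrSpellChecker with a traditional checker (names and parameter values are illustrative, following the Solr Reference Guide):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <!-- Traditional checker: handles plain misspellings -->
  <lst name="spellchecker">
    <str name="name">direct</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="field">spell</str>
  </lst>
  <!-- Word-break checker: splits concatenations and joins fragments -->
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">spell</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
  </lst>
</searchComponent>
```

At query time both dictionaries are requested together (`spellcheck.dictionary=direct&spellcheck.dictionary=wordbreak`), so collations can mix corrections from the two checkers.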
WHITE-LISTING AND PURGING:
CONTROLLING DICTIONARY
DATA
White-listing
•  Any spelling system can only be as good as its knowledge base or dictionary is accurate.
•  We need to control the data indexed as the dictionary.
•  White-listing approach
   –  we only index spelling data contained in a controlled dictionary list.
   –  processes to build a base dictionary specialized for a given site.
White-list building process
SPELLCHECKER CONFIGURATION
Initial spellchecker configuration
•  DirectSpellChecker using a purged spell field
   –  Spell field filled with purged content
      •  Purging according to the whitelist
      •  Whitelist generated by matching the dictionary against index words, after the purge process
•  Benefits:
   –  Build is no longer required.
   –  Spell field is automatically updated via the pipeline.
   –  We can work with term frequencies.
   –  No additional index, just an additional field.
   –  Better relevance and suggestions.
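The schema side of this setup is not shown in the slides; a plausible sketch of the dedicated spelling field is below. Note that the field is populated by the indexing pipeline with already-purged content (not via a `copyField`), and the analysis chain is kept minimal so dictionary terms stay close to what users type. All names here are illustrative:

```xml
<!-- Field type for the whitelist-purged spelling dictionary -->
<fieldType name="text_spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The extra field DirectSpellChecker reads; no separate spelling index -->
<field name="spell" type="text_spell" indexed="true" stored="false"
       multiValued="true"/>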
Initial spellchecker configuration
•  Cons:
   –  Whitelist maintenance and creation for new sites.
•  Features:
   –  Accurate detection of misspelled words.
   –  Good detection of concatenated words.
      •  piscinagarajejardin to piscina garaje jardin
      •  picina garajejardin to piscina (garaje jardin)
   –  Able to detect several misspelled words.
   –  Evolution based on whitelist fine-tuning.
Initial spellchecker configuration
•  Issues:
   –  False negatives: suggestions of corrections even when words are correctly spelled.
   –  Suggestions for all the words in the query, not just the misspelled ones.
   –  Misguiding “correctlySpelled” parameter.
      •  The parameter depends on frequency information, making it unreliable for our purposes.
      •  It returns true/false according to thresholds,
         –  not really depending on word distance, but on
         –  results found and the “alternativeTermCount” and “maxResultsForSuggest” thresholds.
   –  Minor discrepancies if we only index boosted terms (i.e. qf)
      •  # hits in the spell field < # docs in the index
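For context, the two thresholds mentioned above are request parameters, typically set as handler defaults; a sketch along the lines of the Solr Reference Guide (handler name and values are illustrative):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">direct</str>
    <!-- Consider alternatives even for terms that exist in the index -->
    <str name="spellcheck.alternativeTermCount">5</str>
    <!-- Below this hit count, suggestions are still offered and
         correctlySpelled may be reported as false -->
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

Because `correctlySpelled` is derived from these hit-count thresholds rather than from edit distance, it can flag a perfectly valid low-frequency term as misspelled, which is the unreliability the next section addresses.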
CUSTOMIZING SOLR
SPELLCHECKCOMPONENT
Hacking SpellcheckComponent
•  Lack of reliability of the “correctlySpelled” parameter
   –  Difficult to know when to give a suggestion or not.
   –  First policy was based on document hits
      •  a sliding window of thresholds based on the number of queried terms
         –  the longer the tail, the smaller the threshold
      •  inaccurate and prone to collisions.
   –  Difficult to set thresholds to a good level of accuracy.
•  We needed a more reliable way.
Hacking SpellcheckComponent: “correctlySpelled” parameter behaviour
•  Binary approach to deciding whether a word is correctly spelled or not.
•  Simpler approach
   –  any term that appears in our spelling field is a correctly spelled word
      •  regardless of its frequency info or the configured thresholds.
   –  this way the parameter can be used to control when to start querying the spellchecking index.
Hacking SpellcheckComponent
•  Other changes to the SpellcheckComponent:
   –  No suggestions when words are correctly spelled.
   –  Only makes suggestions for misspelled words, not for all words
      •  e.g. piscina garage -> piscina garaje
•  Spanish-friendly ASCIIFoldingFilter
   –  modified so as not to fold the “ñ” (for Spanish) and “ç” (for Catalan names) characters.
      •  Avoids collisions with similar words containing “n” and “c”
         –  e.g. “pena” and “peña”
   –  Still folds accented vowels
      •  usually omitted by users.
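The slides describe a patched Java filter; a config-only approximation of the same behaviour is possible with a mapping char filter and a custom mapping file that folds accented vowels but deliberately omits any mapping for “ñ” and “ç”. This is an alternative sketch, not the implementation from the talk, and the file name is hypothetical:

```xml
<analyzer>
  <!-- mapping-fold-es.txt maps "á"=>"a", "é"=>"e", "í"=>"i", "ó"=>"o",
       "ú"=>"u", "ü"=>"u" … and contains NO entries for "ñ" or "ç",
       so "pena" and "peña" remain distinct dictionary terms -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-fold-es.txt"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

The trade-off is that a mapping file must be maintained per language context, whereas the patched filter centralizes the exception list in code.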
CONCLUSIONS AND FUTURE
WORK
Conclusion & Future Work
•  Base code
   –  expand the spellchecking process to other sites
   –  design the final policy to decide when to give suggestions or not.
•  Geodata in homes verticals
   –  find ways to avoid collisions in large dictionary sets.
•  Scoring system for the spelling dictionary
   –  Control suggestions based on user input
      •  Feedback on the relevance or quality of our spellchecking suggestions.
      •  Makes the system more accurate and reliable
      •  Expand whitelists to cover large amounts of geodata
         –  with acceptable levels of precision.
Conclusion & Future Work
•  Plural suggester
   –  suggest alternative searches and corrections using plural or singular variants of the terms in the query.
   –  Use frequency and scoring information to choose the most suitable suggestions.
THANKS FOR YOUR ATTENTION!
ANY QUESTIONS?
References
[1] Lucene/Solr Revolution EU 2013. Dublin, 6-7 November 2013.
http://www.lucenerevolution.org/
[2] Trovit – A search engine for classified ads of real estate, jobs, cars and vacation
rentals. http://www.trovit.com
[3] Apache Software Foundation. “Apache Solr” https://lucene.apache.org/solr/
[4] Apache Software Foundation. “Apache Lucene” https://lucene.apache.org
[5] Apache Software Foundation. “Spell Checking – Apache Solr Reference Guide –
Apache Software Foundation”
https://cwiki.apache.org/confluence/display/solr/Spell+Checking
