Advanced Document Similarity
With Apache Lucene
Alessandro Benedetti, Software Engineer, Sease Ltd.
Who I am
Alessandro Benedetti
● Search Consultant
● R&D Software Engineer
● Master in Computer Science
● Apache Lucene/Solr Enthusiast
● Passionate about Semantic, NLP and Machine Learning Technologies
● Beach Volleyball Player & Snowboarder
Sease Ltd
Search Services
● Open Source Enthusiasts
● Apache Lucene/Solr experts
● Community Contributors
● Active Researchers
● Hot Trends: Learning To Rank, Document Similarity, Measuring Search Quality, Relevancy Tuning
Agenda
● Document Similarity
● Apache Lucene More Like This
● Term Scorer
● BM25
● Interesting Terms Retrieval
● Query Building
● DEMO
● Future Work
● JIRA References
Real World Use Cases - Streaming Services
Real World Use Cases - Hotels
Document Similarity
Problem: find documents similar to a seed document
Solution(s):
● Collaborative approach (user interactions)
● Content Based
● Hybrid
Similar?
● Documents accessed in association with the input one by users close to you
● Term distributions
● All of the above
Apache Lucene
Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java.
It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
Apache Lucene is an open source project available for free download.
Apache Lucene
● Search Library (Java)
● Structured Documents
● Inverted Index
● Similarity Metrics (TF-IDF, BM25)
● Fast Search
● Support for advanced queries
● Relevancy tuning
Inverted Index
Indexing
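As a rough illustration of the structure named on this slide, here is a minimal sketch of an inverted index in plain Python. It is not how Lucene stores its index; the function name `build_inverted_index`, the whitespace tokenization and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it.

    Tokenization is a naive lowercase whitespace split (an assumption;
    a real engine would run an analyzer here).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

# Hypothetical sample documents.
docs = {
    1: "Apache Lucene search library",
    2: "Lucene inverted index",
    3: "Document similarity",
}
index = build_inverted_index(docs)
# The document frequency of a term is the size of its posting set.
lucene_doc_freq = len(index["lucene"])
```

The document frequency read from such a structure is the term statistic that the term scorers later in the deck rely on.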
More Like This
Pros
● Apache Lucene Module
● Advanced Params
● Input:
  - structured document
  - just text
● Builds an advanced query
● Leverages the Inverted Index (and additional data structures)
Cons
● Massive single class
● Low cohesion
● Low readability
● Minimal test coverage
● Difficult to extend (and improve)
More Like This - Break Up
[Diagram components: Input Document, More Like This Params, Interesting Terms Retriever, Term Scorer, Query Builder, QUERY]
More Like This Params
Responsibility: define a set of parameters (and defaults) that affect the various components of the More Like This module
● Regulates MLT behavior
● Groups parameters specific to each component
● Javadoc documentation
● Default values
● Useful container for the various parameters to be passed
Term Scorer
Responsibility: assign a score to a term that measures how distinctive the term is for the input document
● Field Name
● Field Stats (Document Count)
● Term Stats (Document Frequency)
● Term Frequency
● TF-IDF -> tf * (log(numDocs / (docFreq + 1)) + 1)
● BM25
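The TF-IDF term score above can be sketched in a few lines. This reads the slide's formula as dividing by `(docFreq + 1)`, which is an assumption about the intended grouping; the function name and the sample statistics are made up for illustration.

```python
import math

def tf_idf_score(tf, doc_freq, num_docs):
    """tf * (log(numDocs / (docFreq + 1)) + 1), read from the slide.

    The (docFreq + 1) grouping is an assumption about the formula.
    """
    idf = math.log(num_docs / (doc_freq + 1)) + 1.0
    return tf * idf

# Same term frequency, different document frequency (hypothetical numbers).
rare_term = tf_idf_score(tf=3, doc_freq=9, num_docs=1000)
common_term = tf_idf_score(tf=3, doc_freq=499, num_docs=1000)
```

With equal term frequency, the term that appears in fewer documents gets the higher score, which is what makes it "distinctive" for the input document.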
BM25 Term Scorer
● Originates from Probabilistic Information Retrieval
● Default Similarity from Lucene 6.0 [1]
● 25th iteration in improving TF-IDF
● TF
● IDF
● Document Length
[1] LUCENE-6789
BM25 Term Scorer - Inverse Document Frequency
IDF Score has very similar behavior
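For reference, a minimal sketch of the BM25 inverse document frequency component, using the commonly cited form log(1 + (N - n + 0.5) / (n + 0.5)). The function name and the sample numbers are assumptions for the example.

```python
import math

def bm25_idf(doc_freq, doc_count):
    """BM25 IDF: log(1 + (N - n + 0.5) / (n + 0.5)).

    N is the number of documents, n the document frequency of the term.
    """
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical statistics over a 1000-document index.
idf_rare = bm25_idf(doc_freq=1, doc_count=1000)
idf_half = bm25_idf(doc_freq=500, doc_count=1000)
idf_everywhere = bm25_idf(doc_freq=1000, doc_count=1000)
```

The score decreases as the term appears in more documents and stays positive even for a term present in every document.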
BM25 Term Scorer - Term Frequency
TF Score approaches (k+1) asymptotically (k=1.2 in this example)
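The saturation described here can be checked numerically. This is a sketch of the term frequency component without length normalization, tf * (k + 1) / (tf + k), with k=1.2 as on the slide; the function name is an assumption.

```python
def bm25_tf(tf, k=1.2):
    """BM25 TF component without length normalization.

    Grows with tf but is bounded above by (k + 1).
    """
    return tf * (k + 1.0) / (tf + k)

# Score for increasing term frequencies.
tf_scores = [bm25_tf(tf) for tf in (1, 2, 5, 10, 1000)]
```

Each additional occurrence adds less than the previous one, so a term repeated many times cannot dominate the score the way it can with raw TF.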
BM25 Term Scorer - Document Length
Document Length / Avg Document Length affects how fast we saturate the TF score
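A sketch of how the length ratio enters the TF component, assuming the usual BM25 normalization with a parameter b. The value b=0.75 is the commonly used default and is an assumption here, as are the function name and the sample lengths.

```python
def bm25_tf_with_length(tf, doc_len, avg_doc_len, k=1.2, b=0.75):
    """BM25 TF component with document length normalization.

    b=0.75 is assumed (a common default); b=0 disables normalization.
    """
    norm = k * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k + 1.0) / (tf + norm)

# Same term frequency in a short, an average and a long document.
short_doc = bm25_tf_with_length(tf=2, doc_len=50, avg_doc_len=100)
average_doc = bm25_tf_with_length(tf=2, doc_len=100, avg_doc_len=100)
long_doc = bm25_tf_with_length(tf=2, doc_len=200, avg_doc_len=100)
```

For the same term frequency, a shorter-than-average document gets closer to the (k+1) ceiling, while a longer one saturates more slowly.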
Interesting Term Retriever
Responsibility: retrieve from the document a queue of weighted interesting terms
● Analyze content / Term Vector
● Skip Tokens
● Score Tokens
● Build Queue of Top Scored terms
Params Used
● Analyzer
● Max Num Token Parsed
● Min Term Frequency
● Min/Max Document Frequency
● Max Query Terms
● Query Time Field Boost
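The retrieval steps can be sketched roughly as follows. This is not the Lucene implementation: the input is assumed to be already analyzed into tokens, the scorer is the TF-IDF formula read as `tf * (log(numDocs / (docFreq + 1)) + 1)`, and the function name, parameter names and sample statistics are hypothetical.

```python
import heapq
import math
from collections import Counter

def interesting_terms(tokens, doc_freqs, num_docs, min_tf=2, min_df=5,
                      max_df=None, max_query_terms=25, max_tokens=5000):
    """Return the top scored (score, term) pairs for one field's tokens."""
    # Analyze content: here the tokens are assumed to be already analyzed.
    term_freqs = Counter(tokens[:max_tokens])
    scored = []
    for term, tf in term_freqs.items():
        df = doc_freqs.get(term, 0)
        # Skip tokens that are too rare in the document or in the index.
        if tf < min_tf or df < min_df:
            continue
        # Skip tokens that are too common in the index.
        if max_df is not None and df > max_df:
            continue
        # Score tokens (TF-IDF stand-in for the pluggable term scorer).
        idf = math.log(num_docs / (df + 1)) + 1.0
        scored.append((tf * idf, term))
    # Build the queue of top scored terms.
    return heapq.nlargest(max_query_terms, scored)

tokens = "lucene lucene lucene search search the the the the index".split()
doc_freqs = {"lucene": 10, "search": 50, "the": 990, "index": 40}
top = interesting_terms(tokens, doc_freqs, num_docs=1000,
                        max_df=900, max_query_terms=2)
top_terms = [term for _, term in top]
```

In this example "index" is skipped for its low term frequency and "the" for its high document frequency, leaving "lucene" and "search" as the interesting terms.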
More Like This Query Builder
Params Used
● Term Boost Enabled
Interesting terms (field : term, score):
Field1 : Term1  3.0
Field2 : Term2  4.0
Field1 : Term3  4.5
Field1 : Term4  4.8
Field3 : Term5  7.5
Q = Field1:Term1^3.0 Field2:Term2^4.0 Field1:Term3^4.5 Field1:Term4^4.8 Field3:Term5^7.5
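The query on this slide can be reproduced with a small sketch that joins the scored terms into a query string. It builds plain text rather than Lucene query objects, and the function name `build_mlt_query` is an assumption.

```python
def build_mlt_query(scored_terms, term_boost=True):
    """Join (field, term, score) triples into a boosted query string."""
    clauses = []
    for field, term, score in scored_terms:
        clause = f"{field}:{term}"
        if term_boost:
            # The term score becomes the boost of the clause.
            clause += f"^{score}"
        clauses.append(clause)
    return " ".join(clauses)

# The interesting terms and scores shown on the slide.
scored_terms = [
    ("Field1", "Term1", 3.0),
    ("Field2", "Term2", 4.0),
    ("Field1", "Term3", 4.5),
    ("Field1", "Term4", 4.8),
    ("Field3", "Term5", 7.5),
]
boosted = build_mlt_query(scored_terms)
unboosted = build_mlt_query(scored_terms, term_boost=False)
```

With term boost disabled, every clause carries the same weight and only the selection of terms reflects their scores.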
More Like This Boost
Term Boost
● on/off
● Affects each term weight in the MLT query
● It is the term score (it depends on the Term Scorer implementation chosen)
Field Boost
● field1^5.0 field2^2.0 field3^1.5
● Affects the Term Scorer
● Affects the interesting terms retrieved
N.B. a highly boosted field can dominate the interesting terms retrieval
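To illustrate the N.B. above, here is a small sketch that assumes the field boost simply multiplies the term score (an assumption; the slide does not say how the boost is applied). The terms and unboosted scores are made up.

```python
def top_boosted_terms(scored_terms, field_boosts, limit):
    """Rank (field, term, score) triples after applying field boosts.

    Assumes boost is applied as score * boost; unlisted fields get 1.0.
    """
    boosted = [
        (score * field_boosts.get(field, 1.0), field, term)
        for field, term, score in scored_terms
    ]
    return sorted(boosted, reverse=True)[:limit]

# Hypothetical unboosted scores: "c" is the most distinctive term.
scored_terms = [("field1", "a", 1.0), ("field1", "b", 0.9), ("field3", "c", 2.5)]
field_boosts = {"field1": 5.0, "field2": 2.0, "field3": 1.5}
top = top_boosted_terms(scored_terms, field_boosts, limit=2)
top_terms = [term for _, _, term in top]
```

Both selected terms come from field1: the term from field3 had the highest unboosted score but is pushed out once the boosts are applied.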
More Like This Usage - Lucene Classification
● Given a document D to classify
● K Nearest Neighbours Classifier
● Find Top K similar documents to D ( MLT)
● Classes are extracted
● Class Frequency + Class ranking -> Class probability
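The last step can be sketched as follows. This is one possible reading, not the Lucene classification module: each neighbour's class is weighted by its similarity score as a stand-in for "class ranking", and the weights are normalized into probabilities. The function name and sample neighbours are hypothetical.

```python
from collections import defaultdict

def knn_class_probabilities(neighbours):
    """Turn ranked (class, similarity score) neighbours into probabilities.

    Weighting classes by neighbour score is an assumption for illustration.
    """
    weights = defaultdict(float)
    for cls, score in neighbours:
        weights[cls] += score
    total = sum(weights.values())
    if total == 0:
        return {}
    return {cls: weight / total for cls, weight in weights.items()}

# Hypothetical top 3 similar documents returned by More Like This.
neighbours = [("comedy", 3.0), ("drama", 2.0), ("comedy", 1.0)]
probabilities = knn_class_probabilities(neighbours)
predicted = max(probabilities, key=probabilities.get)
```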
More Like This Usage - Apache Solr
● More Like This query parser (can be concatenated with other queries)
● More Like This search component (can be assigned to a Request Handler)
● More Like This handler (handler with specific request parameters)
More Like This Demo - Movie Data Set
This data consists of the following fields:
● id - unique identifier for the movie
● name - Name of the movie
● directed_by - The person(s) who directed the making of the film
● initial_release_date - The earliest official initial film screening date in any country
● genre - The genre(s) that the movie belongs to
More Like This Demo - Tuned
● Enable/Disable Term Boost
● Min Term Frequency
● Min Document Frequency
● Field Boost
● Ad Hoc fields ( ngram analysis)
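As a rough sketch of how these knobs might be passed to Solr, the snippet below builds a request URL using the More Like This search component parameters (`mlt`, `mlt.fl`, `mlt.mintf`, `mlt.mindf`, `mlt.boost`, `mlt.qf`). The host, the `movies` core name, the seed query and the chosen values are assumptions, and no request is actually sent.

```python
from urllib.parse import urlencode

def mlt_request_url(base_url, seed_query, fields, min_tf=1, min_df=1,
                    term_boost=True, field_boosts=None):
    """Build a Solr request URL enabling the More Like This component."""
    params = {
        "q": seed_query,
        "mlt": "true",
        "mlt.fl": ",".join(fields),
        "mlt.mintf": min_tf,
        "mlt.mindf": min_df,
        "mlt.boost": "true" if term_boost else "false",
    }
    if field_boosts:
        # Query time field boosts, e.g. "name^5.0 genre^2.0".
        params["mlt.qf"] = " ".join(f"{f}^{b}" for f, b in field_boosts.items())
    return f"{base_url}?{urlencode(params)}"

# Hypothetical local Solr core holding the movie data set.
url = mlt_request_url(
    "http://localhost:8983/solr/movies/select",
    seed_query="id:1",
    fields=["name", "genre"],
    field_boosts={"name": 5.0, "genre": 2.0},
)
```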
Future Work
● Query Builder just uses Terms and Term Score
● Term Positions?
● Phrase Queries Boost (for terms close in position)
● Sentence boundaries
● Field centric vs Document centric (should highly boosted fields kick out relevant terms from low boosted fields?)
Future Work - More Like These
● Multiple documents in input
● Interesting terms across documents
● Useful for Content Based recommender engines
JIRA References
● LUCENE-7498 - Introducing BM25 Term Scorer
● LUCENE-7802 - Architectural Refactor
Questions ?
Arigato !
ありがとう !
