Let's Build an Inverted Index: Introduction to Apache Lucene/Solr

Seminars
Let’s Build an Inverted Index:
Introduction to Apache
Lucene/Solr 
Alessandro Benedetti, Software Engineer 
 
Andrea Gazzarini, Software Engineer
28th November 2019

Seminars
▪ R&D Software Engineer
▪ Search Consultant
▪ Director
▪ Master Degree in Computer Science
▪ Apache Lucene/Solr Enthusiast
▪ Semantic, NLP, Machine Learning
Technologies passionate
▪ Conference Speaker
▪ Beach Volleyball Player &
Snowboarder
Alessandro Benedetti

Seminars
▪ Software Engineer (1999-)
▪ “Hermit” Software Engineer (2010-)
▪ Java & Information Retrieval Passionate
▪ Apache Qpid (past) Committer
▪ Husband & Father
▪ Bass Player
Andrea Gazzarini, “Gazza”

Seminars
Search Services 
● London Based - Italian made :)
● Open Source Enthusiasts
● Apache Lucene/Solr experts
! Community Contributors
● Active Researchers
● Hot Trends : Learning To Rank,  
Document Similarity,
Search Quality Evaluation, 
Relevancy Tuning

Seminars
Why should you use Open Source?
• State of the Art / very valid technologies
• Community Support
• Vast Documentation
• Code is accessible!
• Customisable
• Mostly free licensing

Seminars
Why should you contribute to Open Source?
• Share knowledge and ideas
• Improve established technologies
• Become part of a Community
• Not only code - all your skills are relevant!
• Be useful to the world

Seminars
We only deal with Open Source Informational
Retrieval … Revenue ?
● Trainings - Beginner/Intermediate/Advance/Ad Hoc for 
Information Retrieval, Apache Lucene/Solr, Search Relevance, Learning To Rank…
● Consulting - Open Source Software is ubiquitous/ Expertise ? Not really 
! R&D Projects - Cheaper and more flexible for Companies using Open Source 
! IR Projects - From the Client requirements collection till the Software delivery

Seminars
Information Retrieval
“Information retrieval (IR) is the activity of
obtaining information system resources relevant to
an information need from a collection of
information resources. Searches can be based on
full-text or other content-based indexing.
Information retrieval is the science of searching for
information in a document, searching for
documents themselves, and also searching for
metadata that describe data, and for databases of
texts, images or sounds.” Wikipedia
Information Need
Corpus

Seminars
Apache Lucene
• http://lucene.apache.org
• High-performance, scalable information retrieval software *library*
• Enables search capabilities to your applications
• Cohesive and simple interface, which hides a really complex world
• Open Source: Apache Top Level Project

Seminars
Apache Lucene - Brief History
2019 Lucene 8.3.0 (November)

Seminars
Apache Solr
• http://lucene.apache.org/solr
• Highly reliable, scalable and fault tolerant search *server*
• A Lucene “serverization” with a lot of additional features
• All services are exposed through a HTTP (REST-Like interface)
• Written in Java
• Rich ecosystem for building enterprise-level applications (Plugins,
Integrations, Clients)
• Open Source: Apache Top Level Project
“Solr is the popular, blazing-fast, open
source enterprise search platform built on
Apache Lucene™.”

Seminars
Apache Solr - Brief History
Version 8.3.0 (November)2019

Seminars
The Inverted Index
The Inverted Index is the basic data structure
used by Lucene to provide Search in a corpus of
documents.
From wikipedia :
“In computer science, an inverted index (also
referred to as postings file or inverted file) is an
index data structure storing a mapping from
content, such as words or numbers, to its locations
in a database file, or in a document or a set of
documents.”

Seminars
The Lucene Document
Document
Field ValueField Name
• Documents are the unit of information  
for indexing and search.
• A Document is a set of ﬁelds.
• Each ﬁeld has a name and a value.

Seminars
The Lucene Inverted Index

Seminars
The Lucene Inverted Index
• Lucene directory (in memory, on disk, memory mapped)
• Collection of immutable segments (fully working)
• Each segment is composed by a set of binary ﬁles[1]
[1] Lucene File Format Documentation
Indexes evolve by:
1. Creating new segments for newly added documents.
2. Merging existing segments.

Seminars
Schema Configuration
• Per collection/index
• Xml file
• Define how the inverted Index will be built
• Fields/Field Types definition

Seminars
Schema Configuration
• Define flexible expressions for groups of fields
• Shared attributes for each field instance
• Copy the source content to a destination field
• Allow to run multiple analysis chains for the same content

Seminars
Field Type
• Deﬁne how the single terms (in the inverted index) will be generated out of the content
Index Time
Query Time
Analysis chain executed
when building the index
Analysis chain executed
when building the query

Seminars
Text Analysis
• Only text ﬁelds types (e.g. solr.TextField or subclasses) have a text analysis chain associated
An analyzer can deﬁne
• Zero or more CharFilter
• One and only one Tokenizer
• Zero or more TokenFilter

Seminars
Char Filters
• CharFilter is a component that pre-processes input characters.
• CharFilters can be chained like Token Filters and placed in front of a Tokenizer.
• CharFilters can add, change, or remove characters 
while preserving the original character oﬀsets to support features like highlighting.

Seminars
Tokenizers
Tokenizers are responsible for breaking ﬁeld data into lexical units, or tokens.[1]
[1] https://lucene.apache.org/solr/guide/8_3/tokenizers.html

Seminars
Token Filters
Filters[1] examine a stream of tokens and keep them, transform them or discard them,  
depending on the ﬁlter type being used.
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html

Seminars
Word Delimiters Filter
• Improve recall
• Dedicated Filters:  
solr.WordDelimiterGraphFilterFactory
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#word-delimiter-graph-filter
Example:
Default behavior. The whitespace tokenizer is used here to preserve non-alphanumeric characters.
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterGraphFilterFactory"/>
<filter class="solr.FlattenGraphFilterFactory"/> 
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterGraphFilterFactory"/>
</analyzer> 
In: "hot-spot RoboBlaster/9000 100XL"
Tokenizer to Filter: "hot-spot", "RoboBlaster/9000", "100XL"
Out: "hot", "spot", "Robo", "Blaster", "9000", "100", "XL"

Seminars
Stopword Filters
• Reduce index size
• Can improve precision (removing terms with low semantic value)
• Can improve recall
• Dedicated Filters: solr.StopFilterFactory, solr.ManagedStopFilterFactory
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#stop-filter
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<ﬁlter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "what"(4)

Seminars
Stemmers
• Improve Recall
• Reduce index size
• Dedicated Filters: solr.EnglishMinimalStemFilterFactory, solr.HunspellStemFilterFactory, solr.KStemFilterFactory, 
solr.PorterStemFilterFactory, solr.SnowballPorterFilterFactory
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#porter-stem-filter
Example:
<ﬁlter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
In: "dogs cats"
Tokenizer to Filter: "dogs", "cats"
Out: "dog", "cat"

Seminars
Synonym Filters[1/2]
• Improve Recall
• Dedicated Filters: solr.SynonymGraphFilterFactory
• Index Time -> aﬀect terms distributions, needs re-indexing
• Query Time -> more ﬂexible
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#synonym-graph-filter
couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny

Seminars
Synonym Filters[2/2]
• Improve Recall
• Dedicated Filters:  
solr.SynonymGraphFilterFactory
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#synonym-graph-filter
Example:
<filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
<filter class="solr.FlattenGraphFilterFactory"/> 
</analyzer>
<analyzer type="query">
<filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer> 
In: "teh small couch"
Tokenizer to Filter: "teh"(1), "small"(2), "couch"(3)
Out: "the"(1), "tiny"(2), "teeny"(2), "weeny"(2), "couch"(3), "sofa"(3), "divan"(3)

Seminars
Keep Word Filter
• Help in Entity tagging
• Dedicated Filters: solr.KeepWordFilterFactory
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#keep-word-filter
Example:
<analyzer>
<ﬁlter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
</analyzer>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "Happy", "funny"

Seminars
N-Gram Filtering
• Improve Recall
• Ideal for autocompletion
• Dedicated Filters: solr.EdgeNGramFilterFactory, solr.NGramFilterFactory
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#edge-n-gram-filter
<analyzer>
<ﬁlter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="4"/>
</analyzer>

Seminars
Phonetic Matching
• Improve Recall
• Dedicated Filters: solr.BeiderMorseFilterFactory, solr.DaitchMokotoffSoundexFilterFactory,
solr.DoubleMetaphoneFilterFactory, solr.PhoneticFilterFactory
[1] https://lucene.apache.org/solr/guide/8_3/phonetic-matching.html
<analyzer>
<filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true"
languageSet="auto">
</filter>
</analyzer>

Seminars
Common Grams Filter
• Improve Precision
• Useful for phrase queries
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#common-grams-filter
Example:
<analyzer>
<ﬁlter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
In: "the Cat"
Tokenizer to Filter: "the", "Cat"
Out: "the_cat"

Seminars
Solr Text Analysis - Hands On!
• Analysis Screen from Solr Admin
• Let’s explore the schema.xml

Seminars
Indexing
• Using the Solr Cell framework built on Apache Tika  
for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats. 
(Recommended for prototyping and exercise) 
• Uploading XML files by sending HTTP requests to the Solr server  
from any environment where such requests can be generated. 
(Recommended for prototyping and exercise) 
• Writing a custom Java application to ingest data through Solr’s Java Client API  
(which is described in more detail in Client APIs).  
Using the Java API may be the best choice if you’re working with an application,  
such as a Content Management System (CMS), that offers a Java API.

Seminars
Indexing
• Indexing is the procedure of building an index from the documents in input
• Transaction Log (Rotating on hard commits)
• Index built in memory
• Soft commits(visibility)
• Hard commits(durability)
• openSearcher=true(visibility)
• Auto commit
• Merge policy

Seminars
Lucene Score
In order to measure the relevancy of a given result, Solr(Lucene) assigns it a “score”
The formula behind the score computation is behind the scope of this course, however important things tha
contribute to that formula are:
• Term Frequency (TF): how many times a given term occurs within a single document
• Document Frequency (DF): how many documents in the dataset contain a given term
• TF/IDF: the ratio between the term frequency and the inverse document frequency (1/DF)
• Field length: how many terms compose a field
• Boosting: functions or in general things that boost the score computed for a given match. Boosting  
can be applied at index time (deprecated now) or a query time
Score values cannot be compared across queries, or even with the same query but with a different index.

Seminars
! Origin from Probabilistic Information Retrieval
! Default Similarity from Lucene 6.0 [1]
! 25th iteration in improving TF-IDF
! TF
! IDF
! Document(Field) Length
! Configuration parameters
[1] LUCENE-6789
BM25 Term Scorer

Seminars
BM25 Term Scorer - Inverse Document Frequency
IDF Score 
has very similar behavior

Seminars
BM25 Term Scorer - Term Frequency
TF Score 
approaches 
asymptotically (k+1) 
 
k=1.2 in this example

Seminars
BM25 Term Scorer - Document Length
Document Length / 
Avg Document Length 
 
affects how fast we
saturate TF score

Seminars
Basic Search
The list is not exhaustive and is not statically defined, because it depends on the query parser
Some parameter (i.e. filters) accepts more than one value:

Seminars
Queries
Query
• Regulated by Query Parsers
• Calculates scores
• Cached with results order preserved
Filter Query
• Regulated by Query Parsers
• Does not calculate scores
• Cached independently
• Reusable
q=field:value fq=field:value

Seminars
Query Parsers
• Main responsibility of the query parser is understand the
input query syntax and build a Lucene query
• This is the first component involved in the query
execution chain
• If it is not specified, then a default parser is used (Solr
Standard Query Parser)
• Solr comes with several available and ready-to-use query
parsers
• The query parameter “defType” defines the query parser
that will be used in a request

Seminars
Standard Query Parser
Parameter Description
q Defines a query using standard query syntax. This parameter is mandatory.
q.op Specifies the default operator for query expressions, overriding the default
operator specified in the Schema. Possible values are "AND" or "OR".
df Specifies a default field, overriding the definition of a default field in the
Schema.
sow Split on whitespace: if set to false, whitespace-separated term sequences will
be provided to text analysis in one shot, enabling proper function of analysis
filters that operate over term sequences, e.g. multi-word synonyms and
shingles. Defaults to true: text analysis is invoked separately for each
individual whitespace-separated term.

Seminars
Standard Query Parser
• Phrase Search 
q=title:”a tale of two cities” 
• Wildcard Search 
q=title:c?ti* 
• Fuzzy Search 
q=title:cties~1
• Proximity Search  
q=title:"tale cities"~2
• Range Search  
downloads:[1000 TO 2000], author:{Ada TO Carmen}
• Boosted Search 
q=tale of two cities^100 bunny
• Constant Score Search 
AND subjects:(war stories)^=4
• Boolean Search 
(ﬁeld1:term1) AND (ﬁeld2:term1)

Seminars
Date Queries
Queries against ﬁelds using the TrieDateField type (typically range queries) should use the appropriate date syntax [1]:
• timestamp:[* TO NOW]
• createdate:[1976-03-06T23:59:59.999Z TO *]
• createdate:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]
• pubdate:[NOW-1YEAR/DAY TO NOW/DAY+1DAY]
• createdate:[1976-03-06T23:59:59.999Z TO 1976-03-06T23:59:59.999Z+1YEAR]
• createdate:[1976-03-06T23:59:59.999Z/YEAR TO 1976-03-06T23:59:59.999Z]
[1] https://en.wikipedia.org/wiki/ISO_8601
Timezone
By default, all date math expressions are evaluated relative to the UTC TimeZone, but the TZ parameter can be
speciﬁed to override this behaviour
N.B. Independently of the locale Solr is executed, only ISO-8601 dates are supported in requests

Seminars
Solr Query Debug - Hands On!
• debug=query: return debug information about the query
only.
• debug=timing: return debug information about how long the
query took to process.
• debug=results: return debug information about the score
results (also known as "explain").

Seminars
Master Thesis:
Click Models to Estimate Relevancy Ratings from
Users Interactions
Main responsibility of the candidate will be to: 
• learn basic concepts of Agile methodologies for software engineering 
• learn details of Search Quality Evaluation 
• grasp the fundamentals of click modelling, implicit and explicit
relevancy feedback 
• design and implement the module in an existing Spring Boot REST
service application  
• benchmark the solution(s) through a careful quality/performance(times/
space) analysis

Seminars
Master Thesis:
Search Quality Evaluation for Continuous
Integration Tools
Main responsibility of the candidate will be to:  
 
• learn basic concepts of Agile methodologies for software engineering
• get familiar with Apache Lucene based search engines (Apache Solr/
Elasticsearch)
• learn details of Search Quality Evaluation
• grasp the fundamentals of Continuous Integration and Continuous Deployment
through well established industry level technologies
• design and implement plugins for Apache Jenkins, Atlassian Bamboo and
JetBrains  
TeamCity

Let's Build an Inverted Index: Introduction to Apache Lucene/Solr

In this document

More Related Content

What's hot

Similar to Let's Build an Inverted Index: Introduction to Apache Lucene/Solr

More from Sease

Recently uploaded

Let's Build an Inverted Index: Introduction to Apache Lucene/Solr