Text Analytics in Enterprise Search
         Daniel Ling (Findwise)
What will I cover?
   Intro
   About Text Analytics
   Benefits and possibilities
   Examples
   Solution Techniques to Examples
   Conclusions




My Background
   Daniel Ling
   Findwise
   Enterprise Search and Findability Consultant
   Experience and expertise
      5+ years of enterprise search experience
      20+ enterprise search implementations across a range of industries
      Lucene, FAST ESP, Solr
      Apache Solr is my primary search platform
      Focus areas include Findability, Search Architecture and
       Implementation, Text Analytics, and Document Processing.




About Text Analytics




Text Analytics in the Enterprise
Challenges:
 80% of data in the Enterprise is unstructured.
 Reduce the time spent looking for information (currently 9.6 hours per week)
 Reduce the time spent reading documents / e-mails (currently 14.5 hours per
  week)

Benefits:
 More predictable scale and domain
 Well-understood domain
 Supporting content for analytics can be identified




Text Analytics
The definition


   A set of linguistic, statistical and machine learning techniques
   used to model and structure the information content of textual
   sources.

      - Wikipedia.org




Types of Applications


•   Entity Extraction
•   Document Categorization
•   Sentiment Analysis
•   Summarization




Frameworks and Techniques


Framework                          Techniques

Solr                               Statistics, Linguistics

Mallet, Classifier4j, etc.         Statistical natural language processing

Mahout (Hadoop)                    Machine Learning, Statistics

GATE                               General language processing framework

UIMA                               Content analytics, text mining, pipeline

OpenNLP                            Machine learning toolkit for NLP


Benefits and possibilities




Benefits and possibilities

 Bring some structure to unstructured content
 Enhance discovery and findability of content
   • Works well together with search
 Increase relevance and precision with extracted keywords and
  metadata
 Generate content for dynamic pages / topic pages
   • Selection of documents and extracts from documents
 Track and discover sentiments
 Reduce the time users need to analyze content




Examples




Entity Extraction

 Types of Entities for Extraction:
   • Dates
   • Places
   • Companies
   • Objects (product names, etc.)
   • People
   • Events




Example – Presenting the data




Example – Presenting the data




Example – Facets on the data




Example Solution: Entity Extraction
 Rule-based entity extraction
    Combination of lists and regular expressions
 Works within well-understood domains.
 Requires maintaining the lists.
 List sources: countries from the World Factbook, public companies from
  Google Finance, customers from the CRM.
 Workflow: Document for indexing > Update Request Handler >
  Update Chain (look up and match entities) > Write to index



   Update Chain (processor)                   >   Lucene Index
   (lists | input fields | entity fields)         (entity fields)
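
The lookup-and-match step in the update chain can be sketched outside Solr as a plain function. This is a minimal illustration in Python, not the processor used in the talk: the list contents, the date pattern, the `*_entity` field names and the function names are all assumptions made for the example.

```python
import re

# Hypothetical lookup lists; in practice these would be loaded from the
# maintained resources (country list, company list, CRM export).
ENTITY_LISTS = {
    "country": ["Sweden", "Norway", "United States"],
    "company": ["Findwise", "Google"],
}

# Hypothetical regular expression for ISO-style dates (YYYY-MM-DD).
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def build_list_pattern(names):
    """Compile one case-insensitive pattern matching any name in the list."""
    # Longest names first, so a longer name wins over a shorter overlapping one.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def extract_entities(text):
    """Return a dict of entity field name -> list of unique entities found."""
    entities = {}
    for entity_type, names in ENTITY_LISTS.items():
        canonical = {name.lower(): name for name in names}
        found = []
        for match in build_list_pattern(names).finditer(text):
            name = canonical[match.group(0).lower()]
            if name not in found:
                found.append(name)
        if found:
            entities[entity_type + "_entity"] = found
    dates = []
    for match in DATE_PATTERN.finditer(text):
        if match.group(0) not in dates:
            dates.append(match.group(0))
    if dates:
        entities["date_entity"] = dates
    return entities


def process_document(doc, input_fields=("title", "body")):
    """Copy the document and add entity fields extracted from the input fields."""
    text = " ".join(doc.get(field, "") for field in input_fields)
    enriched = dict(doc)
    enriched.update(extract_entities(text))
    return enriched


doc = {
    "id": "1",
    "title": "Findwise opens office",
    "body": "Findwise expands in Sweden and the united states on 2012-05-01.",
}
enriched = process_document(doc)
```

Keeping the lists as data rather than code matches the maintenance point above: adding a customer or country changes a resource, not the processor.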




Example Solution: Entity Extraction
 Register a custom class that looks up the resources and writes the
  extracted entities to specific Solr fields; set up in solrconfig.xml:




Document Categorization

   Assigns a label to the document / content / data.
   Labels can describe the category or the sentiment.
   A category must reach a threshold value before the label is applied.
   Statistics and “knowledge” from previous examples can be used.
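
The threshold rule can be illustrated with a few lines of Python. This is only a sketch: the function name `assign_label`, the score format (a dict of category to score in the range 0 to 1) and the default threshold of 0.6 are assumptions for the example, not values from the talk.

```python
def assign_label(scores, threshold=0.6, fallback=None):
    """Return the best-scoring category, or `fallback` if it is below the threshold.

    `scores` maps category name -> score in [0, 1], as produced by a classifier.
    """
    if not scores:
        return fallback
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score >= threshold else fallback


confident = assign_label({"sports": 0.8, "finance": 0.2})
uncertain = assign_label({"sports": 0.55, "finance": 0.45})
```

Leaving a document unlabeled when no category is confident enough keeps weak guesses out of the category facet.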




Example – Facets from Categories




Example Solution: Document Categorization


                                               *

 Train the component, Mallet (Machine Learning for Language
  Toolkit).
   • Alternative components include a Lucene (TF-IDF) index
      (MoreLikeThis), OpenNLP, Textcat, Classifier4j.
 Run new documents against the model/index of trained
  documents.
 Train from an interface, ad hoc, or by indexing pre-categorized
  documents.

* Figure from the book Taming Text.
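
The train-then-classify flow can be shown with a tiny multinomial Naive Bayes classifier in plain Python. This is a stand-in chosen for brevity, not Mallet's API and not necessarily the algorithm used in the talk; the training sentences, labels and function names are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Build a model from (label, text) pairs of pre-categorized documents."""
    term_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        term_counts[label].update(tokenize(text))
    vocabulary = {term for counts in term_counts.values() for term in counts}
    return {
        "term_counts": dict(term_counts),
        "doc_counts": doc_counts,
        "vocabulary": vocabulary,
    }


def classify(model, text):
    """Return a dict of label -> normalized probability for a new document."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    log_scores = {}
    for label, counts in model["term_counts"].items():
        total_terms = sum(counts.values())
        score = math.log(model["doc_counts"][label] / total_docs)
        for term in tokenize(text):
            if term in model["vocabulary"]:  # ignore terms never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((counts[term] + 1) / (total_terms + vocab_size))
        log_scores[label] = score
    # Convert log scores to probabilities that sum to 1.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: value / norm for label, value in exp_scores.items()}


training = [
    ("finance", "stock market shares rise"),
    ("finance", "bank reports quarterly profit"),
    ("sports", "team wins football match"),
    ("sports", "player scores late goal"),
]
model = train(training)
probabilities = classify(model, "bank shares and profit")
best_label = max(probabilities, key=probabilities.get)
```

The normalized probabilities are the kind of score a threshold can be applied to before the label is written to the document.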


Example Solution: Document Categorization
 Setting up and training Mallet:




Example Solution: Document Categorization
 Evaluating a new document:




 Setting the evaluated category as a tag on the document in the pipeline:


   Update Chain (processor)     >   Lucene Index
   (input document)                 (category field)




Document Summarization

 Summarizes a document, at index time or on demand.
 Leverages the knowledge and term statistics of the document
  and the index.
 Picks the “most important” sentences based on the statistics and
  displays those.
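
Sentence selection from term statistics can be sketched as follows. This is one simple scoring scheme (average TF-IDF weight per sentence), assumed for illustration rather than taken from the talk; `doc_freq` and `num_docs` stand in for statistics that would come from the index, and all names are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def summarize(text, doc_freq, num_docs, max_sentences=2):
    """Return the highest-scoring sentences, in their original order.

    doc_freq: term -> number of indexed documents containing the term.
    num_docs: total number of documents in the index.
    """
    sentences = split_sentences(text)
    term_freq = Counter(t for s in sentences for t in tokenize(s))

    def score(sentence):
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        # Terms frequent in this document but rare in the index weigh most.
        total = sum(
            term_freq[t] * math.log((num_docs + 1) / (doc_freq.get(t, 0) + 1))
            for t in tokens
        )
        return total / len(tokens)  # average, so long sentences are not favored

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in chosen]


text = (
    "Solr indexes documents. The weather is nice. "
    "Solr extracts entities from documents."
)
doc_freq = {"the": 100, "is": 100, "from": 90, "weather": 50, "nice": 50}
summary = summarize(text, doc_freq, num_docs=100, max_sentences=2)
```

Returning the chosen sentences in document order keeps the summary readable even though they were ranked by score.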




Example – Summarize content


Static Summaries




Dynamic Summaries




Example – Summarize content - 1




Example – Summarize content - 2




Example Solution: Document Summarization
 A custom RequestHandler receives the document ID and the field to
  summarize.
 A custom SearchComponent selects the top sentences.
 The selected subset of sentences is sent back in a field.




   RequestHandler                              >   Lucene Index
   (SearchComponent for summarization)




Wrap Up

• Examples: Entity Extraction, Document Categorization,
  Summarization.
• Technology: You can take small steps and gain a great
  deal, since you can leverage features and
  components of Solr and Lucene (as well as other open
  source NLP frameworks).
• Value: The benefits of text analytics include increased
  discovery, findability and productivity from the
  solution.




Questions?



daniel.ling@findwise.com
www.findabilityblog.com




