WEB INTELLIGENCE
INFORMATION RETRIEVAL
• Information retrieval (IR) is the process of accessing and retrieving relevant
information from a collection of data, typically in the form of text documents or
multimedia content. It involves techniques and methods for effectively searching,
organizing, and presenting information to meet the needs of users.
• Information retrieval (IR) is often referred to simply as searching. Searching isn’t new functionality; nearly every application has some implementation of search, but intelligent searching goes beyond plain old searching.
• Key components of information retrieval include:
• Indexing: Creating an index of the documents in the collection, which involves
analyzing and extracting important keywords, phrases, or metadata to represent the
content of each document.
• Query Processing: Processing user queries to understand their information needs
and matching them to relevant documents in the collection.
• Ranking: Ranking the retrieved documents based on their relevance to the query,
typically using algorithms that consider factors such as keyword frequency,
document popularity, and semantic similarity.
• User Interfaces: Providing user-friendly interfaces for users to interact with the
retrieval system, submit queries, and browse or navigate through the retrieved
results.
• Evaluation: Assessing the performance of the retrieval system using metrics such as precision, recall, and relevance to measure how effectively it retrieves relevant information (a small numeric sketch follows this list).
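For instance, here is a minimal numeric sketch of precision and recall for a single hypothetical query; the counts are purely illustrative.

    public class EvaluationSketch {
        public static void main(String[] args) {
            int retrieved = 10;           // documents the system returned
            int relevantRetrieved = 7;    // of those, how many were actually relevant
            int relevantTotal = 20;       // relevant documents in the whole collection
            double precision = (double) relevantRetrieved / retrieved;      // 7/10 = 0.70
            double recall    = (double) relevantRetrieved / relevantTotal;  // 7/20 = 0.35
            System.out.println("precision = " + precision + ", recall = " + recall);
        }
    }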
Information retrieval is used in various applications and domains,
• web search engines,
• digital libraries,
• enterprise search systems,
• recommendation systems,
• e-commerce platforms.
• It plays a crucial role in enabling users to access and make sense of large volumes
of information efficiently, thereby facilitating decision-making, research, and
knowledge discovery.
IR LIBRARIES
• Experimentation can convince you that the naïve IR solution is full of
problems.
• For example, as soon as you increase the number of documents, or their size, its performance will become unacceptable for most purposes.
• There’s an enormous amount of knowledge about IR, and fairly sophisticated and robust libraries are available that offer scalability and high performance.
• The most successful IR library in the Java programming language is Lucene, a project created by Doug Cutting.
Searching with Lucene
• Lucene can help you solve the IR problem by indexing all your documents and letting you search through them at lightning speeds! Lucene in Action by Otis Gospodnetić and Erik Hatcher, published by Manning, is a must-read, especially if you want to learn how to index data and how to search, sort, filter, and highlight search results.
1. The data that you want to search could be in your database, on the internet, or on any other network that’s accessible to your application. You can collect data from the internet by using a crawler. A number of crawlers are freely available.
2. We’ll use a number of pages that we collected on November 6, 2006, so we can modify them in a controlled fashion and observe the effect of these changes in the results of the algorithms.
3. These pages have been cleaned up and modified to form a small, controlled collection. You can find these pages under the data/ch02/ directory. It’s important to know the content of these documents, so that you can appreciate what the algorithms do and understand how they work.
EXAMPLE
Our documents are (the choice of content was random):
A. Seven documents related to business news.
B. Three documents related to Lance Armstrong’s attempt to run the marathon in New York.
C. Four documents related to U.S. politics and, in particular, the congressional elections (circa 2006).
D. Five documents related to world news; four about Ortega winning the elections in Nicaragua and one about global warming.
4. Lucene can help us analyze, index, and search these and any other document that can be converted into text, so it’s not limited to web pages. The class that we’ll use to quickly read the stored web pages is called FetchAndProcessCrawler.
5. This class can also retrieve data from the internet. Its constructor takes three arguments (a usage sketch follows the list below):
■ The base directory for storing the retrieved data.
■ The depth of the link structure that should be traversed.
■ The maximum number of total documents that should be retrieved.
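The snippet below sketches how these pieces fit together, in the spirit of the book’s listing 2.1 that is run in the BeanShell. The three constructor arguments match the description above; the remaining method names (setDefaultUrls, run, getRootDir, getLuceneDir, search) are assumptions about the book’s companion classes rather than a verified API.

    // Crawl (or load) the stored pages, index them with Lucene, and run a query.
    FetchAndProcessCrawler crawler = new FetchAndProcessCrawler("C:/iWeb2/data/ch02", 5, 200);
    crawler.setDefaultUrls();   // use the prepackaged sample pages (assumed method name)
    crawler.run();

    LuceneIndexBuilder indexBuilder = new LuceneIndexBuilder(crawler.getRootDir());
    indexBuilder.run();

    MySearcher oracle = new MySearcher(indexBuilder.getLuceneDir());
    oracle.search("armstrong", 5);   // top five results for the query "armstrong"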
Reading, indexing, and searching the default
list of web pages
• The crawling and preprocessing stage should take only a few seconds, and when it finishes you should have a new directory under the base directory. In our example, the base directory was C:/iWeb2/data/ch02. The new directory’s name will start with the string crawl- and be followed by the numeric value of the crawl’s timestamp in milliseconds, for example, crawl-1200697910111.
• You can change the content of the documents, or add more documents,
and rerun the preprocessing and indexing of the files in order to observe
the differences in your search results. Figure 2.1 is a snapshot of
executing the code from listing 2.1 in the BeanShell, and it includes the
results of the search for the term “armstrong.”
Understanding the Lucene code
• 1. The LuceneIndexBuilder creates a Lucene index.
• The IndexWriter class is what Lucene uses to create an index. It comes with a large number of constructors, which you can peruse in the Javadocs. The specific constructor that we use in our code takes three arguments (see the sketch after this list):
• ■ The directory where we want to store the index.
• ■ The analyzer that we want to use; we’ll talk about analyzers later in this section.
• ■ A Boolean variable that determines whether we need to override the existing directory.
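The following is a minimal, self-contained sketch of that three-argument IndexWriter constructor, using the Lucene 2.x-era API that the book’s code targets (later Lucene versions changed these signatures). The index path, the sample content, and the field name "content" are illustrative assumptions; the "url" field matches the field used by the search code discussed below.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexingSketch {
        public static void main(String[] args) throws Exception {
            // Directory for the index, the analyzer, and true = overwrite any existing index.
            IndexWriter writer = new IndexWriter("C:/iWeb2/data/ch02/lucene-index",
                                                 new StandardAnalyzer(),
                                                 true);

            // Index one document with a stored URL and analyzed (tokenized) text content.
            Document doc = new Document();
            doc.add(new Field("url", "file:///c:/iWeb2/data/ch02/sample-page.html",
                              Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("content", "Sample text of the web page ...",
                              Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();
        }
    }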
• 2. MySearcher: retrieving search results based on Lucene indexing
• REVIEW (a minimal search sketch follows this list):
• 1. We use an instance of the Lucene IndexSearcher class to open our index for searching.
• 2. We create an instance of the Lucene QueryParser class by providing the name of the field that we query against and the analyzer that must be used for tokenizing the query text.
• 3. We use the parse method of the QueryParser to transform the human-readable query into a Query instance that Lucene can understand.
• 4. We search the index and obtain the results in the form of a Lucene Hits object.
• 5. We loop over the first n results and collect them in the form of our own SearchResult objects. Note that Lucene’s Hits object contains only references to the underlying documents. We use these references to collect the required fields; for example, the call hits.doc(i).get("url") will return the URL that we stored in the index.
• 6. The relevance score for each retrieved document is recorded. This score is a number between 0 and 1.
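A minimal sketch of the steps just listed, again using the Lucene 2.x-era API (IndexSearcher, QueryParser, and the since-removed Hits class); the index path and the "content" field name are assumptions consistent with the indexing sketch above.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class SearchingSketch {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("C:/iWeb2/data/ch02/lucene-index"); // 1. open the index
            QueryParser parser = new QueryParser("content", new StandardAnalyzer());       // 2. field + analyzer
            Query query = parser.parse("armstrong");                                       // 3. parse the query text
            Hits hits = searcher.search(query);                                            // 4. run the search

            int n = Math.min(5, hits.length());
            for (int i = 0; i < n; i++) {                                                  // 5. collect the top n results
                String url = hits.doc(i).get("url");                                       //    stored field lookup
                float score = hits.score(i);                                               // 6. relevance score in [0, 1]
                System.out.println(url + " -> " + score);
            }
            searcher.close();
        }
    }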
Basic stages of search
• ■ Crawling
• ■ Parsing
• ■ Analyzing
• ■ Indexing
• ■ Searching
Improving search results based on link
analysis
• The link analysis algorithm that makes Google special is PageRank. The PageRank algorithm was introduced in 1998, at the seventh international World Wide Web conference (WWW98), by Sergey Brin and Larry Page in a paper titled “The anatomy of a large-scale hypertextual Web search engine.”
• Around the same time, Jon Kleinberg at IBM Almaden had discovered the Hypertext Induced Topic Search (HITS) algorithm. Both algorithms are link analysis models, although HITS didn’t have the degree of commercial success that PageRank did.
• We’ll introduce the basic concepts behind the PageRank algorithm and the mechanics of calculating ranking values. We’ll also examine the so-called teleportation mechanism and the inner workings of the power method, which is at the heart of the PageRank algorithm. Lastly, we’ll demonstrate the combination of index scores and PageRank scores for improving our search results.
• An introduction to PageRank
The key idea of PageRank is to consider hyperlinks from one page to another as recommendations or endorsements. So, the more endorsements a page has, the higher its importance should be.
If web page A has a link to web page B, there’s an arrow pointing from A to B in the directed graph of our pages. Based on this graph, we’ll introduce the hyperlink matrix H and a row vector p (the PageRank vector). Think of a matrix as nothing more than a table (a 2D array) and a vector as a single array in Java. Each row in the matrix H is constructed by counting the number of all the outlinks from page Pi, say N(i), and assigning to column j the value 1/N(i) if there’s an outlink from page Pi to page Pj, or assigning the value 0 otherwise.
[Figure: the directed graph for all our sample web pages that start with the prefix biz; the titles of these articles and their file names are given in the accompanying table.]
• The zeros make H a sparse matrix; most pages link to only a small number of other pages, so most entries are 0.
• All values in the matrix are less than or equal to 1. This turns out to be very important. A small sketch of constructing H is shown below.
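A small, self-contained sketch of how H is built, assuming a hypothetical three-page graph (P1 links to P2 and P3, P2 links to P3, P3 links to P1); it is not taken from the book’s listings.

    public class HyperlinkMatrixSketch {
        public static void main(String[] args) {
            // links[i][j] == true means page Pi has an outlink to page Pj.
            boolean[][] links = {
                {false, true,  true },   // P1 -> P2, P3
                {false, false, true },   // P2 -> P3
                {true,  false, false}    // P3 -> P1
            };
            int n = links.length;
            double[][] H = new double[n][n];
            for (int i = 0; i < n; i++) {
                int outlinks = 0;
                for (boolean hasLink : links[i]) if (hasLink) outlinks++;
                for (int j = 0; j < n; j++) {
                    // 1/N(i) if Pi links to Pj, 0 otherwise; dangling rows stay all zeros.
                    H[i][j] = (links[i][j] && outlinks > 0) ? 1.0 / outlinks : 0.0;
                }
            }
            for (double[] row : H) System.out.println(java.util.Arrays.toString(row));
        }
    }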
Calculating the PageRank vector
• The PageRank algorithm calculates the vector p using the following iterative formula:
• p(k+1) = p(k) * H
• The values of p are the PageRank values for every page in the graph. You start with a set of initial values such as p(0) = 1/n, where n is the number of pages in the graph, and use the formula to obtain p(1), then p(2), and so on, until the difference between two successive PageRank vectors is small enough; that arbitrary smallness is also known as the convergence criterion or threshold. This iterative method is the power method as applied to H. That, in a nutshell, is the PageRank algorithm; a minimal implementation sketch follows.
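A minimal sketch of the power method as just described: start with p(0) = 1/n and repeat p(k+1) = p(k) * H until successive vectors are close enough. It assumes H has already been adjusted into a proper stochastic matrix (see the next section for why that matters); the threshold value and the example matrix are illustrative.

    public class PowerMethodSketch {
        // Returns the PageRank vector for a row-stochastic matrix H.
        static double[] pageRank(double[][] H, double threshold) {
            int n = H.length;
            double[] p = new double[n];
            java.util.Arrays.fill(p, 1.0 / n);            // p(0) = 1/n
            double diff = Double.MAX_VALUE;
            while (diff > threshold) {                    // convergence criterion
                double[] next = new double[n];
                for (int j = 0; j < n; j++)               // next = p * H (row vector times matrix)
                    for (int i = 0; i < n; i++)
                        next[j] += p[i] * H[i][j];
                diff = 0.0;
                for (int i = 0; i < n; i++) diff += Math.abs(next[i] - p[i]);
                p = next;
            }
            return p;
        }

        public static void main(String[] args) {
            // The three-page H from the previous sketch (no dangling nodes, already stochastic).
            double[][] H = {
                {0.0, 0.5, 0.5},
                {0.0, 0.0, 1.0},
                {1.0, 0.0, 0.0}
            };
            System.out.println(java.util.Arrays.toString(pageRank(H, 0.0001)));
        }
    }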
Problems in PageRank
• The first problem is that on the internet there are some pages that don’t point to
any other pages; in our example, such a web page is biz-02 in figure 2.5. We call these
pages of the graph dangling nodes. These nodes are a problem because they trap our
surfer; without outlinks, there’s nowhere to go! They correspond to rows that have
value equal to zero for all their cells in the H matrix. To fix this problem, we introduce
a random jump, which means that once our surfer reaches a dangling node, he may go
to the address bar of his browser and type the URL of any one of the graph’s pages. In
terms of the H matrix, this corresponds to setting all the zeros (of a dangling node
row) equal to 1/n, where n is the number of pages in the graph. Technically, this
correction of the H matrix is referred to as the stochasticity adjustment.
• The second problem is that sometimes our surfer may get bored, or interrupted, and
may jump to another page without following the linked structure of the web pages;
the equivalent of Star Trek’s teleportation beam. To account for these arbitrary jumps,
we introduce a new parameter that, in our code, we call alpha. This parameter
determines the amount of time that our surfer will surf by following the links versus
jumping arbitrarily from one page to another page; this parameter is sometimes
referred to as the damping factor. Technically, this correction of the H matrix is
referred to as the primitivity adjustment.
• For more background, Google’s PageRank and Beyond: The Science of Search Engine Rankings by Amy Langville and Carl Meyer is an excellent reference. So, let’s get into action and get the H matrix by running some code. Listing 2.5 shows how to load just the web pages that belong to the business news and calculate the PageRank that corresponds to them. A simplified sketch of the two adjustments is shown below.
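The sketch below applies the two adjustments just described to a raw hyperlink matrix H; the class name, method name, and the value alpha = 0.85 are illustrative assumptions, not taken from listing 2.5.

    public class GoogleMatrixSketch {
        // Dangling rows (all zeros) become 1/n (stochasticity adjustment); every entry is then
        // blended with a uniform random jump weighted by the damping factor alpha
        // (primitivity adjustment).
        static double[][] adjust(double[][] H, double alpha) {
            int n = H.length;
            double[][] G = new double[n][n];
            for (int i = 0; i < n; i++) {
                double rowSum = 0.0;
                for (double v : H[i]) rowSum += v;
                for (int j = 0; j < n; j++) {
                    double s = (rowSum == 0.0) ? 1.0 / n : H[i][j];  // stochasticity adjustment
                    G[i][j] = alpha * s + (1.0 - alpha) / n;         // primitivity adjustment
                }
            }
            return G;
        }
    }

The resulting matrix can then be fed to the power method sketch from the previous section, for example PowerMethodSketch.pageRank(GoogleMatrixSketch.adjust(H, 0.85), 0.0001).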
• Understanding the power method
• Combining the index scores and the PageRank scores
Improving search results based on user clicks
• Using the NaiveBayes classifier
• Classification relies on reference structures that divide the space of all possible data points into a set of classes (also known as categories or concepts) that are (usually) non-overlapping.
• We use a probabilistic classifier that implements what’s known as the naïve Bayes algorithm; our implementation is provided by the NaiveBayes class. Classifiers are agnostic to UserClicks; they’re only concerned with Concepts, Instances, and Attributes.
• A classifier’s job is to assign a Concept to an Instance; that’s all a classifier does.
In order to know what Concept should be assigned to a particular Instance, a
classifier reads a TrainingSet—a set of Instances that already have a Concept
assigned to them. Upon loading those Instances, the classifier trains itself, or
learns, how to map a Concept to an Instance based on the assignments in the
TrainingSet. The way that each classifier trains depends on the classifier.
• The good thing about the NaiveBayes classifier is that it provides something called the conditional probability of X given Y, a probability that tells us how likely it is to observe event X provided that we’ve already observed event Y. In particular, this classifier uses as input the following:
• ■ The probability of observing concept X, in general, also known as the prior
probability and denoted by p(X).
• ■ The probability of observing instance Y if we randomly select an instance from
concept X, also known as the likelihood and denoted by p(Y|X).
• ■ The probability of observing instance Y in general, also known as the evidence and denoted by p(Y).
• The calculation is performed based on the following formula (known as Bayes’ theorem); a small numeric sketch follows:
• p(X|Y) = p(Y|X) p(X) / p(Y)
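A tiny numeric sketch of the formula above; the probability values are purely illustrative and are not taken from the book’s user-click data.

    public class BayesSketch {
        public static void main(String[] args) {
            double priorX     = 0.30;   // p(X): prior probability of concept X
            double likelihood = 0.60;   // p(Y|X): probability of instance Y given concept X
            double evidenceY  = 0.45;   // p(Y): probability of instance Y in general
            double posterior  = likelihood * priorX / evidenceY;   // p(X|Y) by Bayes' theorem
            System.out.println("p(X|Y) = " + posterior);           // approximately 0.4
        }
    }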
Ranking Word, PDF, and other documents
without links
To introduce ranking in documents without links, we’ll take the HTML documents and create Word documents with identical content. This will allow us to compare our results and identify any similarities or differences in the two approaches. Parsing Word documents can be done easily using the open source library TextMining; note that the project has been renamed tm-extractor (http://code.google.com/p/textmining/source/checkout). We’ve written a class called MSWordDocumentParser that encapsulates the parsing of a Word document.
An introduction to DocRank
We use the same classes to read the Word documents as we did to read the HTML documents (the FetchAndProcessCrawler class), and we use Lucene to index the content of these documents.