Identification of Relevant Sections in Web Pages Using a
               Machine Learning Approach




                                  Jerrin Shaji George

                                      NIT Calicut


                                  November 8, 2012
Introduction

  There is a massive amount of data available on the internet.
  Extracting only the relevant content has become very important.
  A Machine Learning approach is suitable as it can adapt to the
  rapidly changing dynamics of the internet.




2 of 28
Machine Learning

  The science of getting computers to act without being explicitly
  programmed.
  A method of teaching computers to make and improve predictions
  or behaviors based on some data.
  Machine Learning Algorithms :
          Supervised Machine Learning
          Unsupervised Machine Learning




3 of 28
Supervised Learning

  Machine learning task of inferring a function from labeled training
  data.




           Figure: Supervised Learning Model (courtesy scikit-learn)
4 of 28
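
   A minimal sketch of this idea in Python with scikit-learn (the library credited for the
   figure); the data and the classifier choice here are purely illustrative:

       # Infer a function from labeled training data, then apply it to unseen input.
       from sklearn.neighbors import KNeighborsClassifier

       X_train = [[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]]   # feature vectors
       y_train = [0, 0, 1, 1]                                       # their labels

       clf = KNeighborsClassifier(n_neighbors=1)
       clf.fit(X_train, y_train)            # learn from the labeled examples
       print(clf.predict([[0.95, 0.9]]))    # predict the label of a new point -> [1]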
Supervised Learning

  Example of a classification problem - discrete valued output.




                   Figure: Copyright © Victor Lavrenko

5 of 28
Supervised Learning

  Example of a regression problem - continuous valued output.




                   Figure: Copyright © Victor Lavrenko

6 of 28
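
   A hedged sketch of a regression setting, with made-up data roughly following y = 2x,
   to contrast the continuous output with the discrete output above:

       from sklearn.linear_model import LinearRegression

       X = [[1.0], [2.0], [3.0], [4.0]]   # single input feature
       y = [2.1, 3.9, 6.0, 8.1]           # continuous-valued target

       reg = LinearRegression().fit(X, y)
       print(reg.predict([[5.0]]))        # a continuous prediction, close to 10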
Unsupervised Learning

  The data has no labels. The algorithm tries to find similarities
  between the objects in question.




          Figure: Unsupervised Learning Model (courtesy scikit-learn)
7 of 28
Unsupervised Learning

  Example of a clustering problem




                   Figure: Copyright © Victor Lavrenko
8 of 28
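
   A minimal clustering sketch: no labels are supplied, and k-means discovers the two
   groups on its own (data chosen only for illustration):

       from sklearn.cluster import KMeans

       X = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
       km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
       print(km.labels_)   # e.g. [0 0 1 1]: two groups found without any labels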
Support Vector Machines (SVM)

  A supervised learning model.
  Used for classification and regression analysis.
  The basic SVM:
          A non-probabilistic binary linear classifier.
          Classifies each given input into one of two possible classes, which forms
          the output.




9 of 28
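
   A minimal sketch of the basic SVM as a binary classifier, using scikit-learn's SVC
   with a linear kernel on toy data:

       from sklearn.svm import SVC

       X = [[0, 0], [1, 1], [3, 3], [4, 4]]   # toy feature vectors
       y = [0, 0, 1, 1]                        # the two possible classes

       clf = SVC(kernel="linear")
       clf.fit(X, y)
       print(clf.predict([[2.5, 2.5]]))        # each input is assigned to one class -> [1]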
The SVM Algorithm

   Inputs are formulated as feature vectors.
   The feature vectors are mapped into a feature space by using a
   kernel function.
   A division is computed in the feature space to optimally separate
   the classes of training vectors.




10 of 28
The SVM Algorithm

               φ: the feature-space mapping induced by the kernel function




11 of 28
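
   A small numpy check of the relationship K(x, z) = φ(x) · φ(z) between the kernel
   function and the feature map φ; the degree-2 polynomial kernel and the toy vectors
   are illustrative choices, not taken from the slides:

       import numpy as np

       def phi(x):
           # explicit feature map for the kernel K(x, z) = (x . z)^2
           return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

       x = np.array([1.0, 2.0])
       z = np.array([3.0, 0.5])

       kernel_value = np.dot(x, z) ** 2        # computed entirely in the input space
       mapped_value = np.dot(phi(x), phi(z))   # computed in the mapped feature space

       print(kernel_value, mapped_value)       # both 16.0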
Formal Definition of SVM

   An SVM constructs a hyperplane or set of hyperplanes in a high-
   or infinite-dimensional space.
   It can be used for classification and regression.
   A good separation is achieved by the hyperplane that has the
   largest distance to the nearest training data point of any class
   (called the functional margin).




12 of 28
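
   For reference, the standard primal formulation behind this maximum-margin criterion
   (not spelled out on the slide, stated here in the usual textbook form) for linearly
   separable pairs (x_i, y_i) with y_i ∈ {−1, +1} is:

       \min_{\mathbf{w},\,b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
       \quad \text{subject to} \quad
       y_i\,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \qquad i = 1, \dots, n

   Maximizing the margin 2 / \lVert \mathbf{w} \rVert is equivalent to minimizing
   \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}, which is why the hyperplane with the largest
   distance to the nearest training points is the one selected.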
Optimal Separating Hyperplane




                 Figure: Courtesy Steve Gunn

13 of 28
Functional Margin

   The vectors (points) that constrain the width of the margin are the
   support vectors.




14 of 28
                       Figure: Image from scikit-learn
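
   In scikit-learn (the library credited for the figure) the fitted model exposes these
   points directly; a short sketch with illustrative data:

       from sklearn.svm import SVC

       X = [[0, 0], [1, 1], [3, 3], [4, 4]]
       y = [0, 0, 1, 1]

       clf = SVC(kernel="linear").fit(X, y)
       print(clf.support_vectors_)   # the points that constrain the width of the margin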
Mapping to Higher Dimensions

   Sometimes data is not linearly separable.
   If the original finite-dimensional space is mapped into a much
   higher-dimensional space, the separation is made easier in that
   space.
   This is achieved by the SVM using the Kernel Trick.




15 of 28
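
   A sketch of the effect on data that is not linearly separable (two concentric
   circles); the dataset helper and kernel settings are illustrative choices, not taken
   from the slides:

       from sklearn.datasets import make_circles
       from sklearn.svm import SVC

       X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

       linear_clf = SVC(kernel="linear").fit(X, y)   # struggles: no separating line exists
       rbf_clf = SVC(kernel="rbf").fit(X, y)         # Kernel Trick: implicit higher-dim mapping

       print(linear_clf.score(X, y))   # close to chance
       print(rbf_clf.score(X, y))      # close to 1.0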
Mapping to Higher Dimensions

   Mapping from 1D to 2D




   Mapping from 2D to 3D




16 of 28
                     Figure: Courtesy Steve Gunn
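
   A numpy sketch of the 1D-to-2D case: the mapping x -> (x, x^2) is a common textbook
   choice, used here only for illustration:

       import numpy as np

       x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
       y = np.array([1, 1, 0, 0, 1, 1])   # inner vs outer points: not separable on a line

       X_mapped = np.column_stack([x, x ** 2])   # map 1D input to 2D
       print(X_mapped)
       # In the (x, x^2) plane the two classes are separated by the line x^2 = 2.5.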
Identification of Relevant Sections in a Web Page for
Web Search

   Shallow techniques like keyword matching give unsatisfactory
   results.
   Search methodologies must focus more on contextual information
   than just keyword occurrences.
           The search term might not be a very differentiating term.
           It might not appear in the section at all.

   SQUINT: an SVM-based approach to identify sections of a Web
   page relevant to a Web search.



17 of 28
Overall Architecture




18 of 28
Feature Generation

   Word Rank Based Features
   Bigram Rank Based Features
   Coverage of Top Ranked Tokens
   Query Word Frequency
   Distance from the Query




19 of 28
Word Rank Based Features

   The rank of a word is defined to be its position in the list obtained by
   ordering the words by frequency of occurrence across all search
   results.
   The value of this feature is the frequency of the particular word in
   the given section.
   Bucketing can be used to reduce dimensionality.




20 of 28
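
   An illustrative sketch of how such a feature could be computed; the exact
   tokenisation and bucket sizes used by SQUINT are not given here, so those details
   are assumptions:

       from collections import Counter

       def word_ranks(all_results_text):
           # Rank words by frequency of occurrence across all search results.
           counts = Counter(all_results_text.lower().split())
           ordered = [w for w, _ in counts.most_common()]
           return {w: rank for rank, w in enumerate(ordered)}   # 0 = most frequent

       def word_rank_features(section_text, ranks, n_buckets=10, bucket_size=50):
           # Bucketing reduces dimensionality: ranks 0-49 share bucket 0, 50-99 bucket 1, ...
           features = [0] * n_buckets
           for word in section_text.lower().split():
               rank = ranks.get(word)
               if rank is not None and rank < n_buckets * bucket_size:
                   features[rank // bucket_size] += 1   # frequency of the word in this section
           return features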
Bigram Rank Based Features

   A bigram is defined to be two consecutive words occurring in a
   section.
   E.g., the bigram "machine learning" may be more important than "machine"
   and "learning" taken separately.
   The value of the feature is calculated in the same way as for the Word Rank
   Based Features.




21 of 28
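
   A small sketch of bigram extraction; ranking and bucketing would then proceed
   exactly as for single words:

       def bigrams(text):
           words = text.lower().split()
           return list(zip(words, words[1:]))

       print(bigrams("machine learning is fun"))
       # [('machine', 'learning'), ('learning', 'is'), ('is', 'fun')]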
Coverage of Top Ranked Tokens

   Relevance may also be determined by the number of top ranked
   words which occur in the section.
   The value of this feature is the coverage of top ranked words per
   bucket.




22 of 28
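
   An illustrative sketch, reusing the word_ranks mapping from the earlier sketch; the
   bucket boundaries are assumptions:

       def coverage_features(section_text, ranks, n_buckets=10, bucket_size=50):
           section_words = set(section_text.lower().split())
           features = []
           for b in range(n_buckets):
               bucket = [w for w, r in ranks.items()
                         if b * bucket_size <= r < (b + 1) * bucket_size]
               covered = sum(1 for w in bucket if w in section_words)
               features.append(covered / len(bucket) if bucket else 0.0)   # coverage per bucket
           return features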
Distance from the Query

   The intuition here is that the closer a section is to the query in the
   Web page, the more likely it is to be relevant.
   The value of this feature is the section-wise distance between the
   section in question and the nearest section which contains the
   query.




23 of 28
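
   A sketch of one way to compute this section-wise distance (function and variable
   names are assumptions):

       def distance_from_query(sections, query, index):
           # Distance from section `index` to the nearest section containing the query.
           hits = [i for i, s in enumerate(sections) if query.lower() in s.lower()]
           if not hits:
               return len(sections)                    # query does not appear on the page
           return min(abs(index - i) for i in hits)    # section-wise distance to nearest hit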
Query Word Frequency

   The value of this feature is the frequency of the query word in the
   section.
   The value is normalized by the number of words in the section.




24 of 28
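
   A sketch of the normalised frequency computation (names are assumptions):

       def query_word_frequency(section_text, query_word):
           words = section_text.lower().split()
           if not words:
               return 0.0
           return words.count(query_word.lower()) / len(words)   # normalised by section length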
Training Set Generation

   Query Google to get a set of pages.
   Clean each page: remove scripts, pictures, links, etc.
   Break each page into sections.
   Label each section of every page.




25 of 28
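
   One possible way to do the cleaning and sectioning steps, sketched with
   BeautifulSoup; SQUINT's actual pipeline is not specified here, so this is only an
   assumed approach:

       from bs4 import BeautifulSoup

       def clean_and_section(html):
           soup = BeautifulSoup(html, "html.parser")
           for tag in soup(["script", "style", "img", "a"]):
               tag.decompose()                          # remove scripts, pictures, links etc.
           text = soup.get_text(separator="\n")
           # Crude sectioning rule: blank-line separated blocks of text become sections.
           return [block.strip() for block in text.split("\n\n") if block.strip()]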
Learning Algorithm

   A Support Vector Machine with a linear kernel is used.
   Given the relatively high dimensionality of the feature vector, it is a
   reasonable choice to use an SVM.
   The predicted margin of each sample is used to get a non-binary
   metric of how relevant each section is.




26 of 28
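
   A sketch of this step; X_train, y_train and X_sections are assumed placeholders for
   the labelled training features and the per-section features of a new page:

       from sklearn.svm import SVC

       clf = SVC(kernel="linear")
       clf.fit(X_train, y_train)                     # labelled section feature vectors

       margins = clf.decision_function(X_sections)   # signed distance to the hyperplane, per section
       ranked = sorted(range(len(margins)), key=lambda i: margins[i], reverse=True)
       print(ranked[0])                              # index of the most relevant section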
Conclusion

   Support Vector Machines are an attractive approach to data
   modelling.
   Evaluations suggest that using information-retrieval-inspired
   features and some basic hints from summarization gives respectable
   accuracy in detecting the most relevant section of a page.
   Thus SQUINT can have a large impact on the user’s overall search
   experience.




27 of 28
References

   Cristianini, Nello and Shawe-Taylor, John. An Introduction to
   Support Vector Machines and Other Kernel-based Learning Methods.
   Cambridge University Press, 2000.
   Siddharth Jonathan J.B., Riku Inoue and Jyotika Prasad. SQUINT
   SVM for Identification of Relevant Sections in Web Pages for Web
   Search.
   Wikipedia article on Support Vector Machines,
   http://en.wikipedia.org/wiki/Support_vector_machine
   Machine Learning Course on Coursera,
   https://class.coursera.org/ml-2012-002/class/index



28 of 28
