Identification of Relevant Sections in Web Pages Using a
               Machine Learning Approach




                                  Jerrin Shaji George

                                      NIT Calicut


                                  November 8, 2012
Introduction

  There is a massive amount of data available on the internet.
  Extracting only the relevant content has become very important.
  A Machine Learning approach is suitable as it can adapt to the
  rapidly changing dynamics of the internet.




2 of 28
Machine Learning

  The science of getting computers to act without being explicitly
  programmed.
  A method of teaching computers to make and improve predictions
  or behaviors based on some data.
  Machine Learning Algorithms :
          Supervised Machine Learning
          Unsupervised Machine Learning




3 of 28
Supervised Learning

  Machine learning task of inferring a function from labeled training
  data.




           Figure: Supervised Learning Model (courtesy scikit-learn)
4 of 28
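
   A minimal sketch of this idea in Python with scikit-learn (the library credited for the
   figure); the data and the classifier choice here are purely illustrative:

       # Infer a function from labeled training data, then apply it to unseen input.
       from sklearn.neighbors import KNeighborsClassifier

       X_train = [[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]]   # feature vectors
       y_train = [0, 0, 1, 1]                                       # their labels

       clf = KNeighborsClassifier(n_neighbors=1)
       clf.fit(X_train, y_train)            # learn from the labeled examples
       print(clf.predict([[0.95, 0.9]]))    # predict the label of a new point -> [1]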
Supervised Learning

  Example of a classification problem - discrete valued output.




                   Figure: Copyright © Victor Lavrenko

5 of 28
Supervised Learning

  Example of a regression problem - continuous valued output.




                   Figure: Copyright © Victor Lavrenko

6 of 28
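
   A hedged sketch of a regression setting, with made-up data roughly following y = 2x,
   to contrast the continuous output with the discrete output above:

       from sklearn.linear_model import LinearRegression

       X = [[1.0], [2.0], [3.0], [4.0]]   # single input feature
       y = [2.1, 3.9, 6.0, 8.1]           # continuous-valued target

       reg = LinearRegression().fit(X, y)
       print(reg.predict([[5.0]]))        # a continuous prediction, close to 10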
Unsupervised Learning

  The data has no labels. The algorithm tries to find similarities
  between the objects in question.




          Figure: Unsupervised Learning Model (courtesy scikit-learn)
7 of 28
Unsupervised Learning

  Example of a clustering problem




                   Figure: Copyright © Victor Lavrenko
8 of 28
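
   A minimal clustering sketch: no labels are supplied, and k-means discovers the two
   groups on its own (data chosen only for illustration):

       from sklearn.cluster import KMeans

       X = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
       km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
       print(km.labels_)   # e.g. [0 0 1 1]: two groups found without any labels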
Support Vector Machines (SVM)

  A supervised learning model.
  Used for classification and regression analysis.
  The basic SVM:
          A non-probabilistic binary linear classifier.
          Classifies each given input into one of two possible classes, which forms
          the output.




9 of 28
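
   A minimal sketch of the basic SVM as a binary classifier, using scikit-learn's SVC
   with a linear kernel on toy data:

       from sklearn.svm import SVC

       X = [[0, 0], [1, 1], [3, 3], [4, 4]]   # toy feature vectors
       y = [0, 0, 1, 1]                        # the two possible classes

       clf = SVC(kernel="linear")
       clf.fit(X, y)
       print(clf.predict([[2.5, 2.5]]))        # each input is assigned to one class -> [1]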
The SVM Algorithm

   Inputs are formulated as feature vectors.
   The feature vectors are mapped into a feature space by using a
   kernel function.
   A division is computed in the feature space to optimally separate
   the classes of training vectors.




10 of 28
The SVM Algorithm

               φ: the feature-space mapping induced by the kernel function




11 of 28
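
   A small numpy check of the relationship K(x, z) = φ(x) · φ(z) between the kernel
   function and the feature map φ; the degree-2 polynomial kernel and the toy vectors
   are illustrative choices, not taken from the slides:

       import numpy as np

       def phi(x):
           # explicit feature map for the kernel K(x, z) = (x . z)^2
           return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

       x = np.array([1.0, 2.0])
       z = np.array([3.0, 0.5])

       kernel_value = np.dot(x, z) ** 2        # computed entirely in the input space
       mapped_value = np.dot(phi(x), phi(z))   # computed in the mapped feature space

       print(kernel_value, mapped_value)       # both 16.0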
Formal Definition of SVM

   An SVM constructs a hyperplane or set of hyperplanes in a high-
   or infinite-dimensional space.
   It can be used for classification and regression.
   A good separation is achieved by the hyperplane that has the
   largest distance to the nearest training data point of any class
   (called the functional margin).




12 of 28
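
   For reference, the standard primal formulation behind this maximum-margin criterion
   (not spelled out on the slide, stated here in the usual textbook form) for linearly
   separable pairs (x_i, y_i) with y_i ∈ {−1, +1} is:

       \min_{\mathbf{w},\,b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
       \quad \text{subject to} \quad
       y_i\,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \qquad i = 1, \dots, n

   Maximizing the margin 2 / \lVert \mathbf{w} \rVert is equivalent to minimizing
   \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}, which is why the hyperplane with the largest
   distance to the nearest training points is the one selected.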
Optimal Separating Hyperplane




                 Figure: Courtesy Steve Gunn

13 of 28
Functional Margin

   The vectors (points) that constrain the width of the margin are the
   support vectors.




14 of 28
                       Figure: Image from scikit-learn
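
   In scikit-learn (the library credited for the figure) the fitted model exposes these
   points directly; a short sketch with illustrative data:

       from sklearn.svm import SVC

       X = [[0, 0], [1, 1], [3, 3], [4, 4]]
       y = [0, 0, 1, 1]

       clf = SVC(kernel="linear").fit(X, y)
       print(clf.support_vectors_)   # the points that constrain the width of the margin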
Mapping to Higher Dimensions

   Sometimes data is not linearly separable.
   If the original finite-dimensional space is mapped into a much
   higher-dimensional space, the separation is made easier in that
   space.
   This is achieved by the SVM using the Kernel Trick.




15 of 28
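
   A sketch of the effect on data that is not linearly separable (two concentric
   circles); the dataset helper and kernel settings are illustrative choices, not taken
   from the slides:

       from sklearn.datasets import make_circles
       from sklearn.svm import SVC

       X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

       linear_clf = SVC(kernel="linear").fit(X, y)   # struggles: no separating line exists
       rbf_clf = SVC(kernel="rbf").fit(X, y)         # Kernel Trick: implicit higher-dim mapping

       print(linear_clf.score(X, y))   # close to chance
       print(rbf_clf.score(X, y))      # close to 1.0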
Mapping to Higher Dimensions

   Mapping from 1D to 2D




   Mapping from 2D to 3D




16 of 28
                     Figure: Courtesy Steve Gunn
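
   A numpy sketch of the 1D-to-2D case: the mapping x -> (x, x^2) is a common textbook
   choice, used here only for illustration:

       import numpy as np

       x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
       y = np.array([1, 1, 0, 0, 1, 1])   # inner vs outer points: not separable on a line

       X_mapped = np.column_stack([x, x ** 2])   # map 1D input to 2D
       print(X_mapped)
       # In the (x, x^2) plane the two classes are separated by the line x^2 = 2.5.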
Identification of Relevant Sections in a Web Page for
Web Search

   Shallow techniques like keyword matching give unsatisfactory
   results.
   Search methodologies must focus more on contextual information
   than just keyword occurrences.
           The search term might not be a very differentiating term.
           It might not appear in the section at all.

   SQUINT: an SVM-based approach to identify sections of a Web
   page relevant to a Web search.



17 of 28
Overall Architecture




18 of 28
Feature Generation

   Word Rank Based Features
   Bigram Rank Based Features
   Coverage of Top Ranked Tokens
   Query Word Frequency
   Distance from the Query




19 of 28
Word Rank Based Features

   The rank of a word is defined to be its position in the list obtained by
   ordering the words by frequency of occurrence across all search
   results.
   The value of this feature is the frequency of the particular word in
   the given section.
   Bucketing can be used to reduce dimensionality.




20 of 28
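
   An illustrative sketch of how such a feature could be computed; the exact
   tokenisation and bucket sizes used by SQUINT are not given here, so those details
   are assumptions:

       from collections import Counter

       def word_ranks(all_results_text):
           # Rank words by frequency of occurrence across all search results.
           counts = Counter(all_results_text.lower().split())
           ordered = [w for w, _ in counts.most_common()]
           return {w: rank for rank, w in enumerate(ordered)}   # 0 = most frequent

       def word_rank_features(section_text, ranks, n_buckets=10, bucket_size=50):
           # Bucketing reduces dimensionality: ranks 0-49 share bucket 0, 50-99 bucket 1, ...
           features = [0] * n_buckets
           for word in section_text.lower().split():
               rank = ranks.get(word)
               if rank is not None and rank < n_buckets * bucket_size:
                   features[rank // bucket_size] += 1   # frequency of the word in this section
           return features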
Bigram Rank Based Features

   A bigram is defined to be two consecutive words occurring in a
   section.
   E.g., the bigram "machine learning" may be more important than "machine"
   and "learning" taken separately.
   The value of the feature is calculated in the same way as for the Word Rank
   Based Features.




21 of 28
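
   A small sketch of bigram extraction; ranking and bucketing would then proceed
   exactly as for single words:

       def bigrams(text):
           words = text.lower().split()
           return list(zip(words, words[1:]))

       print(bigrams("machine learning is fun"))
       # [('machine', 'learning'), ('learning', 'is'), ('is', 'fun')]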
Coverage of Top Ranked Tokens

   Relevance may also be determined by the number of top ranked
   words which occur in the section.
   The value of this feature is the coverage of top ranked words per
   bucket.




22 of 28
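
   An illustrative sketch, reusing the word_ranks mapping from the earlier sketch; the
   bucket boundaries are assumptions:

       def coverage_features(section_text, ranks, n_buckets=10, bucket_size=50):
           section_words = set(section_text.lower().split())
           features = []
           for b in range(n_buckets):
               bucket = [w for w, r in ranks.items()
                         if b * bucket_size <= r < (b + 1) * bucket_size]
               covered = sum(1 for w in bucket if w in section_words)
               features.append(covered / len(bucket) if bucket else 0.0)   # coverage per bucket
           return features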
Distance from the Query

   The intuition here is that the closer a section is to the query in the
   Web page, the more likely it is to be relevant.
   The value of this feature is the section-wise distance between the
   section in question and the nearest section which contains the
   query.




23 of 28
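
   A sketch of one way to compute this section-wise distance (function and variable
   names are assumptions):

       def distance_from_query(sections, query, index):
           # Distance from section `index` to the nearest section containing the query.
           hits = [i for i, s in enumerate(sections) if query.lower() in s.lower()]
           if not hits:
               return len(sections)                    # query does not appear on the page
           return min(abs(index - i) for i in hits)    # section-wise distance to nearest hit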
Query Word Frequency

   The value of this feature is the frequency of the query word in the
   section.
   The value is normalized by the number of words in the section.




24 of 28
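
   A sketch of the normalised frequency computation (names are assumptions):

       def query_word_frequency(section_text, query_word):
           words = section_text.lower().split()
           if not words:
               return 0.0
           return words.count(query_word.lower()) / len(words)   # normalised by section length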
Training Set Generation

   Query Google to get a set of pages.
   Clean each page: remove scripts, pictures, links, etc.
   Break each page into sections.
   Label each section of every page.




25 of 28
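
   One possible way to do the cleaning and sectioning steps, sketched with
   BeautifulSoup; SQUINT's actual pipeline is not specified here, so this is only an
   assumed approach:

       from bs4 import BeautifulSoup

       def clean_and_section(html):
           soup = BeautifulSoup(html, "html.parser")
           for tag in soup(["script", "style", "img", "a"]):
               tag.decompose()                          # remove scripts, pictures, links etc.
           text = soup.get_text(separator="\n")
           # Crude sectioning rule: blank-line separated blocks of text become sections.
           return [block.strip() for block in text.split("\n\n") if block.strip()]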
Learning Algorithm

   A Support Vector Machine with a linear kernel is used.
   Given the relatively high dimensionality of the feature vector, it is a
   reasonable choice to use an SVM.
   The predicted margin of each sample is used to get a non-binary
   metric of how relevant each section is.




26 of 28
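
   A sketch of this step; X_train, y_train and X_sections are assumed placeholders for
   the labelled training features and the per-section features of a new page:

       from sklearn.svm import SVC

       clf = SVC(kernel="linear")
       clf.fit(X_train, y_train)                     # labelled section feature vectors

       margins = clf.decision_function(X_sections)   # signed distance to the hyperplane, per section
       ranked = sorted(range(len(margins)), key=lambda i: margins[i], reverse=True)
       print(ranked[0])                              # index of the most relevant section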
Conclusion

   Support Vector Machines are an attractive approach to data
   modelling.
   Evaluations suggest that using information-retrieval-inspired
   features and some basic hints from summarization gives respectable
   accuracy in detecting the most relevant section of a page.
   Thus SQUINT can have a large impact on the user’s overall search
   experience.




27 of 28
References

   Cristianini, Nello and Shawe-Taylor, John. An Introduction to
   Support Vector Machines and Other Kernel-based Learning Methods.
   Cambridge University Press, 2000.
   Siddharth Jonathan J.B., Riku Inoue and Jyotika Prasad. SQUINT
   SVM for Identification of Relevant Sections in Web Pages for Web
   Search.
   Wikipedia article on Support Vector Machines,
   http://en.wikipedia.org/wiki/Support_vector_machine
   Machine Learning Course on Coursera,
   https://class.coursera.org/ml-2012-002/class/index



28 of 28
