Mining the World Wide Web
 The World Wide Web serves as a huge, widely distributed, global
information service center.
 It hosts information services such as news, advertisements,
financial management, education, government, and e-commerce.
 The Web is a rich and dynamic collection of hyperlinked information,
providing rich sources for data mining.
Based on the following observations, the Web also poses great
challenges in knowledge discovery.
 The Web seems to be too huge for effective data
warehousing and data mining.
– the sheer size of the Web and the data storage it would require
 The complexity of Web pages is far greater than that of any
traditional text document collection.
– searching such pages is harder than searching a structured collection
 The Web is a highly dynamic information source.
– information is constantly updated
 The Web serves a broad diversity of user communities
– users differ in their interests, backgrounds, and usage purposes
 Only a small portion of the information on the Web is truly
relevant or useful.
– much of the information may be uninteresting to a given user and may
swamp desired search results.
Index-based Web search engines
Disadvantages:
 Only an experienced user may be able to quickly locate
documents by providing a set of tightly constrained keywords
and phrases.
 A huge number of document entries may be returned, many only
marginally relevant to the topic or containing materials of poor quality.
 Polysemy problem: many documents that are highly relevant to a
topic may not contain the keywords defining it, while a single
keyword may carry several unrelated meanings.
 For example, the keyword Java may refer to the Java programming
language, to an island in Indonesia, or to brewed coffee.
A simple keyword-based Web search engine is not sufficient for
Web resource discovery.
 Compared with keyword-based Web search, Web mining is a more
challenging task
– it searches for Web structures,
– ranks the importance of Web contents, and
– discovers the regularity and dynamics of Web contents.
 Web mining can identify authoritative Web pages, classify Web
documents, and resolve many ambiguities raised in keyword-based
Web search.
Web mining tasks can be classified into three categories:
 Web content mining,
 Web structure mining and
 Web usage mining
Issues related to Web mining:
 Mining the Web page layout structure
 Mining the Web’s link structures
 Mining multimedia data on the Web
 Automatic classification of Web documents
 Weblog mining
Mining the Web Page Layout Structure
• The basic structure of a Web page is its DOM (Document Object
Model) structure.
– The DOM structure of a Web page is a tree, where every HTML tag in
the page corresponds to a node in the DOM tree.
– The Web page can be segmented by some predefined structural tags.
– However, two nodes that share a parent in the DOM tree are not
necessarily more semantically related to each other than they are to
other nodes.
– The DOM tree was initially introduced for presentation in the browser,
not for describing the semantic structure of a page.
• The DOM tree structure therefore often fails to correctly identify the
semantic relationships between the different parts of a page.
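To make the tag-to-node mapping concrete, here is a minimal Python sketch, using only the standard library's html.parser, that builds and prints a simple DOM-like tag tree; the page string is an invented example, not from the text:

    from html.parser import HTMLParser

    class DOMTreeBuilder(HTMLParser):
        """Builds a simple tag tree: each node is (tag, children)."""
        def __init__(self):
            super().__init__()
            self.root = ("document", [])
            self.stack = [self.root]

        def handle_starttag(self, tag, attrs):
            node = (tag, [])
            self.stack[-1][1].append(node)   # attach to the current parent
            self.stack.append(node)          # descend into the new node

        def handle_endtag(self, tag):
            if len(self.stack) > 1:
                self.stack.pop()             # climb back to the parent

    def print_tree(node, depth=0):
        print("  " * depth + node[0])
        for child in node[1]:
            print_tree(child, depth + 1)

    page = ("<html><body><table><tr><td>news</td><td>ads</td></tr></table>"
            "<p>article text</p></body></html>")
    builder = DOMTreeBuilder()
    builder.feed(page)
    print_tree(builder.root)

Walking the printed tree makes the limitation above visible: the two td cells share a parent, yet "news" and "ads" need not be semantically related.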
 Users always expect that certain functional parts of a Web page
(e.g., navigational links or an advertisement bar) appear at
certain positions on the page.
 When a Web page is presented to the user, the spatial and visual
cues can help the user to divide the Web page into several
semantic parts.
 VIPS (VIsion-based Page Segmentation) is an algorithm that extracts
the Web page content structure based on spatial and visual information.
– VIPS aims to extract the semantic structure of a Web page based on its
visual presentation.
 Semantic structure is a tree structure: each node in the tree corresponds
to a block.
 Each node is assigned a value (Degree of Coherence) to indicate
how coherent the content in the block is, based on visual perception.
 VIPS first extracts all of the suitable blocks from the HTML DOM tree, and
then it finds the separators between these blocks.
 Here, separators denote the horizontal or vertical lines in a Web page that
visually cross no blocks.
 Based on these separators, the semantic tree of the Web page is
constructed.
 Compared with DOM-based methods, the segments obtained by VIPS are
more semantically aggregated.
 Contents with different topics are distinguished as separate blocks.
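The following toy Python sketch illustrates only the separator-detection idea from the bullets above, assuming the visual blocks have already been rendered and reduced to vertical pixel extents (top, bottom); the coordinates are invented, and the rest of VIPS (block extraction, Degree of Coherence, recursion) is omitted:

    def horizontal_separators(blocks, page_height):
        """Return the (start, end) y-intervals crossed by no block."""
        separators, y = [], 0
        for top, bottom in sorted(blocks):   # blocks = [(top, bottom), ...]
            if top > y:                      # gap before this block
                separators.append((y, top))
            y = max(y, bottom)
        if y < page_height:                  # gap below the last block
            separators.append((y, page_height))
        return separators

    # Three visual blocks: a header bar, the main content, a footer.
    blocks = [(0, 60), (80, 700), (720, 780)]
    print(horizontal_separators(blocks, 800))
    # -> [(60, 80), (700, 720), (780, 800)]

Splitting the page at the returned gaps, and recursing within each group of blocks, yields the kind of semantic tree VIPS builds.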
Mining the Web’s Link Structures to Identify
Authoritative Web Pages
• How can a search engine automatically identify authoritative
Web pages for a given topic?
– The Web consists not only of pages, but also of hyperlinks pointing
from one page to another.
– When an author of a Web page creates a hyperlink pointing to
another Web page, this can be considered as the author’s
endorsement of the other page.
– The collective endorsement of a given page by different authors on
the Web may indicate the importance of the page and may naturally
lead to the discovery of authoritative Web pages.
– The Web linkage information provides rich information about the
relevance, the quality, and the structure of the Web’s contents, and
thus is a rich source for Web mining.
 A hub is one or a set of Web pages that provide collections
of links to authorities.
 Hub pages may not be prominent themselves, and few links may
point to them
– they could be lists of recommended links on individual home pages, such
as recommended reference sites on a course home page.
 Hub pages play the role of implicitly conferring authority on
pages covering a focused topic.
A good hub is a page that points to many good authorities.
A good authority is a page pointed to by many good hubs.
 The relationship between hubs and authorities
helps the mining of authoritative Web pages and automated discovery
of high-quality Web structures and resources.
How can we use hub pages to find authoritative
pages?
 HITS (Hyperlink-Induced Topic Search) uses the query terms to collect a
starting set of, say, 200 pages from an index-based search engine.
 These pages form the root set.
 Many of these pages are presumably relevant to the search topic
and contain links to most of the prominent authorities.
 The root set is then expanded into a base set by adding pages that the
root-set pages link to and pages that link to root-set pages, up to a
designated size cutoff (a sketch of this expansion follows).
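As a hedged sketch of the expansion step, assuming the link graph is already available as out_links/in_links dictionaries (hypothetical inputs; a real system would fetch and parse the pages):

    def expand_to_base_set(root_set, out_links, in_links, cutoff=1000):
        """Grow the root set with pages it links to and pages linking into it."""
        base = set(root_set)
        for page in root_set:
            base.update(out_links.get(page, []))  # pages the root set points to
            base.update(in_links.get(page, []))   # pages pointing into the root set
            if len(base) >= cutoff:
                break                             # designated size cutoff
        return base

    out_links = {"p1": ["p2", "p3"]}
    in_links = {"p1": ["p4"]}
    print(expand_to_base_set({"p1"}, out_links, in_links))
    # -> {'p1', 'p2', 'p3', 'p4'}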
 A weight-propagation phase is then initiated over the base set.
 This iterative process determines numerical estimates of the hub
and authority weights.
 Links between two pages within the same Web domain often serve a
navigation function and thus do not confer authority.
 Such links are excluded from the weight-propagation analysis.
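A small sketch of this filtering step, comparing URL hosts with Python's standard urllib.parse (treating a shared host as "same Web domain" is a simplification, and the URLs are made up):

    from urllib.parse import urlparse

    def cross_domain_links(links):
        """Keep only (source, target) pairs whose hosts differ."""
        return [(s, t) for s, t in links
                if urlparse(s).netloc != urlparse(t).netloc]

    links = [
        ("http://cs.example.edu/course", "http://cs.example.edu/syllabus"),
        ("http://cs.example.edu/course", "http://java.example.com/docs"),
    ]
    print(cross_domain_links(links))   # only the cross-domain pair survives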
 We first associate a non-negative authority weight a_p and a non-negative
hub weight h_p with each page p in the base set, and initialize all a and h
values to a uniform constant.
 The weights are normalized so that an invariant is maintained:
the squares of all the weights sum to 1.
The authority and hub weights are updated based on the following
equations:

    a_p = Σ_{q → p} h_q   (sum over all pages q that point to p)
    h_p = Σ_{p → q} a_q   (sum over all pages q that p points to)

 The equation for a_p implies that if a page is pointed to by many good
hubs, its authority weight should increase: it is the sum of the current
hub weights of all of the pages pointing to it.
 The equation for h_p implies that if a page points to many good
authorities, its hub weight should increase: it is the sum of the current
authority weights of all of the pages it points to.
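Putting the pieces together, here is a compact, runnable Python sketch of the weight-propagation loop exactly as described: a_p sums the hub weights of in-neighbors, h_p sums the authority weights of out-neighbors, and both vectors are renormalized each iteration so the squares of the weights sum to 1. The four-page graph is invented for illustration.

    import math

    def hits(pages, links, iterations=20):
        a = {p: 1.0 for p in pages}          # authority weights
        h = {p: 1.0 for p in pages}          # hub weights
        for _ in range(iterations):
            # a_p: sum of hub weights of pages pointing to p
            a = {p: sum(h[q] for q, r in links if r == p) for p in pages}
            # h_p: sum of authority weights of pages p points to
            h = {p: sum(a[r] for q, r in links if q == p) for p in pages}
            for w in (a, h):                 # normalize: squares sum to 1
                norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
                for p in w:
                    w[p] /= norm
        return a, h

    pages = ["hub1", "hub2", "auth1", "auth2"]
    links = [("hub1", "auth1"), ("hub1", "auth2"),
             ("hub2", "auth1"), ("auth1", "auth2")]
    a, h = hits(pages, links)
    print(sorted(pages, key=a.get, reverse=True))  # pages ranked by authority

On this toy graph, auth1 ranks highest on authority because both hubs point to it, matching the mutual-reinforcement intuition above.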
