Module 6:
Web Mining
- Pradnya Bhangale
Content
• Introduction
• Web Content Mining
• Crawlers
• Harvest System
• Virtual Web View
• Personalization
• Web Structure Mining: PageRank, Clever
• Web Usage Mining
Introduction: Web Mining
• Application of data mining techniques to find information
patterns in web data such as web documents, web content,
hyperlinks, and server logs
• Web data can include:
• Content of actual web pages
• Intra page structure which includes HTML or XML nodes
for the page
• Inter page structure which is the actual linkage structure
between web pages
• Usage data that describe how web pages are accessed by the
visitors
• User profile data, including user profiles, registration
information, and cookies
• Contents of web data mined may consist of text, structured
data such as lists and tables, and even images, video and
audio
• Goal of web mining:
• Look for patterns in Web data by collecting and analyzing
information in order to gain insight into trends, the industry
and users in general
Web Mining: Applications
• Helps to improve the power of search engines such as Google
and Yahoo by classifying web documents and identifying
relevant web pages
• Used to predict user behavior
• Landing page optimization
• Useful for e-commerce websites and e-services
Web Mining: Techniques
• Web Content Mining: used for mining of useful data,
information and knowledge from web page content
• Web Structure Mining: helps to find useful knowledge or
information pattern from the structure of hyperlinks
• Web Usage Mining: used for mining web log records (access
information of web pages) and helps to discover user access
patterns for web pages
Web Content Mining
• Process of mining useful information from the contents of
the web pages / web documents – text, image, audio, video
etc.
• Based on the input query, web content mining scans and
mines the text and images of web pages and displays the
matching group of pages in the search engine results
• Example: if a user searches for a particular book in a search
engine, the search engine provides a list of suggestions
• There are many techniques to extract such data, like web
scraping
• Scrapy and Octoparse are well-known tools that perform the
web content mining process, as in the sketch below
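As an illustration, here is a minimal web-scraping sketch using only the Python standard library; the target URL is a placeholder, and production tools such as Scrapy add crawling, throttling, and robots.txt handling on top of this basic idea.

```python
# Minimal web content extraction: visible text + hyperlinks from one page.
from html.parser import HTMLParser
from urllib.request import urlopen

class TextAndLinkExtractor(HTMLParser):
    """Collects visible text and hyperlinks from one HTML page."""
    def __init__(self):
        super().__init__()
        self.text_parts, self.links = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":                      # remember hyperlink targets
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():                    # keep non-empty text fragments
            self.text_parts.append(data.strip())

# placeholder URL; any reachable HTML page works
html = urlopen("https://example.com").read().decode("utf-8", errors="replace")
parser = TextAndLinkExtractor()
parser.feed(html)
print(" ".join(parser.text_parts)[:200])    # first 200 chars of page text
print(parser.links[:5])                     # first few extracted links
```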
1. Crawlers
• Traditional search engines use crawlers to search the Web and
gather information, indexing techniques to store information
and query processing to provide fast and accurate information
to users
• A web crawler is a program that acts as an automated script,
browsing the internet in a systematic way
• Primarily programmed for repetitive actions so that browsing
is automated
• Search engines use crawlers most frequently to browse the
internet and build an index
1. Crawlers: Workflow
• Web crawlers are keyword-based: they look at the keywords on
each page, the kind of content it has, and its links, before
returning the information to the search engine. This process
is known as web crawling
1. Crawlers: Components
• Crawler Frontier: stores the list of URLs to visit
• Page Downloader: downloads pages from the World Wide Web
• Web Repository: receives web pages from the crawler and
stores them in a database
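A minimal sketch of how these three components cooperate; the seed URL and the ten-page budget are illustrative assumptions.

```python
# Toy crawler showing frontier, downloader, and repository roles.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

frontier = deque(["https://example.com"])   # Crawler Frontier: URLs to visit
repository = {}                             # Web Repository: url -> page HTML
seen = set(frontier)

while frontier and len(repository) < 10:    # small illustrative page budget
    url = frontier.popleft()
    try:
        html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
    except OSError:
        continue                            # skip unreachable or dead links
    repository[url] = html                  # Page Downloader result stored
    for href in re.findall(r'href="([^"#]+)"', html):   # extract outlinks
        link = urljoin(url, href)
        if link.startswith("http") and link not in seen:
            seen.add(link)
            frontier.append(link)           # enqueue unseen URLs

print(len(repository), "pages stored")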
• These web crawlers go by different names, like bots,
automatic indexers and robots.
• For example, Google's search engine uses crawlers to fetch
pages to Google's servers.
• Some of the popular web crawlers are
• Googlebot
• Scrapy (the Python Scraper)
• Storm crawler
• Elasticsearch River Web, etc.
Web Crawlers: Working
• The spider begins its crawl by going through a seed list of
websites (often sites visited previously)
• When crawlers visit a website, they search for other pages
that are worth visiting
• Web crawlers can discover new sites, note changes to existing
sites, and mark dead links
Google Search: How It Works
• From the trillions of pages on the World Wide Web, web
crawlers crawl through pages to bring back the results users
demand
• Site owners can decide which of their pages they want the
web crawlers to index, and they can block the pages that need
not be indexed.
• The indexing is done by sorting the pages and looking at the
quality of the content and other factors.
• Google then applies algorithms to get a better view of
what you are searching for, and provides a number of features
that make your search more effective, such as:
• Spelling: in case there is an error in the word you typed,
Google comes up with several alternatives to help you get on
track
• Google Instant: instant results as you type
• Search methods: different options for searching other than
just typing out the words, including image and voice search
• Synonyms: tackles similarly worded meanings and produces
results
• Autocomplete: anticipates what you need from what you type
• Query understanding: an in-depth understanding of what
you type
Types of Crawlers
• Periodic Crawler: a traditional crawler that refreshes its
collection by periodically replacing old documents with newly
downloaded ones; each time it is activated, it replaces the
existing index
• Incremental Crawler: incrementally refreshes the existing
collection of pages by visiting them frequently, and updates
the index incrementally instead of replacing it
• Focused Crawler: tries to download only web pages that are
related to each other, i.e. it visits pages related to a topic
of interest (see the sketch after this list); also known as a
topic crawler
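A minimal sketch of the relevance test that distinguishes a focused crawler; the topic keywords and threshold are illustrative.

```python
# Focused-crawler relevance filter: only enqueue on-topic pages.
TOPIC_KEYWORDS = {"data", "mining", "web"}

def is_relevant(page_text: str, threshold: int = 2) -> bool:
    """A page is worth visiting if it contains enough topic keywords."""
    words = set(page_text.lower().split())
    return len(TOPIC_KEYWORDS & words) >= threshold

# Inside the crawl loop of the earlier sketch, guard the enqueue step:
#     if is_relevant(html):
#         frontier.append(link)
```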
Web Crawler: Applications
• Price comparison portals use crawlers to search for product
information
• A crawler may collect publicly available e-mail or postal
addresses of companies for targeted advertising
• Web analysis tools use crawlers to collect data for page views,
or incoming or outbound links.
• Crawlers serve to provide information hubs with data, for
example, news sites.
2. Harvest Systems
• Data harvesting is a process that extracts and analyzes data
collected from online sources
• It is based on the use of caching, indexing, and crawling
• Harvest is a set of tools that facilitates gathering of
information from diverse sources
• For data harvesting, a website is targeted, and the data from
that site is extracted:
• it might be simple text found on the page or within the
page's code,
• directory information from a retail site,
• or even a series of images and videos
• The Harvest design is centered on the use of gatherers
and brokers
• A gatherer obtains information for indexing from an Internet
service provider
• A broker provides the index and query interface
• Brokers may interface directly with gatherers or may go
through other brokers to reach the gatherers
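A toy sketch of the gatherer/broker division of labor described above; the documents and query are illustrative (the real Harvest system exchanged index records in its SOIF format).

```python
# Toy Harvest-style roles: gatherers index sources, brokers answer queries.
class Gatherer:
    """Collects documents from a source and produces index records."""
    def __init__(self, source_docs):
        self.source_docs = source_docs          # {doc_id: text}

    def gather(self):
        index = {}
        for doc_id, text in self.source_docs.items():
            for word in set(text.lower().split()):
                index.setdefault(word, set()).add(doc_id)
        return index

class Broker:
    """Provides the index and query interface; may also ask other brokers."""
    def __init__(self, gatherers, peer_brokers=()):
        self.index = {}
        for g in gatherers:                     # interface directly with gatherers
            for word, ids in g.gather().items():
                self.index.setdefault(word, set()).update(ids)
        self.peers = list(peer_brokers)         # ...or go through other brokers

    def query(self, word):
        hits = set(self.index.get(word.lower(), set()))
        for peer in self.peers:                 # delegate to peer brokers too
            hits |= peer.query(word)
        return hits

g = Gatherer({"d1": "web mining basics", "d2": "harvest broker design"})
print(Broker([g]).query("web"))                 # -> {'d1'}
```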
3. Virtual Web View
• A Virtual Web View (VWV) presents the Web as a multiple
layered database (MLDB), with higher layers storing
increasingly generalized information
• A web data mining query language (WebML) provides data
mining operations on the MLDB
4. Personalization
• Web personalization is the process of customizing a web site
to the needs of each specific user or set of users, for example:
• Provision of recommendations to the users
• Highlighting/adding links
• Creation of index pages, etc.
• Web personalization systems are mainly based on exploiting
the navigational patterns of the website's visitors
• Providing information related to the user's current page is
the core of web personalization
• For example, in e-commerce, the key information required for
suggesting similar web pages comes from:
• knowledge of other users who have also visited the current
page,
• as well as the web page content, the structure of the web
page, and the user's profile information
• All these help in creating a focused and personalized web
browsing experience for the user.
4. Web Personalization Phases
• The web personalization process can be divided into four
phases:
1. Data collection
2. Pre-processing of web data
3. Analysis of web data
4. Decision making or recommendation.
Phase 1: Data Collection
• Data collection is the process of gathering information,
either explicitly or implicitly, specific to each visitor,
recording their interests and behavior while they browse a
web site
• Implicit data: activities completed in the past and recorded
in web server logs
• Explicit data: information submitted by the user at
registration time or in response to rating questionnaires
• Web data in the form of content, structure, semantics, usage,
and user profiles may be collected
Phase 2: Preprocessing of Data
• Log data collected from a web server are text files with one
row per HTTP transaction
• These data need to be cleansed before being used for
analysis
• Preprocessing filters out information that is irrelevant to
the goal of the analysis, as in the sketch below
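A minimal cleansing sketch, assuming Apache-style Common Log Format rows; it keeps only successful page requests and drops asset noise.

```python
# Parse Common Log Format lines and filter out irrelevant transactions.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) \S+'
)

def clean_log(lines):
    for line in lines:
        m = LOG_PATTERN.match(line)
        if not m:
            continue                                  # drop malformed rows
        if m["status"] != "200":
            continue                                  # drop failed requests
        if re.search(r'\.(css|js|png|jpg|gif|ico)$', m["url"]):
            continue                                  # drop page-asset noise
        yield m["ip"], m["time"], m["url"]

sample = ['10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512']
print(list(clean_log(sample)))
```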
Phase 3: Data Analysis/ Mining
• Specific data mining techniques suited to web data are
applied to the pre-processed data to discover interesting
usage patterns
• This phase classifies the content of a web site into semantic
categories in order to make information retrieval and
presentation easier for the user
Phase 4: Recommendation
• This last phase delivers recommendations to users, e.g. by
highlighting existing hyperlinks, dynamically inserting new
hyperlinks that seem to be of interest to the current user
into the last web page requested, or even creating new index
pages
Types of Personalization
• There are three approaches for generating a personalized web
experience for a user:
• Content-based filtering
• Collaborative filtering, realized through either:
• Model-based techniques
• Memory-based techniques
1. Content-Based Filtering
• Recommendation generation is based on analyzing the items
previously rated by a user and generating a profile for that
user from the content descriptions of these items
• Several early recommender systems were based on content-based
filtering, including Personal Web Watcher, Info Finder,
Newsreaders, Letizia, and Syskill and Webert
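A minimal content-based filtering sketch: a user profile is built from the term vectors of liked item descriptions, and unseen items are ranked by cosine similarity. The catalog and liked set are illustrative.

```python
# Content-based filtering with term-frequency vectors and cosine similarity.
from collections import Counter
from math import sqrt

items = {
    "book1": "data mining concepts and techniques",
    "book2": "web mining with crawlers and pagerank",
    "book3": "cooking recipes for beginners",
}
liked = ["book1"]

def vec(text):
    return Counter(text.lower().split())    # term-frequency vector

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# user profile = combined term vector of all liked item descriptions
profile = Counter()
for item in liked:
    profile.update(vec(items[item]))

candidates = [i for i in items if i not in liked]
ranked = sorted(candidates, key=lambda i: cosine(profile, vec(items[i])), reverse=True)
print(ranked)   # book2 ranks above book3: it shares the term "mining"
```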
2. Collaborative Filtering
• The basic idea, as presented by Goldberg et al., was that
people collaborate to help each other filter by recording
their reactions to e-mails in the form of annotations
• Users provide feedback on the items that they consume, in the
form of ratings
• To recommend items to the active user, previous feedback is
used to find other like-minded users
• Items that have been consumed by compatible users but not by
the current user are candidates for recommendation
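A minimal memory-based collaborative filtering sketch: like-minded users are found via cosine similarity over ratings, and items they rated that the active user has not seen become candidates. The ratings are illustrative.

```python
# User-based collaborative filtering on a toy ratings matrix.
from math import sqrt

ratings = {                       # user -> {item: rating}
    "alice": {"a": 5, "b": 4, "c": 1},
    "bob":   {"a": 5, "b": 5, "d": 4},
    "carol": {"c": 5, "d": 2},
}

def sim(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = sqrt(sum(r * r for r in ratings[u].values()))
    nv = sqrt(sum(r * r for r in ratings[v].values()))
    return dot / (nu * nv)

def recommend(user):
    scores = {}
    for other in ratings:
        if other == user:
            continue
        s = sim(user, other)
        for item, r in ratings[other].items():
            if item not in ratings[user]:        # candidate: unseen item
                scores[item] = scores.get(item, 0.0) + s * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))   # 'd' is suggested via like-minded user bob
```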
3. Model based Techniques
• Model based collaborative techniques use a two-stage process
for recommendation
• The first stage is carried out offline, where user behavioral
data collected during previous interactions is mined and an
explicit model generated for use in future online interactions.
• The second stage is carried out in real-time as a new visitor
begins an interaction with the Web site.
• Data from the current user session is scored using the models
generated offline, and recommendations are generated based on
this scoring, as in the sketch below.
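A toy sketch of the two stages; the sessions are illustrative, and the offline "model" here is a crude overlap-based clustering standing in for a real mining step.

```python
# Two-stage model-based recommendation: offline profiles, online scoring.
past_sessions = [
    {"home", "laptops", "laptop-x"},
    {"home", "laptops", "laptop-y"},
    {"home", "books", "book-z"},
]

# Stage 1 (offline): merge sessions that share >= 2 pages into profiles.
profiles = []
for s in past_sessions:
    for p in profiles:
        if len(p & s) >= 2:
            p |= s
            break
    else:
        profiles.append(set(s))

# Stage 2 (online): score the current partial session against each profile.
current = {"home", "laptops"}
best = max(profiles, key=lambda p: len(p & current))
print(best - current)        # recommend unvisited pages from the best profile
```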
Model based vs. Memory based Techniques
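• Memory-based techniques use the raw ratings database directly
at recommendation time (as in the nearest-neighbour sketch
above): simple and always up to date, but expensive online
• Model-based techniques move that effort offline into a compact
model that is cheap to apply online, at the cost of the model
going stale until it is rebuilt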
Web Structure Mining
• Web structure mining is used for creating a model of
web organization
• Process of analyzing the nodes and connection structure
of a website using graph theory
Web Structure Mining
Why?
• Used to classify web pages
• Helpful to create information such as relationship and
similarity between different websites
• Useful for discovering website types
• Authority sites: provide information about the subject
• Hub sites: point to many authority sites
Algorithms for Web Structure Mining
PageRank algorithm (Google Founders)
• Looks at number of links to a website and importance of
referring links
• Computed before the user enters the query.
HITS algorithm (Hyperlink-Induced Topic Search)
• User receives two lists of pages for a query (authority pages
and hub pages)
• Computations are done after the user enters the query
PageRank Algorithm
• The idea of the algorithm came from academic citation
literature.
• It was developed in 1998 as part of the Google search
engine prototype
• Studies citation relationship of documents within the web.
• Google search engine ranks documents as a function of both
the query terms and the hyperlink structure of the web
Definition of PageRank
• PageRank produces a ranking independent of the user's
query.
• The importance of a web page is determined by the number of
other important web pages pointing to that page and by the
number of outlinks those pages have
Examples of Backlinks
• Page A is an inlink of pages B and C, while pages B and C
are inlinks of page D.
Computing PageRank
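• In one common formulation, the PageRank of a page p with
backlink set B(p) is

PR(p) = (1 - d)/N + d * Σ_{q ∈ B(p)} PR(q) / |out(q)|

where N is the total number of pages, out(q) is the set of
links leaving page q, and d is the damping factor (typically
0.85): the probability that a "random surfer" follows a link
rather than jumping to a random page
• A minimal iterative sketch on a toy graph follows; the graph
matches the backlink example above, with an illustrative link
D -> A added so that no page is a dead end.

```python
# Iterative PageRank on a toy graph.
def pagerank(graph, d=0.85, iterations=50):
    """graph: {page: [pages it links to]}; returns {page: rank}."""
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}              # start uniform
    for _ in range(iterations):
        new = {}
        for p in graph:
            # sum rank contributions from every page q linking to p
            backlink_sum = sum(
                rank[q] / len(graph[q]) for q in graph if p in graph[q]
            )
            new[p] = (1 - d) / n + d * backlink_sum
        rank = new
    return rank

# A -> B, A -> C, B -> D, C -> D; D -> A added (illustrative) to avoid a dead end
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["A"]}
for page, r in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(page, round(r, 3))                        # D ranks highest
```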
HITS Algorithm (Hyperlink-Induced Topic Search): Authorities and Hubs
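• In the standard formulation, each page p gets an authority
score a(p) and a hub score h(p), updated iteratively as

a(p) = Σ_{q → p} h(q)   and   h(p) = Σ_{p → q} a(q)

with both score vectors normalized after every iteration
• A minimal sketch on a toy graph (the link structure is
illustrative):

```python
# Iterative HITS on a toy graph.
from math import sqrt

def hits(graph, iterations=50):
    """graph: {page: [pages it links to]}; returns (authority, hub) scores."""
    auth = {p: 1.0 for p in graph}
    hub = {p: 1.0 for p in graph}
    for _ in range(iterations):
        # authority score: sum of hub scores of pages pointing to p
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in graph}
        # hub score: sum of authority scores of pages p points to
        hub = {p: sum(auth[q] for q in graph[p]) for p in graph}
        # normalize both vectors so scores stay bounded
        na = sqrt(sum(v * v for v in auth.values()))
        nh = sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

graph = {"A": ["B", "C", "D"], "B": ["D"], "C": ["D"], "D": []}
auth, hub = hits(graph)
print("best authority:", max(auth, key=auth.get))   # D: pointed to by A, B, C
print("best hub:", max(hub, key=hub.get))           # A: points to all three
```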
Comparison: PageRank vs. HITS
• Mining technique used: PageRank uses web structure mining;
HITS uses web structure and web content mining
• Working: PageRank computes scores at indexing time and sorts
results by page importance; HITS computes hub and authority
scores of the n most relevant pages on the fly
• Applied on: PageRank is applied to the entire Web; HITS is
applied to a local neighborhood of pages surrounding the
results of a query
• Input parameters: PageRank uses back links; HITS uses back
links, forward links, and content
• Complexity: O(log N) for both
• Limitations: PageRank is query-independent; HITS has an
efficiency problem
• Search engine: PageRank underlies Google; HITS underlies
CLEVER
CLEVER Algorithm
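• CLEVER is an IBM research search-engine prototype built around
the HITS algorithm, extending the basic hub/authority
computation with refinements such as weighting links according
to the text around them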
Web Usage Mining
• Web usage mining is the process of extracting patterns and
information from server logs to gain insight into user
activity, such as:
• where the users are from,
• how many users clicked which items on the site, and
• the types of activities being done on the site
• Web server logs are treated as raw data, from which
meaningful data are extracted and patterns are identified
• For instance, when an e-commerce business wants to increase
its scope, users' web activity is monitored through the
application logs and data mining is applied to it
• Some of the techniques to discover and analyze web usage
patterns are:
Session and visitor analysis
• The analysis of pre-processed data can be performed through
session analysis, which includes the records of visitors,
days, sessions, etc.
• This information can be used to analyze the behavior of visitors
• The report generated after analysis contains details of
frequently visited web pages and common entry and exit points
OLAP (Online Analytical Processing):
• OLAP performs multidimensional analysis of complex data
• OLAP can be performed on different parts of log-related data
over intervals of time
• OLAP tools can be used to derive important business
intelligence metrics
Web Usage Mining Process
1. Preprocessing:
• Preprocessing consists of converting the usage, content, and
structure information contained in the various available data
sources into the data abstractions necessary for pattern
discovery
• Usage Preprocessing:
• usage preprocessing is the most difficult task in the web
usage mining process, due to the incompleteness of the
available data
• unless a client-side tracking mechanism is used, only the
IP address, agent, and server-side clickstream are
available to identify users and server sessions
• Content Preprocessing:
• consists of converting the text, images, scripts, and
other multimedia into forms that are useful for web usage
mining
• may include content mining such as classification or
clustering
• Structure Preprocessing:
• the structure of a website is created by the hypertext
links between page views
• structure can be preprocessed in the same manner as content
2. Pattern Discovery
• Pattern discovery uses methods and algorithms developed in
several domains like statistics, data mining, machine learning
and pattern recognition
• Statistical Analysis: extract knowledge about visitors by
performing descriptive statistical analysis of the frequency
of page views, viewing time, and length of navigational paths
• Association Rules: discover sets of pages that are accessed
together with a minimum support count (see the sketch after
this list)
• Clustering: two kinds of interesting clusters to mine: usage
clusters and page clusters
• Classification: classify user profiles into different
classes/categories based on their browsing activity
• Sequential Patterns: web marketers can predict future visit
patterns, which helps in placing advertisements aimed at
certain groups of users
• Dependency Modeling: develop a model capable of representing
significant dependencies among the various variables in the
web domain
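A minimal sketch of support-based pattern discovery: counting page pairs that co-occur in user sessions and keeping those above a minimum support count. The sessions and threshold are illustrative.

```python
# Frequent page-pair discovery over user sessions (association-rule style).
from itertools import combinations
from collections import Counter

sessions = [
    {"home", "products", "cart"},
    {"home", "products", "checkout"},
    {"home", "blog"},
]
MIN_SUPPORT = 2

pair_counts = Counter()
for s in sessions:
    for pair in combinations(sorted(s), 2):   # every page pair in the session
        pair_counts[pair] += 1

frequent = {pair: c for pair, c in pair_counts.items() if c >= MIN_SUPPORT}
print(frequent)   # {('home', 'products'): 2} -- pages often accessed together
```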
3. Pattern Analysis:
• Filter out uninteresting rules and patterns from the set
found in the pattern discovery phase
• Load usage data into a data cube to perform OLAP operations
• Visualization techniques, such as graphing patterns or
assigning colors to different values, can highlight overall
patterns
Thank You!!
