Module 6:
Web Mining
- Pradnya Bhangale
Content
• Introduction
• Web Content Mining
• Crawlers
• Harvest System
• Virtual Web View
• Personalization
• Web Structure Mining: PageRank, Clever
• Web Usage Mining
Introduction: Web Mining
• Application of data mining techniques to find information
patterns in web data such as web documents, web content,
hyperlinks, and server logs
• Web data can include:
• Content of actual web pages
• Intra page structure which includes HTML or XML nodes
for the page
• Inter page structure which is the actual linkage structure
between web pages
• Usage data that describe how web pages are accessed by the
visitors
• User profile data, including user profiles, registration
information, and cookies
• Contents of web data mined may consist of text, structured
data such as lists and tables, and even images, video and
audio
• Goal of web mining:
• Look for patterns in Web data by collecting and analyzing
information in order to gain insight into trends, the industry
and users in general
Web Mining: Applications
• Helps to improve the power of search engines such as Google
and Yahoo by classifying web documents and identifying
relevant web pages
• Used to predict user behavior
• Landing page optimization
• Useful for e-commerce websites and e-services
Web Mining: Techniques
• Web Content Mining: used for mining of useful data,
information and knowledge from web page content
• Web Structure Mining: helps to find useful knowledge or
information pattern from the structure of hyperlinks
• Web Usage Mining: used for mining web log records (access
information of web pages) and helps to discover user access
patterns for web pages
Web Content Mining
• Process of mining useful information from the contents of
the web pages / web documents – text, image, audio, video
etc.
• Based on the input query, web content mining scans and
mines the text and images of web pages and displays the
matching group of pages in the search engine results
• Example: if a user searches for a particular book in a search
engine, the search engine provides a list of suggestions
• There are many techniques to extract such data, like web
scraping
• Scrapy and Octoparse are well-known tools that perform the
web content mining process, as in the sketch below
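As an illustration, here is a minimal web-scraping sketch using only the Python standard library; the target URL is a placeholder, and production tools such as Scrapy add crawling, throttling, and robots.txt handling on top of this basic idea.

```python
# Minimal web content extraction: visible text + hyperlinks from one page.
from html.parser import HTMLParser
from urllib.request import urlopen

class TextAndLinkExtractor(HTMLParser):
    """Collects visible text and hyperlinks from one HTML page."""
    def __init__(self):
        super().__init__()
        self.text_parts, self.links = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":                      # remember hyperlink targets
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():                    # keep non-empty text fragments
            self.text_parts.append(data.strip())

# placeholder URL; any reachable HTML page works
html = urlopen("https://example.com").read().decode("utf-8", errors="replace")
parser = TextAndLinkExtractor()
parser.feed(html)
print(" ".join(parser.text_parts)[:200])    # first 200 chars of page text
print(parser.links[:5])                     # first few extracted links
```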
1. Crawlers
• Traditional search engines use crawlers to search the Web and
gather information, indexing techniques to store information
and query processing to provide fast and accurate information
to users
• A web crawler is a program that acts as an automated script,
browsing the internet in a systematic way
• Primarily programmed for repetitive actions so that browsing
is automated
• Search engines use crawlers most frequently to browse the
internet and build an index
1. Crawlers: Workflow
• Web crawlers are keyword-based: they look at the keywords on
each page, the kind of content it has, and its links, before
returning the information to the search engine. This process
is known as web crawling
1. Crawlers: Components
• Crawler Frontier: stores the list of URLs to visit
• Page Downloader: downloads pages from the World Wide Web
• Web Repository: receives web pages from the crawler and
stores them in a database
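A minimal sketch of how these three components cooperate; the seed URL and the ten-page budget are illustrative assumptions.

```python
# Toy crawler showing frontier, downloader, and repository roles.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

frontier = deque(["https://example.com"])   # Crawler Frontier: URLs to visit
repository = {}                             # Web Repository: url -> page HTML
seen = set(frontier)

while frontier and len(repository) < 10:    # small illustrative page budget
    url = frontier.popleft()
    try:
        html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
    except OSError:
        continue                            # skip unreachable or dead links
    repository[url] = html                  # Page Downloader result stored
    for href in re.findall(r'href="([^"#]+)"', html):   # extract outlinks
        link = urljoin(url, href)
        if link.startswith("http") and link not in seen:
            seen.add(link)
            frontier.append(link)           # enqueue unseen URLs

print(len(repository), "pages stored")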
• These web crawlers go by different names, like bots,
automatic indexers and robots.
• For example, Google's search engine uses crawlers to fetch
pages to Google's servers.
• Some of the popular web crawlers are
• Googlebot
• Scrapy (the Python Scraper)
• Storm crawler
• Elasticsearch River Web, etc.
Web Crawlers: Working
• The spider begins its crawl by going through a seed list of
websites (often sites visited previously)
• When crawlers visit a website, they search for other pages
that are worth visiting
• Web crawlers can discover new sites, note changes to existing
sites, and mark dead links
Google Search: How It Works
• From the trillions of pages on the World Wide Web, web
crawlers crawl through pages to bring back the results users
demand
• Site owners can decide which of their pages they want the
web crawlers to index, and they can block the pages that need
not be indexed.
• The indexing is done by sorting the pages and looking at the
quality of the content and other factors.
• Google then applies algorithms to get a better view of
what you are searching for, and provides a number of features
that make your search more effective, such as:
• Spelling: in case there is an error in the word you typed,
Google comes up with several alternatives to help you get on
track
• Google Instant: instant results as you type
• Search methods: different options for searching other than
just typing out the words, including image and voice search
• Synonyms: tackles similarly worded meanings and produces
results
• Autocomplete: anticipates what you need from what you type
• Query understanding: an in-depth understanding of what
you type
Types of Crawlers
• Periodic Crawler: a traditional crawler that refreshes its
collection by periodically replacing old documents with newly
downloaded ones; each time it is activated, it replaces the
existing index
• Incremental Crawler: incrementally refreshes the existing
collection of pages by visiting them frequently, and updates
the index incrementally instead of replacing it
• Focused Crawler: tries to download only web pages that are
related to each other, i.e. it visits pages related to a topic
of interest (see the sketch after this list); also known as a
topic crawler
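A minimal sketch of the relevance test that distinguishes a focused crawler; the topic keywords and threshold are illustrative.

```python
# Focused-crawler relevance filter: only enqueue on-topic pages.
TOPIC_KEYWORDS = {"data", "mining", "web"}

def is_relevant(page_text: str, threshold: int = 2) -> bool:
    """A page is worth visiting if it contains enough topic keywords."""
    words = set(page_text.lower().split())
    return len(TOPIC_KEYWORDS & words) >= threshold

# Inside the crawl loop of the earlier sketch, guard the enqueue step:
#     if is_relevant(html):
#         frontier.append(link)
```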
Web Crawler: Applications
• Price comparison portals use crawlers to search for product
information
• A crawler may collect publicly available e-mail or postal
addresses of companies for targeted advertising
• Web analysis tools use crawlers to collect data for page views,
or incoming or outbound links.
• Crawlers serve to provide information hubs with data, for
example, news sites.
2. Harvest Systems
• Data harvesting is a process that extracts and analyzes data
collected from online sources
• It is based on the use of caching, indexing, and crawling
• Harvest is a set of tools that facilitates gathering of
information from diverse sources
• For data harvesting, a website is targeted, and the data from
that site is extracted:
• it might be simple text found on the page or within the
page's code,
• directory information from a retail site,
• or even a series of images and videos
• The Harvest design is centered on the use of gatherers
and brokers
• A gatherer obtains information for indexing from an Internet
service provider
• A broker provides the index and query interface
• Brokers may interface directly with gatherers or may go
through other brokers to reach the gatherers
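A toy sketch of the gatherer/broker division of labor described above; the documents and query are illustrative (the real Harvest system exchanged index records in its SOIF format).

```python
# Toy Harvest-style roles: gatherers index sources, brokers answer queries.
class Gatherer:
    """Collects documents from a source and produces index records."""
    def __init__(self, source_docs):
        self.source_docs = source_docs          # {doc_id: text}

    def gather(self):
        index = {}
        for doc_id, text in self.source_docs.items():
            for word in set(text.lower().split()):
                index.setdefault(word, set()).add(doc_id)
        return index

class Broker:
    """Provides the index and query interface; may also ask other brokers."""
    def __init__(self, gatherers, peer_brokers=()):
        self.index = {}
        for g in gatherers:                     # interface directly with gatherers
            for word, ids in g.gather().items():
                self.index.setdefault(word, set()).update(ids)
        self.peers = list(peer_brokers)         # ...or go through other brokers

    def query(self, word):
        hits = set(self.index.get(word.lower(), set()))
        for peer in self.peers:                 # delegate to peer brokers too
            hits |= peer.query(word)
        return hits

g = Gatherer({"d1": "web mining basics", "d2": "harvest broker design"})
print(Broker([g]).query("web"))                 # -> {'d1'}
```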
3. Virtual Web View
• A Virtual Web View (VWV) presents the Web as a multiple
layered database (MLDB), with higher layers storing
increasingly generalized information
• A web data mining query language (WebML) provides data
mining operations on the MLDB
4. Personalization
• Web personalization is the process of customizing a web site
to the needs of each specific user or set of users, for example:
• Provision of recommendations to the users
• Highlighting/adding links
• Creation of index pages, etc.
• Web personalization systems are mainly based on exploiting
the navigational patterns of the website's visitors
• Providing information related to the user's current page is
the core of web personalization
• For example, in e-commerce, the key information required for
suggesting similar web pages comes from:
• knowledge of other users who have also visited the current
page,
• as well as the web page content, the structure of the web
page, and the user's profile information
• All these help in creating a focused and personalized web
browsing experience for the user.
4. Web Personalization Phases
• The web personalization process can be divided into four
phases:
1. Data collection
2. Pre-processing of web data
3. Analysis of web data
4. Decision making or recommendation.
Phase 1: Data Collection
• Data collection is the process of gathering information,
either explicitly or implicitly, specific to each visitor,
recording their interests and behavior while they browse a
web site
• Implicit data: activities completed in the past and recorded
in web server logs
• Explicit data: information submitted by the user at
registration time or in response to rating questionnaires
• Web data in the form of content, structure, semantics, usage,
and user profiles may be collected
Phase 2: Preprocessing of Data
• Log data collected from a web server are text files with one
row per HTTP transaction
• These data need to be cleansed before being used for
analysis
• Preprocessing filters out information that is irrelevant to
the goal of the analysis, as in the sketch below
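A minimal cleansing sketch, assuming Apache-style Common Log Format rows; it keeps only successful page requests and drops asset noise.

```python
# Parse Common Log Format lines and filter out irrelevant transactions.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) \S+'
)

def clean_log(lines):
    for line in lines:
        m = LOG_PATTERN.match(line)
        if not m:
            continue                                  # drop malformed rows
        if m["status"] != "200":
            continue                                  # drop failed requests
        if re.search(r'\.(css|js|png|jpg|gif|ico)$', m["url"]):
            continue                                  # drop page-asset noise
        yield m["ip"], m["time"], m["url"]

sample = ['10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512']
print(list(clean_log(sample)))
```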
Phase 3: Data Analysis/ Mining
• Specific data mining techniques suited to web data are
applied to the pre-processed data to discover interesting
usage patterns
• This phase classifies the content of a web site into semantic
categories in order to make information retrieval and
presentation easier for the user
Phase 4: Recommendation
• This last phase delivers recommendations to users, e.g. by
highlighting existing hyperlinks, dynamically inserting new
hyperlinks that seem to be of interest to the current user
into the last web page requested, or even creating new index
pages
Types of Personalization
• There are three approaches for generating a personalized web
experience for a user:
• Content-based filtering
• Collaborative filtering, realized through either:
• Model-based techniques
• Memory-based techniques
1. Content-Based Filtering
• Recommendation generation is based on analyzing the items
previously rated by a user and generating a profile for that
user from the content descriptions of these items
• Several early recommender systems were based on content-based
filtering, including Personal Web Watcher, Info Finder,
Newsreaders, Letizia, and Syskill and Webert
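A minimal content-based filtering sketch: a user profile is built from the term vectors of liked item descriptions, and unseen items are ranked by cosine similarity. The catalog and liked set are illustrative.

```python
# Content-based filtering with term-frequency vectors and cosine similarity.
from collections import Counter
from math import sqrt

items = {
    "book1": "data mining concepts and techniques",
    "book2": "web mining with crawlers and pagerank",
    "book3": "cooking recipes for beginners",
}
liked = ["book1"]

def vec(text):
    return Counter(text.lower().split())    # term-frequency vector

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# user profile = combined term vector of all liked item descriptions
profile = Counter()
for item in liked:
    profile.update(vec(items[item]))

candidates = [i for i in items if i not in liked]
ranked = sorted(candidates, key=lambda i: cosine(profile, vec(items[i])), reverse=True)
print(ranked)   # book2 ranks above book3: it shares the term "mining"
```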
2. Collaborative Filtering
• The basic idea, as presented by Goldberg et al., was that
people collaborate to help each other filter by recording
their reactions to e-mails in the form of annotations
• Users provide feedback on the items that they consume, in the
form of ratings
• To recommend items to the active user, previous feedback is
used to find other like-minded users
• Items that have been consumed by compatible users but not by
the current user are candidates for recommendation
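A minimal memory-based collaborative filtering sketch: like-minded users are found via cosine similarity over ratings, and items they rated that the active user has not seen become candidates. The ratings are illustrative.

```python
# User-based collaborative filtering on a toy ratings matrix.
from math import sqrt

ratings = {                       # user -> {item: rating}
    "alice": {"a": 5, "b": 4, "c": 1},
    "bob":   {"a": 5, "b": 5, "d": 4},
    "carol": {"c": 5, "d": 2},
}

def sim(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = sqrt(sum(r * r for r in ratings[u].values()))
    nv = sqrt(sum(r * r for r in ratings[v].values()))
    return dot / (nu * nv)

def recommend(user):
    scores = {}
    for other in ratings:
        if other == user:
            continue
        s = sim(user, other)
        for item, r in ratings[other].items():
            if item not in ratings[user]:        # candidate: unseen item
                scores[item] = scores.get(item, 0.0) + s * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))   # 'd' is suggested via like-minded user bob
```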
3. Model based Techniques
• Model based collaborative techniques use a two-stage process
for recommendation
• The first stage is carried out offline, where user behavioral
data collected during previous interactions is mined and an
explicit model generated for use in future online interactions.
• The second stage is carried out in real-time as a new visitor
begins an interaction with the Web site.
• Data from the current user session is scored using the models
generated offline, and recommendations are generated based on
this scoring, as in the sketch below.
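A toy sketch of the two stages; the sessions are illustrative, and the offline "model" here is a crude overlap-based clustering standing in for a real mining step.

```python
# Two-stage model-based recommendation: offline profiles, online scoring.
past_sessions = [
    {"home", "laptops", "laptop-x"},
    {"home", "laptops", "laptop-y"},
    {"home", "books", "book-z"},
]

# Stage 1 (offline): merge sessions that share >= 2 pages into profiles.
profiles = []
for s in past_sessions:
    for p in profiles:
        if len(p & s) >= 2:
            p |= s
            break
    else:
        profiles.append(set(s))

# Stage 2 (online): score the current partial session against each profile.
current = {"home", "laptops"}
best = max(profiles, key=lambda p: len(p & current))
print(best - current)        # recommend unvisited pages from the best profile
```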
Model based vs. Memory based Techniques
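• Memory-based techniques use the raw ratings database directly
at recommendation time (as in the nearest-neighbour sketch
above): simple and always up to date, but expensive online
• Model-based techniques move that effort offline into a compact
model that is cheap to apply online, at the cost of the model
going stale until it is rebuilt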
Web Structure Mining
• Web structure mining is used for creating a model of
web organization
• Process of analyzing the nodes and connection structure
of a website using graph theory
Web Structure Mining
Why?
• Used to classify web pages
• Helpful to create information such as relationship and
similarity between different websites
• Useful for discovering website types
• Authority sites: provide information about the subject
• Hub sites: point to many authority sites
Algorithms for Web Structure Mining
PageRank algorithm (Google Founders)
• Looks at number of links to a website and importance of
referring links
• Computed before the user enters the query.
HITS algorithm (Hyperlink-Induced Topic Search)
• User receives two lists of pages for a query (authority pages
and hub pages)
• Computations are done after the user enters the query
PageRank Algorithm
• The idea of the algorithm came from academic citation
literature.
• It was developed in 1998 as part of the Google search
engine prototype
• Studies citation relationship of documents within the web.
• Google search engine ranks documents as a function of both
the query terms and the hyperlink structure of the web
Definition of PageRank
• PageRank produces a ranking independent of the user's
query.
• The importance of a web page is determined by the number of
other important web pages pointing to that page and by the
number of outlinks those pages have
Examples of Backlinks
• Page A is an inlink of pages B and C, while pages B and C
are inlinks of page D.
Computing PageRank
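• In one common formulation, the PageRank of a page p with
backlink set B(p) is

PR(p) = (1 - d)/N + d * Σ_{q ∈ B(p)} PR(q) / |out(q)|

where N is the total number of pages, out(q) is the set of
links leaving page q, and d is the damping factor (typically
0.85): the probability that a "random surfer" follows a link
rather than jumping to a random page
• A minimal iterative sketch on a toy graph follows; the graph
matches the backlink example above, with an illustrative link
D -> A added so that no page is a dead end.

```python
# Iterative PageRank on a toy graph.
def pagerank(graph, d=0.85, iterations=50):
    """graph: {page: [pages it links to]}; returns {page: rank}."""
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}              # start uniform
    for _ in range(iterations):
        new = {}
        for p in graph:
            # sum rank contributions from every page q linking to p
            backlink_sum = sum(
                rank[q] / len(graph[q]) for q in graph if p in graph[q]
            )
            new[p] = (1 - d) / n + d * backlink_sum
        rank = new
    return rank

# A -> B, A -> C, B -> D, C -> D; D -> A added (illustrative) to avoid a dead end
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["A"]}
for page, r in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(page, round(r, 3))                        # D ranks highest
```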
HITS Algorithm (Hyperlink-Induced Topic Search): Authorities and Hubs
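• In the standard formulation, each page p gets an authority
score a(p) and a hub score h(p), updated iteratively as

a(p) = Σ_{q → p} h(q)   and   h(p) = Σ_{p → q} a(q)

with both score vectors normalized after every iteration
• A minimal sketch on a toy graph (the link structure is
illustrative):

```python
# Iterative HITS on a toy graph.
from math import sqrt

def hits(graph, iterations=50):
    """graph: {page: [pages it links to]}; returns (authority, hub) scores."""
    auth = {p: 1.0 for p in graph}
    hub = {p: 1.0 for p in graph}
    for _ in range(iterations):
        # authority score: sum of hub scores of pages pointing to p
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in graph}
        # hub score: sum of authority scores of pages p points to
        hub = {p: sum(auth[q] for q in graph[p]) for p in graph}
        # normalize both vectors so scores stay bounded
        na = sqrt(sum(v * v for v in auth.values()))
        nh = sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

graph = {"A": ["B", "C", "D"], "B": ["D"], "C": ["D"], "D": []}
auth, hub = hits(graph)
print("best authority:", max(auth, key=auth.get))   # D: pointed to by A, B, C
print("best hub:", max(hub, key=hub.get))           # A: points to all three
```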
Comparison: PageRank vs. HITS
• Mining technique used: PageRank uses web structure mining;
HITS uses web structure and web content mining
• Working: PageRank computes scores at indexing time and sorts
results by page importance; HITS computes hub and authority
scores of the n most relevant pages on the fly
• Applied on: PageRank is applied to the entire Web; HITS is
applied to a local neighborhood of pages surrounding the
results of a query
• Input parameters: PageRank uses back links; HITS uses back
links, forward links, and content
• Complexity: O(log N) for both
• Limitations: PageRank is query-independent; HITS has an
efficiency problem
• Search engine: PageRank underlies Google; HITS underlies
CLEVER
CLEVER Algorithm
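• CLEVER is an IBM research search-engine prototype built around
the HITS algorithm, extending the basic hub/authority
computation with refinements such as weighting links according
to the text around them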
Web Usage Mining
• Web usage mining is the process of extracting patterns and
information from server logs to gain insight into user
activity, such as:
• where the users are from,
• how many users clicked which items on the site, and
• the types of activities being done on the site
• Web server logs are treated as raw data, from which
meaningful data are extracted and patterns are identified
• For instance, when an e-commerce business wants to increase
its scope, users' web activity is monitored through the
application logs and data mining is applied to it
• Some of the techniques to discover and analyze web usage
patterns are:
Session and visitor analysis
• The analysis of pre-processed data can be performed through
session analysis, which includes the records of visitors,
days, sessions, etc.
• This information can be used to analyze the behavior of visitors
• The report generated after analysis contains details of
frequently visited web pages and common entry and exit points
OLAP (Online Analytical Processing):
• OLAP performs multidimensional analysis of complex data
• OLAP can be performed on different parts of log-related data
over intervals of time
• OLAP tools can be used to derive important business
intelligence metrics
Web Usage Mining Process
1. Preprocessing:
• Preprocessing consists of converting the usage, content, and
structure information contained in the various available data
sources into the data abstractions necessary for pattern
discovery
• Usage Preprocessing:
• usage preprocessing is the most difficult task in the web
usage mining process, due to the incompleteness of the
available data
• unless a client-side tracking mechanism is used, only the
IP address, agent, and server-side clickstream are
available to identify users and server sessions
• Content Preprocessing:
• consists of converting the text, images, scripts, and
other multimedia into forms that are useful for web usage
mining
• may include content mining such as classification or
clustering
• Structure Preprocessing:
• the structure of a website is created by the hypertext
links between page views
• structure can be preprocessed in the same manner as content
2. Pattern Discovery
• Pattern discovery uses methods and algorithms developed in
several domains like statistics, data mining, machine learning
and pattern recognition
• Statistical Analysis: extract knowledge about visitors by
performing descriptive statistical analysis of the frequency
of page views, viewing time, and length of navigational paths
• Association Rules: discover sets of pages that are accessed
together with a minimum support count (see the sketch after
this list)
• Clustering: two kinds of interesting clusters to mine: usage
clusters and page clusters
• Classification: classify user profiles into different
classes/categories based on their browsing activity
• Sequential Patterns: web marketers can predict future visit
patterns, which helps in placing advertisements aimed at
certain groups of users
• Dependency Modeling: develop a model capable of representing
significant dependencies among the various variables in the
web domain
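A minimal sketch of support-based pattern discovery: counting page pairs that co-occur in user sessions and keeping those above a minimum support count. The sessions and threshold are illustrative.

```python
# Frequent page-pair discovery over user sessions (association-rule style).
from itertools import combinations
from collections import Counter

sessions = [
    {"home", "products", "cart"},
    {"home", "products", "checkout"},
    {"home", "blog"},
]
MIN_SUPPORT = 2

pair_counts = Counter()
for s in sessions:
    for pair in combinations(sorted(s), 2):   # every page pair in the session
        pair_counts[pair] += 1

frequent = {pair: c for pair, c in pair_counts.items() if c >= MIN_SUPPORT}
print(frequent)   # {('home', 'products'): 2} -- pages often accessed together
```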
3. Pattern Analysis:
• Filter out uninteresting rules and patterns from the set
found in the pattern discovery phase
• Load usage data into a data cube to perform OLAP operations
• Visualization techniques, such as graphing patterns or
assigning colors to different values, can highlight overall
patterns
Thank You!!
