The Graph Structure of the Web
- Aggregated by Pay-Level Domain
Oliver Lehmberg, Robert Meusel, Christian Bizer
Research Group Data and Web Science
General Knowledge about the Web Graph
• Broder et al.* in 2000:
– In- and Outdegree follow power laws
– There is a directed path between two pages in 25% of all cases
– The Web Graph has the bow-tie structure
Version 6/25/2014 The Graph Structure in the Web - Aggregated by Pay-Level Domain
Lehmberg/Meusel/Bizer
*A. Broder, R. Kumar, F. Maghoul, P. Raghavan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web.
In WWW’00, pages 309–320. North-Holland Publishing Co, 2000.
Slide 2
Our Contributions
• R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure
in the web – revisited. WWW ’14, 2014.
– Analysis of the 2012 Web Graph on page level
• This presentation:
– Analysis of the same graph, aggregated by pay-level domain (PLD)
– Focus on inter-website connections
– No intra-website links
• Additionally:
– Interconnections between topical groups of websites
– Public Suffix aggregation
DATA SET
Web Data Commons Hyperlink Graph
• Page level: the largest hyperlink graph available to the public
– extracted from Common Crawl
– 3.5 billion nodes (web pages)
– 128 billion arcs (hyperlinks)
• Aggregated by pay-level domain
– 43 million nodes (websites)
– 623 million arcs (aggregated hyperlinks)
– Covers about 18% of the 240 million domains registered in the Web in 2012*
• Pay-level domain:
– dws.informatik.uni-mannheim.de → uni-mannheim.de
*http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
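The mapping from a host name to its pay-level domain can be sketched as follows. This is a minimal illustration, not the extraction code used for the WDC graph: the function name `pay_level_domain` and the tiny `PUBLIC_SUFFIXES` set are assumptions made for the example, and a real run would load the complete Public Suffix List instead.

```python
# Illustrative subset only (assumption); the real Public Suffix List is far larger.
PUBLIC_SUFFIXES = {"de", "com", "org", "uk", "co.uk"}


def pay_level_domain(host, suffixes=PUBLIC_SUFFIXES):
    """Return the public suffix plus one more label, or None if no suffix matches."""
    labels = host.lower().rstrip(".").split(".")
    # Start with the longest candidate suffix so that "co.uk" wins over "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    return None


print(pay_level_domain("dws.informatik.uni-mannheim.de"))  # uni-mannheim.de
```

Checking the longest suffix first matters for multi-label suffixes such as co.uk, where the pay-level domain keeps three labels.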
Downloading the WDC Hyperlink Graph
http://webdatacommons.org/hyperlinkgraph/
• 4 aggregation levels:

Graph                      #Nodes        #Arcs           Size (zipped)
Page graph                 3.56 billion  128.73 billion  376 GB
Subdomain graph            101 million   2,043 million   10 GB
1st level subdomain graph  95 million    1,937 million   9.5 GB
PLD graph                  43 million    623 million     3.1 GB

• Extraction code is published under Apache License
– Extraction costs per run: ~ 200 US$ in Amazon EC2 fees
GRAPH HANDS-ON
Node Centrality Ranking
http://wwwranking.webdatacommons.org
Top PLD Lists
Rank  Website              Outdegree  Website          Indegree   Website                 PageRank
1     blogspot.com         3.898.561  wordpress.org    1.822.440  wordpress.org           113,388
2     wordpress.com        2.249.553  youtube.com      1.319.548  gmpg.org                111,173
3     youtube.com          1.078.938  wikipedia.org    1.243.291  youtube.com             88,206
4     wikipedia.org        862.705    gmpg.org         1.156.727  twitter.com             54,644
5     serebella.com        699.609    blogspot.com     1.034.450  wikipedia.org           54,081
6     refertus.info        668.271    google.com       782.660    blogspot.com            40,901
7     top20directory.com   650.884    wordpress.com    710.590    google.com              40,799
8     typepad.com          551.360    twitter.com      646.239    wordpress.com           28,018
9     botw.org             496.645    yahoo.com        554.251    yahoo.com               27,594
10    tumblr.com           496.045    flickr.com       339.231    networkadvertising.org  27,395
11    dmoz.org             476.890    facebook.com     314.051    apple.com               23,929
12    vindhetviahier.nl    424.646    apple.com        312.396    phpbb.com               22,329
13    jcsearch.com         423.918    miibeian.gov.cn  289.605    miibeian.gov.cn         22,165
14    startpagina.nl       392.543    vimeo.com        269.003    hugedomains.com         20,793
15    yahoo.com            371.087    tumblr.com       226.596    facebook.com            20,254
16    tatu.us              370.918    joomla.org       201.863    joomla.org              18,146
17    freeseek.org         362.310    amazon.com       196.690    flickr.com              17,966
18    lap.hu               352.668    w3.org           196.507    adobe.com               17,903
19    blau-webkatalog.com  312.924    nytimes.com      193.907    linkedin.com            16,083
20    allepaginas.nl       276.578    sourceforge.net  189.663    w3.org                  15,539
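Out- and indegree lists like the ones above can be derived from a list of PLD-level arcs roughly as follows. This is a sketch on in-memory (source, target) pairs; the function name `top_by_degree` is hypothetical, and the way the published graph files would be parsed into such pairs is left out because the file layout is not described here.

```python
from collections import Counter


def top_by_degree(arcs, k=3):
    """Return the k nodes with the highest outdegree and the highest indegree."""
    # Aggregated arcs: every (source, target) pair counts once, intra-site links are dropped.
    unique = {(s, t) for s, t in arcs if s != t}
    outdeg = Counter(s for s, _ in unique)
    indeg = Counter(t for _, t in unique)
    return outdeg.most_common(k), indeg.most_common(k)


toy_arcs = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "d"), ("c", "d"), ("a", "d"), ("d", "d")]
print(top_by_degree(toy_arcs, k=1))  # ([('a', 3)], [('d', 3)])
```

PageRank is not covered by this sketch; it needs an iterative computation over the whole graph.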
Most interlinked PLDs
GRAPH ANALYSIS
In- and Outdegree – Power-Laws?
Power law: $y \propto x^{-\gamma}$
Methodology:
• Maximum-likelihood fitting following Clauset et al.* (plfit*²)
• Goodness-of-fit test
Indegree results:
• $x_0 = 3{,}062$
• $\gamma = 2.40$
• Cannot reject the power-law hypothesis
* Clauset et al.: Power-Law Distributions in Empirical Data. SIAM Review 2009.
*² https://github.com/ntamas/plfit
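The maximum-likelihood step can be illustrated with the continuous estimator $\hat{\gamma} = 1 + n / \sum_i \ln(x_i / x_0)$ over all values at or above the cutoff $x_0$. The sketch below is not plfit: it takes $x_0$ as given, uses the continuous approximation even though degrees are integers, and omits the goodness-of-fit test. The names `fit_exponent` and `sample_power_law` are hypothetical.

```python
import math
import random


def fit_exponent(values, x_min):
    """Continuous maximum-likelihood estimate of gamma for the tail values >= x_min."""
    tail = [v for v in values if v >= x_min]
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above x_min")
    return 1.0 + len(tail) / log_sum


def sample_power_law(n, gamma, x_min, seed=0):
    """Draw n synthetic values from a continuous power law by inverse-transform sampling."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(n)]


# On synthetic data the estimate should land close to the exponent used for sampling.
estimate = fit_exponent(sample_power_law(20000, gamma=2.4, x_min=1.0), x_min=1.0)
print(round(estimate, 2))
```

Fitting only the tail above $x_0$ is what makes the cutoff matter: values below it are ignored entirely, so the reported exponent describes the high-degree nodes only.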
In- and Outdegree – Power-Laws?
Outdegree results:
• $x_0 = 496$
• $\gamma = 2.39$
• Must reject the power-law hypothesis
• It remains unclear which distribution fits
Bow-Tie Structure
Observations:
• Small IN component
• Large OUT component
• TENDRILS and TUBES almost non-existent
Compared to Broder et al.:
• Unbalanced
• LSCC much larger
Compared to our page graph*:
• Proportions of IN and OUT exchanged
• A large fraction of IN pages was merged into the LSCC (ca. 1 billion pages)
* R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure in the web – revisited. WWW ’14, 2014.
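The bow-tie components can be illustrated on a toy graph: find the largest strongly connected component (LSCC), then take IN as everything that reaches it and OUT as everything reachable from it. The sketch below finds SCCs by intersecting forward and backward reachability, which is only practical for small examples; it is an assumption-laden illustration, not the method used for the 43-million-node graph, and the names `bow_tie` and `reachable` are hypothetical. Tendrils, tubes and disconnected nodes are lumped together as OTHER.

```python
from collections import defaultdict


def reachable(adj, start):
    """Return all nodes reachable from the start nodes by following adj."""
    seen, stack = set(start), list(start)
    while stack:
        for nxt in adj.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bow_tie(arcs):
    """Split the nodes of a directed graph into LSCC, IN, OUT and OTHER."""
    fwd, bwd, nodes = defaultdict(set), defaultdict(set), set()
    for s, t in arcs:
        fwd[s].add(t)
        bwd[t].add(s)
        nodes.update((s, t))
    lscc, assigned = set(), set()
    for v in nodes:
        if v in assigned:
            continue
        # The SCC of v is the set of nodes that v reaches and that also reach v.
        scc = reachable(fwd, [v]) & reachable(bwd, [v])
        assigned |= scc
        if len(scc) > len(lscc):
            lscc = scc
    out = reachable(fwd, lscc) - lscc
    inn = reachable(bwd, lscc) - lscc
    return {"LSCC": lscc, "IN": inn, "OUT": out, "OTHER": nodes - lscc - inn - out}


toy = [("i", "a"), ("a", "b"), ("b", "c"), ("c", "a"), ("c", "o"), ("x", "y")]
print({name: sorted(part) for name, part in bow_tie(toy).items()})
```

Aggregating pages to PLDs can move nodes between components, because a single link from any page of a site back into the core is enough to pull the whole site into the LSCC.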
Distance Distribution
Methodology:
• Approximate the distribution several times (using HyperBall*)
Connected pairs: 42.42 (±3.59)%
Avg. distance: 4.27 (±0.085)
Diameter (at least): 48
*P. Boldi and S. Vigna. In-core computation of geometric centralities with HyperBall:
A hundred billion nodes and beyond. In ICDMW 2013. IEEE, 2013
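For intuition, the three reported quantities can be computed exactly on a small graph with one breadth-first search per node. This is not HyperBall, which approximates the distribution with probabilistic counters so that it scales to very large graphs; the sketch also assumes that "connected pairs" means ordered pairs of distinct nodes joined by a directed path. The function name `distance_stats` is hypothetical.

```python
from collections import defaultdict, deque


def distance_stats(arcs):
    """Return (fraction of connected ordered pairs, average distance, diameter)."""
    adj, nodes = defaultdict(list), set()
    for s, t in arcs:
        adj[s].append(t)
        nodes.update((s, t))
    pairs = total = diameter = 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        pairs += len(dist) - 1            # nodes reached from src, excluding src itself
        total += sum(dist.values())
        diameter = max(diameter, max(dist.values()))
    if pairs == 0:
        return 0.0, 0.0, 0
    n = len(nodes)
    return pairs / (n * (n - 1)), total / pairs, diameter


# Directed path a -> b -> c: 3 of 6 ordered pairs are connected.
print(distance_stats([("a", "b"), ("b", "c")]))
```

The average is taken over connected pairs only, so a graph can have a short average distance while most pairs have no path at all.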
High connectivity based on Hubs?
• LSCC of 51.9%, 42% connected pairs & avg. distance of 4.27
– How important are hubs in this graph?
• Approach:
– A) Remove links to Hubs (i.e. high indegree)
– B) Keep only links to Hubs
– Repeat this for different indegree values as thresholds and then
measure largest remaining WCC/SCC
• Results
– Removing links to nodes with high indegree: no large SCC once all links
to nodes with indegree 10 or higher are removed
– Removing links to nodes with low indegree: the more links we remove,
the more likely the remaining nodes are to be part of the largest SCC
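The two experiments (A: drop links to hubs, B: keep only links to hubs) can be sketched on a toy graph as below. The helper names `filter_by_target_indegree` and `largest_scc_size` are hypothetical, the SCC search by reachability intersection only suits small inputs, and the sketch assumes that indegree is measured on the original graph before any links are removed.

```python
from collections import Counter, defaultdict


def _reach(adj, start):
    """Return all nodes reachable from start by following adj."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def largest_scc_size(nodes, arcs):
    """Size of the largest strongly connected component (toy-scale approach)."""
    fwd, bwd = defaultdict(set), defaultdict(set)
    for s, t in arcs:
        fwd[s].add(t)
        bwd[t].add(s)
    best, assigned = 0, set()
    for v in nodes:
        if v not in assigned:
            scc = _reach(fwd, v) & _reach(bwd, v)
            assigned |= scc
            best = max(best, len(scc))
    return best


def filter_by_target_indegree(arcs, threshold, keep_hubs):
    """Keep arcs whose target has indegree >= threshold (keep_hubs) or < threshold."""
    indeg = Counter(t for _, t in arcs)
    return [(s, t) for s, t in arcs if (indeg[t] >= threshold) == keep_hubs]


arcs = [("a", "b"), ("b", "c"), ("c", "a"), ("x", "a"), ("y", "a")]
nodes = {"a", "b", "c", "x", "y"}
without_hub_links = filter_by_target_indegree(arcs, threshold=2, keep_hubs=False)
print(largest_scc_size(nodes, arcs), largest_scc_size(nodes, without_hub_links))  # 3 1
```

Repeating the filter for a range of thresholds and recording the largest component each time gives the kind of curve the approach describes.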
Two Layer Model
Approach:
Remove incoming links from the graph and measure the sizes of the largest SCC/WCC
Subgraph with indegree < 10
• 73.7% of all nodes weakly connected
• No large strongly connected component
→ Low Degree Layer
Subgraph with indegree ≥ 10
• Removed incoming links of 79.2% of all nodes
• 16.1% of all nodes strongly connected
→ High Degree Layer
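One way to read the two layers is as a split of the arcs by the indegree of their target, followed by a component measurement on each part. The sketch below follows that reading, which is an assumption about the procedure rather than a description of it, and measures weak connectivity with a union-find structure. The names `split_layers` and `largest_wcc_fraction` are hypothetical.

```python
from collections import Counter


def split_layers(arcs, threshold=10):
    """Split arcs into those pointing at low-indegree and at high-indegree targets."""
    indeg = Counter(t for _, t in arcs)
    low = [(s, t) for s, t in arcs if indeg[t] < threshold]
    high = [(s, t) for s, t in arcs if indeg[t] >= threshold]
    return low, high


def largest_wcc_fraction(nodes, arcs):
    """Fraction of all nodes that lie in the largest weakly connected component."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps the trees shallow
            v = parent[v]
        return v

    for s, t in arcs:
        parent[find(s)] = find(t)          # arc direction is ignored for weak connectivity
    sizes = Counter(find(v) for v in nodes)
    return max(sizes.values()) / len(nodes)


nodes = {"a", "b", "c", "d", "e", "h"}
arcs = [("a", "h"), ("b", "h"), ("c", "h"), ("a", "b"), ("d", "e")]
low, high = split_layers(arcs, threshold=3)
print(largest_wcc_fraction(nodes, low), largest_wcc_fraction(nodes, high))
```

The fractions are taken over all nodes of the original graph, so nodes that lose every link still count in the denominator.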
PLD Topic Graph
Approach:
• Use topical categories from the Open Directory Project* to categorise our websites
• 15 topical categories
Results:
• “computers”: 6th largest, but largest number of links
• “shopping”: many more incoming than outgoing links, few internal links
Conclusion:
• No obvious patterns, more properties needed
[Figure: PLD topic graph; legible category labels: health, kids and teens, news]
*http://dmoz.org
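Aggregating PLD arcs into a graph between groups, and reading off internal, incoming and outgoing links per group, can be sketched as follows. The mapping `group_of` stands in for a category lookup (or a public suffix lookup) and its contents are invented for the example; the function names are hypothetical as well.

```python
from collections import Counter


def aggregate_by_group(arcs, group_of):
    """Count links between groups; arcs with an uncategorised endpoint are skipped."""
    links = Counter()
    for s, t in arcs:
        if s in group_of and t in group_of:
            links[(group_of[s], group_of[t])] += 1
    return links


def link_profile(links, group):
    """Return (internal, incoming, outgoing) link counts for one group."""
    internal = links[(group, group)]
    incoming = sum(n for (a, b), n in links.items() if b == group and a != group)
    outgoing = sum(n for (a, b), n in links.items() if a == group and b != group)
    return internal, incoming, outgoing


# Invented toy data: four categorised sites and one uncategorised site.
group_of = {"shop1.com": "shopping", "shop2.com": "shopping",
            "blog.org": "news", "dev.net": "computers"}
arcs = [("blog.org", "shop1.com"), ("dev.net", "shop1.com"), ("dev.net", "shop2.com"),
        ("shop1.com", "shop2.com"), ("shop1.com", "dev.net"), ("unknown.io", "shop1.com")]
print(link_profile(aggregate_by_group(arcs, group_of), "shopping"))  # (1, 3, 1)
```

The same two functions apply to any partition of the PLDs, so swapping the lookup is enough to compare groupings.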
Public Suffix (PS) Graph
Approach:
• Top ten PSs from our PLD graph + “others”
• Generally agrees with Verisign Domain Industry Brief*
gTLDs:
• more external than internal links
ccTLDs:
• more internal than external links
Extreme cases:
• .com does not follow this rule
• .de → half of all links are from a single spammer
[Figure: public suffix graph; legible node labels: com, de, info, it, net, nl, org, ru, co.uk, others]
*http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
WebDataCommons.org also offers:
1. Corpus of 17 billion RDFa, Microdata, Microformats statements
2. Corpus of 147 million relational HTML tables
Thank you for your attention!
Advertisement
