Mining a Large Web Corpus

International Internet Preservation Consortium
General Assembly 2014, Paris
Robert Meusel
Christian Bizer

Hyperlink Graphs
Knowledge about the structure of the Web can be used to
improve crawling strategies, to help SEO experts or to
understand social phenomena.

HTML-embedded Data on the Web
Several million websites semantically mark up the content of their HTML pages.
Markup Syntaxes
• Microformats
• RDFa
• Microdata
Data snippets within info boxes

Relational HTML Tables
HTML tables contain semi-structured data which can be used to build up or extend knowledge bases such as DBpedia.
• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
• In a corpus of 14B raw tables, 154M are “good” relations (1.1%)

The Web Data Commons Project
• Has developed an Amazon-based framework for extracting data from large web crawls
  • Can run on any cloud infrastructure
• Has applied this framework to the Common Crawl data
  • Adaptable to other crawls
• Results and framework are publicly available
  • http://webdatacommons.org
Goal: Offer an easy-to-use, cost-efficient, distributed extraction framework for large web crawls, as well as datasets extracted from the crawls.

Extraction Framework
[Diagram: Master, AWS SQS queue, AWS EC2 instances, AWS S3; steps are marked as automated or manual]
1: Fill queue
2: Launch instances
3: Request file reference
4: Download file
5: Extract & upload
6: Collect results
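The six steps can be sketched with in-memory stand-ins. This is only an illustration, not the WDC Java code: `queue.Queue` plays the role of the SQS queue, threads play the EC2 instances, a dictionary plays the S3 output bucket, and `run_extraction` and `extract` are hypothetical names.

```python
import queue
import threading


def run_extraction(file_refs, extract, num_workers=3):
    """Master/worker sketch: fill a queue, launch workers, collect results."""
    tasks = queue.Queue()   # stand-in for the AWS SQS queue
    results = {}            # stand-in for the AWS S3 output bucket
    lock = threading.Lock()

    for ref in file_refs:   # step 1: fill queue
        tasks.put(ref)

    def worker():
        while True:
            try:
                ref = tasks.get_nowait()  # step 3: request file reference
            except queue.Empty:
                return                    # queue drained, worker stops
            output = extract(ref)         # steps 4-5: download and extract
            with lock:
                results[ref] = output     # step 5: upload the output

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:       # step 2: launch instances
        t.start()
    for t in threads:
        t.join()
    return results          # step 6: collect results
```

Because each worker pulls the next file reference itself, adding workers needs no coordination beyond the shared queue.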

Extraction Worker
[Diagram: .(w)arc file downloaded from AWS S3 → WDC Extractor (Filter → Worker) → output file uploaded to AWS S3]
Worker:
• Written in Java
• Processes one page at a time
• Independent of other files and workers
Filter:
• Reduces runtime
• MIME-type filter
• Regex detection of content or meta-information
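A cheap pre-filter of this kind might look as follows. The sketch is in Python rather than the framework's Java, and the two regular expressions and the name `passes_filter` are assumptions chosen for illustration; the slides do not show the actual patterns.

```python
import re

# Hypothetical patterns: the slides only mention "regex detection of
# content or meta-information", not the expressions themselves.
MICRODATA_HINT = re.compile(rb"\bitem(scope|type|prop)\b", re.IGNORECASE)
RDFA_HINT = re.compile(rb"\b(typeof|property|vocab)\s*=", re.IGNORECASE)

HTML_MIME_PREFIXES = ("text/html", "application/xhtml+xml")


def passes_filter(mime_type, body):
    """Return True if a page is worth handing to the expensive extractor."""
    # MIME-type filter: skip images, PDFs, and other non-HTML payloads.
    if not mime_type.lower().startswith(HTML_MIME_PREFIXES):
        return False
    # Regex detection: look for any hint of embedded markup in the raw bytes.
    return bool(MICRODATA_HINT.search(body) or RDFA_HINT.search(body))
```

Pages that fail the filter never reach the parser, which is where the runtime saving comes from.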

Web Data Commons – Extraction Framework
• Written in Java
• Mainly tailored for Amazon Web Services
• Fault tolerant and cheap
  • 300 USD to extract 17 billion RDF statements from 44 TB
• Easily customizable
  • Only the worker has to be adapted
  • The worker is a single process method that processes one file at a time
  • Scaling is automated by the framework
• Access the open source code:
  • https://www.assembla.com/code/commondata/
Alternative: Hadoop version, which can run on any Hadoop cluster without Amazon Web Services.
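As a rough reading of the cost figure, the numbers on this slide can be turned into unit costs. The variable names are illustrative, and the results inherit the rounding of the slide's figures.

```python
cost_usd = 300        # total extraction cost stated on the slide
corpus_tb = 44        # size of the processed crawl in terabytes
statements = 17e9     # extracted RDF statements

usd_per_tb = cost_usd / corpus_tb            # about 6.82 USD per terabyte
statements_per_usd = statements / cost_usd   # about 56.7 million per USD

print(f"{usd_per_tb:.2f} USD per TB")
print(f"{statements_per_usd / 1e6:.1f} million statements per USD")
```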

Extracted Datasets
• Hyperlink Graph
• HTML-embedded Data
• Relational HTML Tables

Hyperlink Graph
• Extracted from the Common Crawl 2012 dataset
• Over 3.5 billion pages connected by over 128 billion links
• Graph files: 386 GB
http://webdatacommons.org/hyperlinkgraph/
http://wwwranking.webdatacommons.org/

Hyperlink Graph
Discovering the evolution of the global structure of the World Wide Web.
• Degrees do not follow a power law
• Detection of spam pages
• Further insights:
  • WWW'14: Graph Structure in the Web – Revisited (Meusel et al.)
  • WebSci'14: The Graph Structure of the Web aggregated by Pay-Level Domain (Lehmberg et al.)
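Statements about degrees start from a degree distribution. A minimal sketch of computing one from an edge list follows; the `(source, target)` pair format and the function name are assumptions, not the format of the published graph files.

```python
from collections import Counter


def in_degree_distribution(edges):
    """Map each in-degree value to the number of nodes with that in-degree.

    `edges` is a list of (source, target) pairs.
    """
    in_degree = Counter(target for _source, target in edges)
    nodes = {node for edge in edges for node in edge}
    # Nodes that nothing links to have in-degree 0.
    return Counter(in_degree.get(node, 0) for node in nodes)
```

Plotting these counts on log-log axes is a common first look at whether a distribution could plausibly be a power law. A graph of billions of nodes would need a streaming or distributed version of this counting step.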

Hyperlink Graph
Discovery of important and interesting sites using different
popularity rankings or website categorization libraries
[Figure: websites connected by at least ½ million links]

HTML-embedded Data
More and more websites semantically mark up the content of their HTML pages.
Markup Syntaxes
RDFa
Microformats
Microdata
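For Microdata, the class of an annotated item is given by the `itemtype` attribute. The sketch below collects those values with Python's standard-library HTML parser. It is an illustration only: it does not produce RDF triples and is not the extractor used by Web Data Commons. The names `ItemTypeCollector` and `microdata_types` are hypothetical.

```python
from html.parser import HTMLParser


class ItemTypeCollector(HTMLParser):
    """Collect the values of all Microdata `itemtype` attributes."""

    def __init__(self):
        super().__init__()
        self.item_types = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # An itemtype attribute may list several space-separated URLs.
            if name == "itemtype" and value:
                self.item_types.extend(value.split())


def microdata_types(html):
    collector = ItemTypeCollector()
    collector.feed(html)
    collector.close()
    return collector.item_types
```

For example, `microdata_types('<div itemscope itemtype="http://schema.org/Product"></div>')` yields a list containing the single Schema.org class URL.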

Websites containing Structured Data (2013)
1.8 million websites (PLDs) out of 12.8 million
provide Microformat, Microdata or RDFa data (13.9%)
585 million of the 2.2 billion pages contain
Microformat, Microdata or RDFa data (26.3%).
Web Data Commons - Microformat, Microdata, RDFa Corpus
• 17 billion RDF triples from Common Crawl 2013
• Next release will be in winter 2014
http://webdatacommons.org/structureddata/

Top Classes Microdata (2013)
• schema = Schema.org
• dv = Google's Rich Snippet Vocabulary

HTML Tables
• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
• Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.
In a corpus of 14B raw tables, 154M are “good” relations (1.1%) (Cafarella et al., 2008).
Classification Precision: 70-80%

WDC - Web Tables Corpus
• Large corpus of relational Web tables for public download
• Extracted from Common Crawl 2012 (3.3 billion pages)
• 147 million relational tables
  • selected out of 11.2 B raw tables (1.3%)
  • download includes the HTML pages of the tables (1 TB zipped)
• Table Statistics
  • Heterogeneity: Very high.

             Min   Max      Average   Median
Attributes   2     2,368    3.49      3
Data Rows    1     70,068   12.41     6

http://webdatacommons.org/webtables/

WDC - Web Tables Corpus
• Attribute Statistics
28,000,000 different attribute labels

Attribute      #Tables
name           4,600,000
price          3,700,000
date           2,700,000
artist         2,100,000
location       1,200,000
year           1,000,000
manufacturer   375,000
country        340,000
isbn           99,000
area           95,000
population     86,000

• Subject Attribute Values
1.74 billion rows
253,000,000 different subject labels

Value              #Rows
usa                135,000
germany            91,000
greece             42,000
new york           59,000
london             37,000
athens             11,000
david beckham      3,000
ronaldinho         1,200
oliver kahn        710
twist shout        2,000
yellow submarine   1,400
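Counts like those in the attribute table can be produced by tallying header labels across tables. The sketch below assumes each table is represented by its list of header labels and normalizes labels by trimming and lower-casing. Both the representation and the normalization are assumptions, since the slides do not describe how labels were prepared.

```python
from collections import Counter


def attribute_table_counts(table_headers):
    """Count, per attribute label, the number of tables that contain it."""
    counts = Counter()
    for header in table_headers:
        # Use a set so a label repeated within one table is counted once.
        labels = {label.strip().lower() for label in header if label.strip()}
        counts.update(labels)
    return counts
```

The number of different attribute labels is then `len(counts)`, and `counts.most_common(10)` gives the most frequent labels.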

Conclusion
Three factors are necessary to work with web-scale data:
• Availability of crawls
  • Thanks to Common Crawl, this data is available
• Availability of cheap, easy-to-use infrastructure
  • Like Amazon or other on-demand cloud services
• Easy-to-adopt, scalable extraction frameworks
  • The Web Data Commons framework, or standard tools like Pig
  • Costs have to be evaluated per task, but the WDC framework has turned out to be cheaper

Questions
• Please visit our website: www.webdatacommons.org
• Data and framework are available as free downloads
• Web Data Commons is supported by: (sponsor logos shown on the slide)
