Big Data Technology 
and the Social Sciences: 
A Lecture at Mannheim University 
Abe Usher CCHP, CISSP 
Chief Technology Officer, HumanGeo
2 
What’s In It For You? 
Theory 
• Definitions and overview
• Where data are being generated
Practice
• Google’s three secret techniques* for unlocking insights from data
• The kitchen model
• Recommended resources to build data science skills
Presentation slides: 
http://www.slideshare.net/abeusher/big-data-and-the-social-sciences 
*Not specifically endorsed by Google. Also, not really a secret.
3 
Background 
HumanGeo is focused on digital Human Geography: 
• Understanding the location attributes of individuals and groups
• And the social attributes of locations
• Through ‘Big Data’ analysis of billions of geolocated data elements
4 
Big Data Wake-Up Call 
Berkeley University Research http://goo.gl/zjSUr1 
By 2016 the rate of data growth surpasses the rate of Moore’s Law
5 
Defining Big Data 
http://knowyourmeme.com/memes/you-keep-using-that-word-i-do-not-think-it-means-what-you-think-it-means
6 
Big Data Definition 
Boring Traditional definition 
“High volume, velocity and variety 
information assets that demand 
cost-effective, innovative forms of 
information processing for 
enhanced insight and decision 
making.”
7 
Big Data Definition 
Abe’s definition:
9 
The Original “Big Data” 
1880 US Census 
• 50 million people 
• Data included: age, gender, number of insane people in household*
• Took 7 years to tabulate
• 1890 Census estimated at 13 years to complete
1890
• 63 million people
• Additional data: citizenship and military service
• New technology: Hollerith Tabulating System
• Took 6 weeks to tabulate (76x faster)
Takeaway
• Better technology and methodology led to 76x speedup
*Credit to Ken Krugler for this factoid: http://www.censusrecords.com/content/1880_census
10 
Data Generation 
Where are data created? 
• Website interaction logs
• Social media
• Cyber events
• Smartphones
What is the volume?
• 3B phone calls in USA
• 700M Facebook posts
• 500M tweets per day
• 50B WhatsApp messages per day
Takeaway
• Social media, telecommunication, and instant messaging generate an increasingly high volume of data
11 
Traditional Model 
of Interpreting Observations 
Tracy Morrow (aka “Ice T”) 
How can you identify a 
legitimate hip-hop artist 
(versus someone who just gets 
up and rhymes)? 
http://www.npr.org/2005/08/30/4824690/original-gangster-rapper-and-actor-ice-t
“Game knows game, baby.”
“If you have expert knowledge, 
then you are capable of 
answering complex questions 
by interpreting domain specific 
information.” [paraphrased]
14
Trust Models for complex data
• August Gorman carried out a plot to grab fractions of a penny from a corporate payroll system. http://goo.gl/vAScel
IMDB: 4.9/10
Rotten Tomatoes: 26/100
15
Trust Models for complex data
• Peter Gibbons hatches a plot to write a computer virus that grabs fractions of a penny from a corporate retirement account. http://goo.gl/rDg1U
• Known in security circles as a salami attack.
IMDB: 7.9/10
Rotten Tomatoes: 79/100
Takeaway point: Little bits of value (information) provide deep insights in the aggregate
16
New Models of Interpreting (Big) Data
1. Aggregation
2. Visualization
3. Correlation
Takeaways
• Expert-based knowledge is no longer sufficient.
• Simple mathematical methods create value from captured data
17 
Aggregation 
(Counting) 
William Thomson, 1st Baron Kelvin 
“When you can measure what you are speaking about, and express it in numbers, you know something about it.”
Takeaway
• Aggregation via counting things is the most common way to exploit Big Data
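To make "aggregation via counting" concrete, here is a minimal sketch in Python. The messages and the `count_hashtags` helper are hypothetical and purely illustrative; real input would be far larger and would need more careful tokenizing.

```python
from collections import Counter

# Hypothetical sample of short messages (illustrative data, not from the lecture).
messages = [
    "flu season again #flu",
    "big game tonight #football",
    "home sick with the #flu",
    "new phone day #tech",
    "#flu shots at the pharmacy",
]

def count_hashtags(texts):
    """Count how often each hashtag appears across all texts."""
    counts = Counter()
    for text in texts:
        for token in text.lower().split():
            # Treat any whitespace-separated token starting with '#' as a hashtag.
            if token.startswith("#") and len(token) > 1:
                counts[token] += 1
    return counts

hashtag_counts = count_hashtags(messages)
top_tags = hashtag_counts.most_common(2)
print(top_tags)
```

Nothing more than counting is involved, yet the most frequent tag already hints at what the population is talking about.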
Aggregation: 
A Tale of Two Products 
The book “Fearless” is much more popular than the 80s movie “Navy Seals.” 
It also has a more favorable distribution of reviews.
Aggregation:
A Tale of Two Products
The distribution we’re looking for looks like the #1 hand:
responses concentrated in the most positive category,
with very few responses that were unfavorable.
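One way to compare two review distributions is to turn raw star counts into shares. The sketch below assumes reviews arrive as a mapping from star rating to count; the two products and all numbers are hypothetical and are not the figures shown on the slides.

```python
def rating_shares(star_counts):
    """Convert {stars: count} into {stars: share of all reviews}."""
    total = sum(star_counts.values())
    return {stars: count / total for stars, count in star_counts.items()}

# Hypothetical review counts per star rating (illustrative only).
product_a = {5: 80, 4: 10, 3: 5, 2: 3, 1: 2}
product_b = {5: 20, 4: 20, 3: 20, 2: 20, 1: 20}

shares_a = rating_shares(product_a)
shares_b = rating_shares(product_b)

# Share of unfavorable reviews, taken here to mean one or two stars.
unfavorable_a = shares_a[1] + shares_a[2]
unfavorable_b = shares_b[1] + shares_b[2]

print(f"A: {shares_a[5]:.0%} five-star, {unfavorable_a:.0%} unfavorable")
print(f"B: {shares_b[5]:.0%} five-star, {unfavorable_b:.0%} unfavorable")
```

Product A has the shape described above: most responses in the top category and few unfavorable ones.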
Aggregation & Visualization: 
Counting with Google Trends
Aggregation & Visualization: 
Bing Search vs. Google Search
Aggregation: 
Diet Pepsi vs. Diet Coke
Aggregation & Visualization: 
Big Data vs. Britney Spears
Geospatial Visualization Example: 
Social Drift in DC 
Takeaway 
• Visualization provides a 
powerful mechanism for 
Exploratory Data Analysis 
25 
Correlation: 
Canadian Flu Research 
Gunther Eysenbach 
• Professor @ University of Toronto
• Focused on eHealth
• Google Ads user
Infodemiology
• 2004-2005 tracked flu related searches
• 54,507 Ad impressions in Canada
• High R^2 correlation to actual flu activity
http://gunther-eysenbach.blogspot.com/ 
Infodemiology paper: http://goo.gl/aeUZtA 
Takeaway 
• Human behavior in response to 
Google Ads related to the flu was 
highly correlated with “officially 
reported” cases of the flu.
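As a rough illustration of what a "high R^2 correlation" means, the sketch below computes a Pearson correlation coefficient and its square for two weekly series. The numbers are hypothetical and are not taken from the infodemiology paper; the `pearson_r` helper is written out by hand only to show the arithmetic.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Hypothetical weekly counts (illustrative only).
weekly_ad_clicks = [10, 20, 30, 40, 50]
weekly_reported_cases = [12, 24, 33, 41, 55]

r = pearson_r(weekly_ad_clicks, weekly_reported_cases)
r_squared = r ** 2
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
```

An R^2 close to 1 says the two series move together; it says nothing about why they do.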
26
Correlation:
Google Flu Trends
“Google Flu Trends provides near real-time estimates of flu activity for a number of countries and regions around the world based on aggregated search queries.”
Process
• Map searches to regions
• Quantify “normal”
• Detect “anomalies”
NYT: http://goo.gl/mNyAi7
NPR: http://goo.gl/Iv7A87
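The three process steps can be sketched as a simple per-region baseline check. This is not Google's actual method: it assumes searches have already been mapped to regions, treats "normal" as the historical mean, and flags a week as anomalous when it sits more than a chosen number of standard deviations above that mean. The regions, counts, and threshold are all hypothetical.

```python
from statistics import mean, stdev

def find_anomalies(history, current, threshold=2.0):
    """Return {region: z-score} for regions far above their historical mean."""
    anomalies = {}
    for region, past_counts in history.items():
        baseline = mean(past_counts)   # quantify "normal"
        spread = stdev(past_counts)    # sample standard deviation
        if spread == 0:
            continue                   # no variation, so no z-score
        z_score = (current[region] - baseline) / spread
        if z_score > threshold:        # detect "anomalies"
            anomalies[region] = z_score
    return anomalies

# Hypothetical weekly flu-related search counts, already mapped to regions.
history = {
    "Ontario": [100, 110, 90, 105, 95],
    "Quebec": [80, 85, 75, 82, 78],
}
current = {"Ontario": 160, "Quebec": 81}

anomalies = find_anomalies(history, current)
print(anomalies)
```

A fixed z-score threshold is the simplest possible choice; seasonal baselines would be a natural refinement.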
27 
Correlation: 
Box Office Hit Prediction 
“Use of socially generated ‘big 
data’ to access information 
about collective states of the 
minds in human societies has 
become a new paradigm in the 
emerging field of computational 
social science.” 
Simple factors 
• number of total page views 
• number of total edits made 
• number of users editing 
• number of revisions in the 
article's revision history 
Early Prediction of Movie Box Office Success: http://goo.gl/BWf7H1 
Counts of Wikipedia factors correlate to Box Office sales
28 
Big Data: 
Significance for Social Sciences 
1. Proxy variables. 
Digital exhaust collected for purposes other than survey often creates 
‘proxy variables’ that provide complementary insights. 
2. Aggregation Insights. 
Combining many small observations leads to insights that we can trust. 
3. Data Linking. 
It is possible to ‘link’ or synchronize records between digital exhaust and 
instrumented surveys by selecting a common dimension (e.g. location). 
The future of social science will involve combining 
“fuzzy Big Data insights” with instrumented survey results
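Data linking on a common dimension can be sketched as a join keyed on location. The survey rows, the per-region post counts, and the field names below are all hypothetical; the only assumption that matters is that both sources use the same region labels.

```python
# Hypothetical instrumented survey results, one row per region.
survey_rows = [
    {"region": "Mannheim", "reported_trust": 0.62},
    {"region": "Berlin", "reported_trust": 0.55},
]

# Hypothetical digital-exhaust aggregate keyed by the same dimension.
post_counts_by_region = {"Mannheim": 1200, "Berlin": 9800, "Hamburg": 4300}

def link_by_region(rows, counts_by_region):
    """Attach the post count to each survey row that has a matching region."""
    linked = []
    for row in rows:
        region = row["region"]
        if region in counts_by_region:
            linked.append({**row, "post_count": counts_by_region[region]})
    return linked

linked = link_by_region(survey_rows, post_counts_by_region)
for record in linked:
    print(record)
```

Regions present in only one source (Hamburg here) drop out, which is worth checking before drawing conclusions from the linked records.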
Correlation Does Not Equal Causation 
http://xkcd.com/552/
The kitchen model of value creation 
• Chef: your staff
• Ingredients: your data
• Utensils: technology
• Recipes: techniques
31 
Take Action: 
Experiment yourself 
Exploratory Data Analysis lifecycle: 
• collect - Twitter API, Datasift.com 
• clean - OpenRefine
• analyze - Python or R 
• visualize - Google Earth 
Related data: 
https://s3.amazonaws.com/devbackup/germany.txt.gz 
Related code: https://github.com/abeusher
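For the visualize step, Google Earth reads KML files. The sketch below builds a minimal KML document from a few points. It assumes the collected and cleaned data have been reduced to (name, latitude, longitude) tuples; that shape, the `to_kml` helper, and the sample points are hypothetical and do not reflect the format of the related data file.

```python
from xml.sax.saxutils import escape

def to_kml(points):
    """Build a minimal KML document from (name, latitude, longitude) tuples."""
    placemarks = []
    for name, lat, lon in points:
        # KML coordinates are written longitude first, then latitude.
        placemarks.append(
            "  <Placemark>\n"
            f"    <name>{escape(name)}</name>\n"
            f"    <Point><coordinates>{lon},{lat}</coordinates></Point>\n"
            "  </Placemark>"
        )
    body = "\n".join(placemarks)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "<Document>\n"
        f"{body}\n"
        "</Document>\n"
        "</kml>\n"
    )

# Hypothetical geolocated records (illustrative only).
points = [
    ("Mannheim", 49.4875, 8.466),
    ("Berlin & Brandenburg", 52.52, 13.405),
]

kml_text = to_kml(points)
print(kml_text)
```

Saving `kml_text` to a file with a `.kml` extension gives something Google Earth can open directly.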
32 
Take Action: Explore 
Google Trends http://goo.gl/8eJZg
Google Ngram http://goo.gl/4U09fa
Google Correlate http://goo.gl/nEhe8D
Bing Keyword Research http://goo.gl/q2V88g
33 
Contact information 
Abe Usher 
Email: abe.usher@gmail.com 
Twitter: @abeusher 
LinkedIn: http://goo.gl/DUxZOP 
Presentations: http://goo.gl/bCa3Qt


Editor's Notes

  • #17 Aggregation is often the first and most important step in synthesizing facts and trends from a large pool of data. Correlation is useful in identifying related spatial features. Beware of spatial auto-correlation! Once a relationship has been quantified during a correlation step, this numeric relationship can be used for coarse forecasting (anticipatory analysis).