Big Data and the Social Sciences

Big Data Technology
and the Social Sciences:
A Lecture at Mannheim University
Abe Usher CCHP, CISSP
Chief Technology Officer, HumanGeo

2
What’s In It For You?
Theory
• Definitions and overview
• Where data are being generated
Practice
• Google’s three secret techniques*
for unlocking insights from data
• The kitchen model
• Recommended resources to build
data science skills
Presentation slides:
http://www.slideshare.net/abeusher/big-data-and-the-social-sciences
*Not specifically endorsed by Google. Also, not really a secret.

3
Background
HumanGeo is focused on digital Human Geography:
• Understanding the location attributes of individuals and groups
• And the social attributes of locations
• Through ‘Big Data’ analysis of billions of geolocated data elements

4
Big Data Wake-Up Call
Berkeley University Research http://goo.gl/zjSUr1
By 2016 the rate of data growth surpasses the rate of Moore’s Law

5
Defining Big Data
http://knowyourmeme.com/memes/you-keep-using-that-word-i-do-not-think-it-means-what-you-think-it-means

6
Big Data Definition
Boring Traditional definition
“High volume, velocity and variety
information assets that demand
cost-effective, innovative forms of
information processing for
enhanced insight and decision
making.”

7
Big Data Definition
Abe’s definition:

8
The Original “Big Data”
1880 US Census
• 50 million people
• Data included: age, gender, number
of insane people in household*
• Took 7 years to tabulate
• 1890 Census estimated at 13 years to
complete
*Credit to Ken Krugler for this factoid: http://www.censusrecords.com/content/1880_census

9
The Original “Big Data”
1890
• Additional data: citizenship and
military service
• New technology: Hollerith Tabulating
System
• Took 6 weeks to tabulate (76x faster)
Takeaway
• Better technology and methodology led
to 76x speedup

10
Data Generation
Where are data created?
• Website interaction logs
• Social Media
• Cyber events
• Smartphones
What is the volume?
• 3B phone calls in USA
• 700M Facebook posts
• 500M tweets per day
• 50B WhatsApp messages per day
Takeaway
• Social media, telecommunication,
and instant messaging generate an
increasingly high volume of data

11
Traditional Model
of Interpreting Observations
Tracy Marrow (aka “Ice-T”)
How can you identify a
legitimate hip-hop artist
(versus someone who just gets
up and rhymes)?
http://www.npr.org/2005/08/30/4824690/original-gangster-rapper-and-actor-ice-t

12
Traditional Model
“Game knows game, baby.”

13
Traditional Model
“If you have expert knowledge,
then you are capable of
answering complex questions
by interpreting domain specific
information.” [paraphrased]

14
Trust Models for complex data
• August Gorman carried out a
plot to grab fractions of a
penny from a corporate payroll
system. http://goo.gl/vAScel
IMDB: 4.9/10
Rotten Tomatoes: 26/100

15
Trust Models for complex data
• Peter Gibbons hatches a plot
to write a computer virus that
grabs fractions of a penny from
a corporate retirement
account. http://goo.gl/rDg1U
• Known in security circles as a
salami attack.
IMDB: 7.9/10
Rotten Tomatoes: 79/100
Takeaway point: Little bits of value (information)
provide deep insights in the aggregate

16
New Models of
Interpreting (Big) Data
1. Aggregation
2. Visualization
3. Correlation
Takeaways
• Expert-based knowledge is no
longer sufficient.
• Simple mathematical methods
create value from captured data

17
Aggregation
(Counting)
William Thomson, 1st Baron Kelvin
“When you can measure
what you are speaking
about, and express it in
numbers, you know
something about it.”
Takeaway
• Aggregation via counting things
is the most common way to
exploit Big Data
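As a minimal sketch of aggregation by counting, the Python snippet below tallies hashtags across a handful of posts. The sample posts and the count_hashtags helper are assumptions invented for illustration; they are not data or code from the lecture.

```python
from collections import Counter
import re

# Hypothetical sample posts, invented for illustration only.
POSTS = [
    "Flu season again #flu #sick",
    "Stuck in bed with the #flu",
    "Great concert tonight #music",
    "Is it a cold or the #Flu? #sick",
]


def count_hashtags(posts):
    """Count how often each hashtag appears, ignoring case."""
    counts = Counter()
    for post in posts:
        # Capture the word that follows each '#' and normalize its case.
        counts.update(tag.lower() for tag in re.findall(r"#(\w+)", post))
    return counts


if __name__ == "__main__":
    for tag, n in count_hashtags(POSTS).most_common():
        print(f"{tag}: {n}")
```

The same pattern scales from four posts to millions: the only state kept is one counter per distinct item.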

Aggregation:
A Tale of Two Products
The book “Fearless” is much more popular than the 80s movie “Navy Seals.”
It also has a more favorable distribution of reviews.

Aggregation:
A Tale of Two Products
The distribution we’re looking for looks like the #1 hand:
responses concentrated in the most positive category,
with very few responses that were unfavorable.
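One way to compare two review distributions is to aggregate star ratings into shares per category. The sketch below does that in Python; the ratings for the two products are made-up numbers, not the actual reviews of “Fearless” or “Navy Seals”, and rating_shares is a hypothetical helper name.

```python
from collections import Counter


def rating_shares(ratings):
    """Return the share of reviews at each star level (1 to 5)."""
    if not ratings:
        raise ValueError("need at least one rating")
    counts = Counter(ratings)
    total = len(ratings)
    return {stars: counts.get(stars, 0) / total for stars in range(1, 6)}


# Hypothetical star ratings for two products, invented for illustration.
PRODUCT_A = [5, 5, 5, 5, 4, 5, 4, 5, 3, 5]
PRODUCT_B = [5, 1, 3, 2, 4, 1, 5, 3, 2, 1]

if __name__ == "__main__":
    for name, ratings in (("A", PRODUCT_A), ("B", PRODUCT_B)):
        shares = rating_shares(ratings)
        unfavorable = shares[1] + shares[2]
        print(f"Product {name}: {shares[5]:.0%} five-star, "
              f"{unfavorable:.0%} one- or two-star")
```

A product with the favorable shape described above shows a large five-star share and a small one- and two-star share.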

Aggregation & Visualization:
Counting with Google Trends

Bing Search vs. Google Search

Aggregation:
Diet Pepsi vs. Diet Coke

Big Data vs. Britney Spears

Geospatial Visualization Example:
Social Drift in DC
Takeaway
• Visualization provides a
powerful mechanism for
Exploratory Data Analysis

25
Correlation:
Canadian Flu Research
Gunther Eysenbach
• Professor @ University of
Toronto
• Focused on eHealth
• Google Ads user
Infodemiology
• 2004-2005 tracked flu-related
searches
• 54,507 Ad impressions in
Canada
• High R^2 correlation to actual
flu activity
http://gunther-eysenbach.blogspot.com/
Infodemiology paper: http://goo.gl/aeUZtA
Takeaway
• Human behavior in response to
Google Ads related to the flu was
highly correlated with “officially
reported” cases of the flu.
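To make “highly correlated” concrete, here is a minimal sketch of the Pearson correlation coefficient and its square in Python. The weekly ad-click and flu-case numbers are hypothetical and are NOT the figures from the study; pearson_r is an illustrative helper, not the paper's code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly values, invented for illustration.
AD_CLICKS = [12, 18, 25, 40, 62, 55, 30, 16]
FLU_CASES = [30, 41, 66, 98, 150, 140, 80, 35]

if __name__ == "__main__":
    r = pearson_r(AD_CLICKS, FLU_CASES)
    print(f"r = {r:.3f}, R^2 = {r * r:.3f}")
```

A value of r near 1 means the two series rise and fall together; it says nothing about which one drives the other.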

26
Correlation:
Google Flu Trends
“Google Flu Trends provides near
real-time estimates of flu activity
for a number of countries and
regions around the world based on
aggregated search queries.”
Process
• Map searches to regions
• Quantify “normal”
• Detect “anomalies”
NYT: http://goo.gl/mNyAi7
NPR: http://goo.gl/Iv7A87
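The three process steps can be sketched in a few lines of Python. This is not Google's actual model, which the slides do not describe; it is a simple z-score stand-in, and the regions, counts, and the threshold of 2.0 are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(baseline, recent, threshold=2.0):
    """Flag recent weekly counts that sit more than `threshold`
    standard deviations above the baseline mean."""
    mu = mean(baseline)      # quantify "normal"
    sigma = stdev(baseline)  # typical week-to-week spread
    if sigma == 0:
        return []
    return [
        (week, count)
        for week, count in enumerate(recent, start=1)
        if (count - mu) / sigma > threshold
    ]


# Hypothetical flu-related search counts, already mapped to regions.
SEARCHES = {
    "Region A": {"baseline": [100, 110, 95, 105, 90, 100],
                 "recent": [102, 98, 160]},
    "Region B": {"baseline": [40, 42, 38, 41, 39, 40],
                 "recent": [41, 39, 40]},
}

if __name__ == "__main__":
    for region, series in SEARCHES.items():
        hits = find_anomalies(series["baseline"], series["recent"])
        print(region, hits)
```

Each returned pair is (week number within the recent window, observed count) for a week that stands out against that region's own baseline.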

27
Correlation:
Box Office Hit Prediction
“Use of socially generated ‘big
data’ to access information
about collective states of the
minds in human societies has
become a new paradigm in the
emerging field of computational
social science.”
Simple factors
• number of total page views
• number of total edits made
• number of users editing
• number of revisions in the
article's revision history
Early Prediction of Movie Box Office Success: http://goo.gl/BWf7H1
Counts of these Wikipedia factors correlate with box office sales
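As a rough sketch of how a count can be turned into a prediction, the Python below fits a straight line to log page views versus log revenue. The cited paper's actual model is not reproduced here: the single predictor, the log-log form, and every number in MOVIES are assumptions made for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical (Wikipedia page views, opening revenue in USD) pairs.
MOVIES = [
    (1_000, 2e5),
    (10_000, 1.5e6),
    (50_000, 9e6),
    (200_000, 3e7),
    (1_000_000, 1.2e8),
]


def predict_revenue(page_views, movies=MOVIES):
    """Predict revenue from page views with a log-log linear fit."""
    log_views = [math.log10(views) for views, _ in movies]
    log_revenue = [math.log10(revenue) for _, revenue in movies]
    slope, intercept = fit_line(log_views, log_revenue)
    return 10 ** (slope * math.log10(page_views) + intercept)


if __name__ == "__main__":
    print(f"Predicted revenue: ${predict_revenue(100_000):,.0f}")
```

Adding the other factors listed above (edits, editors, revisions) would turn this into a multi-variable regression, which needs a linear algebra library rather than the hand-written fit shown here.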

28
Big Data:
Significance for Social Sciences
1. Proxy variables.
Digital exhaust collected for purposes other than surveys often creates
‘proxy variables’ that provide complementary insights.
2. Aggregation Insights.
Combining many small observations leads to insights that we can trust.
3. Data Linking.
It is possible to ‘link’ or synchronize records between digital exhaust and
instrumented surveys by selecting a common dimension (e.g. location).
The future of social science will involve combining
“fuzzy Big Data insights” with instrumented survey results
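Data linking on a common dimension can be sketched as a join keyed on location. In the Python below, the survey rows, the posts, the field names, and the link_by_region helper are all hypothetical and exist only to show the idea.

```python
from collections import Counter

# Hypothetical survey results, one row per region (invented values).
SURVEY = [
    {"region": "Mannheim", "respondents": 120, "mean_wellbeing": 7.1},
    {"region": "Berlin", "respondents": 300, "mean_wellbeing": 6.8},
    {"region": "Hamburg", "respondents": 150, "mean_wellbeing": 7.0},
]

# Hypothetical geolocated posts, already mapped to a region.
POSTS = [
    {"region": "Berlin", "text": "post 1"},
    {"region": "Mannheim", "text": "post 2"},
    {"region": "Berlin", "text": "post 3"},
]


def link_by_region(survey_rows, posts):
    """Attach the number of posts per region to each survey row.

    Returns new dictionaries and leaves the input rows unchanged.
    """
    post_counts = Counter(post["region"] for post in posts)
    return [
        dict(row, post_count=post_counts[row["region"]])
        for row in survey_rows
    ]


if __name__ == "__main__":
    for row in link_by_region(SURVEY, POSTS):
        print(row)
```

Regions with survey data but no posts get a count of zero, which keeps the gap visible instead of silently dropping the row.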

Correlation Does Not Equal Causation
http://xkcd.com/552/

The kitchen model of value creation
Chef = Your Staff
Ingredients = Your Data
Utensils = Technology
Recipes = Techniques

31
Take Action:
Experiment yourself
Exploratory Data Analysis lifecycle:
• collect - Twitter API, Datasift.com
• clean - OpenRefine
• analyze - Python or R
• visualize - Google Earth
Related data:
https://s3.amazonaws.com/devbackup/germany.txt.gz
Related code: https://github.com/abeusher
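The clean and visualize steps of the lifecycle above can be sketched in Python as follows. The record layout (latitude, longitude, text) is an assumption, since the slides do not describe the format of the linked data file; the sample rows and the clean and to_kml helpers are hypothetical. The output is a minimal KML document of the kind Google Earth can open.

```python
from xml.sax.saxutils import escape

# Hypothetical raw records: (latitude, longitude, text). Invented values;
# the real layout of the linked data file is not given in the slides.
RAW = [
    ("49.4875", "8.4660", "Guten Morgen Mannheim"),
    ("52.5200", "13.4050", "Berlin <3"),
    ("not-a-number", "8.0", "broken row"),
    ("95.0", "8.0", "latitude out of range"),
]


def clean(rows):
    """Keep rows whose coordinates parse and fall in valid ranges."""
    cleaned = []
    for lat, lon, text in rows:
        try:
            lat_f, lon_f = float(lat), float(lon)
        except ValueError:
            continue  # skip rows with unparseable coordinates
        if -90 <= lat_f <= 90 and -180 <= lon_f <= 180:
            cleaned.append((lat_f, lon_f, text.strip()))
    return cleaned


def to_kml(points):
    """Render points as a minimal KML document."""
    # KML expects coordinates in longitude,latitude order.
    placemarks = "".join(
        f"<Placemark><name>{escape(text)}</name>"
        f"<Point><coordinates>{lon},{lat}</coordinates></Point></Placemark>"
        for lat, lon, text in points
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
        f"{placemarks}</Document></kml>"
    )


if __name__ == "__main__":
    with open("points.kml", "w", encoding="utf-8") as handle:
        handle.write(to_kml(clean(RAW)))
```

Running the script writes points.kml to the current directory; the collect and analyze steps are left out to keep the sketch short.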

32
Take Action: Explore
Google Trends http://goo.gl/8eJZg
Google Ngram http://goo.gl/4U09fa
Google Correlate http://goo.gl/nEhe8D
Bing Keyword Research http://goo.gl/q2V88g

33
Contact information
Abe Usher
Email: abe.usher@gmail.com
Twitter: @abeusher
LinkedIn: http://goo.gl/DUxZOP
Presentations: http://goo.gl/bCa3Qt
