Jonathan Dinu

Co-Founder, Zipfian Academy

jonathan@zipfianacademy.com

@clearspandex
@ZipfianAcademy
Data Engineering 101: Building your first data product
May 4th, 2014
Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
Questions? tweet @zipfianacademy #pydata
Disclaimer:
All characters appearing in this presentation are fictitious. Any resemblance to real persons, living or dead, is purely coincidental.
Disclaimer:
This presentation contains strong opinions that you may or may not agree with. All thoughts are my own.
Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Creating Value for Users
• Q&A
nwsrdr (News Reader)
Get nwsrdr on your desktop: getnews.com/bookmarklet
When browsing the web simply click +nwsrdr to save any page to nwsrdr
Source: http://www.groovypost.com/wp-content/uploads/2013/05/Bookmark-Button.png
nwsrdr
It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!
• Auto-categorize Articles: Naive Bayes (classification)
• Find Similar Articles: Clustering (unsupervised learning)
• Recommend Articles: Collaborative Filtering
• Suggest Feeds to Follow: Triangle Closing
• No Ads: Real Business Model!
Data Products
• Product Built on Data (that you sell)
• Product that Generates Data (that you sell), e.g. Facebook
Source: http://gifgif.media.mit.edu/

Data Generating Products
Products that enhance a user’s experience the more “data” the user provides
Ex: Recommender Systems
Source: http://www.adamlaiacano.com/post/57703317453/data-generating-products
Data Science
i.e. solve more problems than you create
But... How?!
Source: http://estoyentretenido.com/wp-content/uploads/2012/11/Jackie-Chan-Meme.jpg
Data Engineering
Source: http://www.schooljotter.com/imagefolders/lady/Class_3/Engineer-It-1350063721.PNG
Data Science
[Diagram: Prepared Data → Sampling → Training Set / Test Set → Train Model → Evaluate, with Cross Validation]

Data Engineering
[Diagram: Raw Data → Scrubbing → Cleaned Data → Vectorization → Prepared Data → Sampling → Training Set / Test Set → Train Model → Evaluate, with Cross Validation. New Data → Scrubbing → Cleaned Data → Vectorization → Prepared Data → Predict → Labels/Classes]
What
• Naive Bayes (classification)
• Clustering (unsupervised learning)
• Collaborative Filtering
• Triangle Closing
• Real Business Model
Source: http://media.tumblr.com/tumblr_lakcynCyG31qbzcoy.jpg
Abstraction (Cake)
How
(ABK)
Obligatory Name Drop
Stage               Locally              At Scale
Acquisition         requests             scrapy
Parse               BeautifulSoup4       Hadoop Streaming (w/ BeautifulSoup4)
Storage             pymongo              Snakebite (HDFS)
Transform/Explore   pandas               mrjob or Mortar (w/ Python UDF)
Vectorization       scikit-learn/NLTK    MLlib (pySpark)
Train Model         scikit-learn/NLTK    MLlib (pySpark)
Expose              yHat                 yHat
Presentation        Flask                Flask
Pipeline
Iteration 0:
• Find out how much data
• Run locally
• Experiment
Acquire
Retrieve Meta-data for ALL NYT articles
import requests

api_key = 'xxxxxxxxxxxxx'

url = ('http://api.nytimes.com/svc/search/v2/articlesearch.json'
       '?fq=section_name.contains:("Arts" "Business Day" "Opinion" "Sports" "U.S." "World")'
       '&sort=newest&api-key=' + api_key)

# make an API request
api = requests.get(url)
Acquire
from pymongo import MongoClient

# connection setup is not shown on the slides; database and collection names are assumed
collection = MongoClient()['nyt']['articles']

# parse resulting JSON and insert into a MongoDB collection
for content in api.json()['response']['docs']:
    if not collection.find_one(content):
        collection.insert_one(content)

# only returns 10 per page
print("There are only %i documents returned 0_o" % len(api.json()['response']['docs']))
Acquire
# there are many more than 10 articles however
total_art = articles_left = api.json()['response']['meta']['hits']

print("There are currently %s articles in the NYT archive" % total_art)

#=> There are currently 15277775 articles in the NYT archive
Acquire
Gotchas!
• Rate Limiting
• Page Limiting
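A minimal sketch of handling both gotchas, in Python 3. It assumes a hypothetical `fetch_page(page)` callable that returns `(status_code, docs)`; `fetch_all`, `fake_api`, `delay`, and `max_retries` are illustrative names, not part of the NYT API or the talk's code. A 403 is treated as the rate-limit signal, and the number of pages is capped.

```python
import time


def fetch_all(fetch_page, max_pages, sleep=time.sleep, delay=2, max_retries=3):
    """Collect docs page by page, pausing when the API signals rate limiting.

    fetch_page is a hypothetical callable: page number -> (status_code, docs).
    """
    docs, page, retries = [], 0, 0
    while page < max_pages:          # page limiting: never ask for more than max_pages
        status, batch = fetch_page(page)
        if status == 200:
            if not batch:            # an empty page means there is nothing left
                break
            docs.extend(batch)
            page += 1
            retries = 0
        elif status == 403 and retries < max_retries:
            retries += 1             # rate limited: wait, then retry the same page
            sleep(delay)
        else:
            raise RuntimeError("request failed with status %d" % status)
    return docs


def fake_api():
    """Stand-in for a real API: 3 pages of 10 docs, with one 403 on page 1."""
    state = {"limited": False}

    def fetch_page(page):
        if page == 1 and not state["limited"]:
            state["limited"] = True
            return 403, []
        if page >= 3:
            return 200, []
        return 200, [page * 10 + i for i in range(10)]

    return fetch_page


# sleep is replaced by a no-op so the example runs instantly
articles = fetch_all(fake_api(), max_pages=100, sleep=lambda seconds: None)
print(len(articles))
```

With a real API, `fetch_page` would wrap `requests.get`; the fake version exists only so the loop can be run offline.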
Iterate
Iteration 1:
• (Meaningful) Sample of Data
• Prototype: “Close the Loop”
Acquire (take 2)
Iteration 0.5
import time
from dateutil import parser  # assumed: the slides do not show where parser comes from
from pymongo import errors

# page, page_count, max_pages, last_date, cursor_count and final_page
# are initialized before the loop (not shown on the slides)

# let us loop (and hopefully not hit our rate limit)
while articles_left > 0 and page_count < max_pages:
    more_articles = requests.get(url + "&page=" + str(page) + "&end_date=" + str(last_date))
    print("Inserting page " + str(page))
    # make sure it was successful
    if more_articles.status_code == 200:
        for content in more_articles.json()['response']['docs']:
            latest_article = parser.parse(content['pub_date']).strftime("%Y%m%d")
            if not collection.find_one(content) and content['document_type'] == 'article':
                print("No dups")
                try:
                    print("Inserting article " + str(content['headline']))
                    collection.insert_one(content)
                except errors.DuplicateKeyError:
                    print("Duplicates")
                    continue
            else:
                print("In collection already")
        # …
Acquire
        # (still inside the while loop, after the for loop over docs)
        articles_left -= 10
        page += 1
        page_count += 1
        cursor_count += 1
        final_page = max(final_page, page)
    else:
        if more_articles.status_code == 403:
            print("Sleepy...")
            # account for rate limiting
            time.sleep(2)
        elif cursor_count > 100:
            print("Adjusting date")
            # account for page limiting
            cursor_count = 0
            page = 0
            last_date = latest_article
        else:
            print("ERRORS: " + str(more_articles.status_code))
            cursor_count = 0
            page = 0
            last_date = latest_article
Acquire
Download HTML content of articles from NYT.com (and store in MongoDB™)
from bs4 import BeautifulSoup

# now we can get some content!
#limit = 100
limit = 10000

for article in collection.find({'html': {'$exists': False}}):
    if limit and limit > 0:
        if 'html' not in article and article['document_type'] == 'article':
            limit -= 1
            print(article['web_url'])
            html = requests.get(article['web_url'] + "?smid=tw-nytimes")

            if html.status_code == 200:
                soup = BeautifulSoup(html.text, 'html.parser')

                # serialize html
                collection.update_one({'_id': article['_id']},
                                      {'$set': {'html': str(soup), 'content': []}})

                for p in soup.find_all('div', class_='articleBody'):
                    collection.update_one({'_id': article['_id']},
                                          {'$push': {'content': p.get_text()}})
Parse
Parse HTML with BeautifulSoup and extract the article body (and store in MongoDB™)
# parse HTML content of articles
for article in collection.find({'html': {'$exists': True}}):
    print(article['web_url'])
    soup = BeautifulSoup(article['html'], 'html.parser')
    arts = soup.find_all('div', class_='articleBody')

    # fall back to the alternative article markup
    if len(arts) == 0:
        arts = soup.find_all('p', class_='story-body-text')

    # …
Store
    # inside the loop over articles: append each extracted block to the document
    for p in arts:
        collection.update_one({'_id': article['_id']},
                              {'$push': {'content': p.get_text()}})
Explore
Exploratory Data Analysis with pandas
import matplotlib.pyplot as plt

# articles is a pandas DataFrame with 'text' and 'section' columns
# (its construction is not shown on the slides)
articles.describe()
#          text  section
# count    1405     1405
# unique   1397       10

fig = plt.figure()
# histogram of section counts
articles['section'].value_counts().plot(kind='bar')
Explore
error with NYT API
api_key = 'xxxxxxxxxxxxx'

url = ('http://api.nytimes.com/svc/search/v2/articlesearch.json'
       '?fq=section_name.contains:("Arts" "Business Day" "Opinion" "Sports" "U.S." "World")'
       '&sort=newest&api-key=' + api_key)

# make an API request
api = requests.get(url)
Vectorize
Tokenize article text and create feature vectors with NLTK
import nltk
from nltk import tokenize
from nltk.corpus import stopwords

wnl = nltk.WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def tokenize_and_normalize(chunks):
    # split into sentences, then words, and flatten into one token list
    words = [tokenize.word_tokenize(sent) for sent in tokenize.sent_tokenize("".join(chunks))]
    flatten = [inner for sublist in words for inner in sublist]
    stripped = []

    for word in flatten:
        if word not in stop_words:
            try:
                # repair UTF-8 text that was mis-decoded as Latin-1, then lowercase
                stripped.append(word.encode('latin-1').decode('utf8').lower())
            except UnicodeError:
                print("Cannot encode: " + word)

    # drop single-character tokens (mostly punctuation)
    no_punks = [word for word in stripped if len(word) > 1]
    return [wnl.lemmatize(t) for t in no_punks]
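The slides show the tokenizer but not the step that turns tokens into feature vectors. One common option is a bag-of-words count vector; the sketch below is a pure-Python illustration of that idea, not the talk's code. `build_vocabulary`, `vectorize`, and the toy documents are made-up names and data.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(tokens, vocabulary):
    """Return a list of term counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    # dicts keep insertion order, so columns follow the sorted vocabulary
    return [counts.get(tok, 0) for tok in vocabulary]


# toy token lists standing in for the output of a tokenizer
docs = [["market", "stock", "market"], ["game", "score"]]
vocab = build_vocabulary(docs)
X = [vectorize(tokens, vocab) for tokens in docs]
print(vocab)
print(X)
```

Tokens that are not in the vocabulary are simply ignored, which is how unseen words in new articles would be handled under this scheme.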
Train
Train and score a model with scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# hold out 30% of the data as a test set
xtrain, xtest, ytrain, ytest = train_test_split(X, labels, test_size=0.3)

# train a model
alpha = 1
multi_bayes = MultinomialNB(alpha=alpha)

multi_bayes.fit(xtrain, ytrain)
# mean accuracy on the held-out test set
multi_bayes.score(xtest, ytest)
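To make the `alpha` smoothing parameter concrete, here is a small pure-Python sketch of multinomial Naive Bayes with additive smoothing. It is not scikit-learn's implementation; `TinyMultinomialNB` and the toy documents are made up for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Illustrative multinomial Naive Bayes over lists of tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.vocab = {tok for doc in docs for tok in doc}
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc)
        return self

    def _log_score(self, doc, label):
        counts = self.token_counts[label]
        # smoothed denominator: tokens seen in this class + alpha per vocabulary word
        total = sum(counts.values()) + self.alpha * len(self.vocab)
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        for tok in doc:
            if tok in self.vocab:  # skip words never seen in training
                score += math.log((counts[tok] + self.alpha) / total)
        return score

    def predict(self, doc):
        return max(self.class_counts, key=lambda label: self._log_score(doc, label))

    def score(self, docs, labels):
        hits = sum(self.predict(d) == y for d, y in zip(docs, labels))
        return hits / len(labels)


train_docs = [["stock", "market", "bank"], ["market", "trade"],
              ["game", "team", "score"], ["team", "coach", "game"]]
train_labels = ["Business", "Business", "Sports", "Sports"]

model = TinyMultinomialNB(alpha=1).fit(train_docs, train_labels)
prediction = model.predict(["market", "bank"])
accuracy = model.score([["market", "bank"], ["coach", "score"]], ["Business", "Sports"])
print(prediction, accuracy)
```

With `alpha=1`, a word that never appeared in a class still gets a small non-zero probability, so a single unseen word cannot zero out a whole class.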
Train
Gotchas!
• Model only exists locally on laptop
• Not automated for realtime prediction
Iteration 2:
• Expose your model
• Automate your processes

Exposé
Getting that model off your lap(top)
Source: http://pixel.nymag.com/imgs/daily/vulture/2012/03/09/09_joan-taylor.o.jpg/a_560x0.jpg

Exposé
A model is just a function
Inputs... Outputs...

Exposé
Serialize your model with pickle (or cPickle or joblib)
Persistence
Source: http://www.glogster.com/mrsallenballard/pickles-i-love-em-/g-6mevh13be74mgnc9i8qifa0

Persistence
SerDes
• Disk
• Database
• Memory
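A minimal sketch of serializing and deserializing with the standard-library `pickle` module, covering the memory and disk cases. The `model` dictionary is a hypothetical stand-in for a fitted model object, and the file name is arbitrary.

```python
import os
import pickle
import tempfile

# Hypothetical fitted parameters standing in for a trained model object.
model = {"alpha": 1.0, "classes": ["Business", "Sports"], "log_prior": [-0.69, -0.69]}

# Memory: serialize to bytes (the same bytes could be stored in a database field).
blob = pickle.dumps(model)
restored_from_memory = pickle.loads(blob)

# Disk: write to a file and read it back.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored_from_disk = pickle.load(f)

print(restored_from_memory == model, restored_from_disk == model)
```

Unpickling can execute arbitrary code, so only load pickles from sources you trust.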
Exposé
Deploy your Model to yHat
from yhat import Yhat, YhatModel, preprocess

class DocumentClassifier(YhatModel):
    @preprocess(in_type=dict, out_type=dict)
    def execute(self, data):
        # vectorizer and multi_bayes are the fitted objects from the earlier steps
        # (the vectorizer is not shown on the slides)
        featureBody = vectorizer.transform([data['content']])
        result = multi_bayes.predict(featureBody)
        list_res = result.tolist()
        return {"section_name": list_res}

clf = DocumentClassifier()
yh = Yhat("jonathan@zipfianacademy.com", "xxxxxx", "http://cloud.yhathq.com/")
yh.deploy("documentClassifier", DocumentClassifier, globals())
Present
Create a Flask application to expose your model on the web
Presentation stage: Flask (on Heroku)
from flask import Flask, request, jsonify
from yhat import Yhat

app = Flask(__name__)
yh = Yhat("<USERNAME>", "<API KEY>", "http://cloud.yhathq.com/")

@app.route('/')
def index():
    return app.send_static_file('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    article = request.form['article']
    # the model name must match the name passed to yh.deploy
    results = yh.predict("documentClassifier", {'content': article})
    return jsonify({"results": results})
Pipeline
Only Data should Flow

Data
Remember to Remember (Lineage)
Acquisition → Parse → Storage → Transform/Explore → Vectorization → Train Model → Expose → Presentation
Pipeline
Immutable, append-only set of raw data
Computation is a view on data
*Lambda Architecture by Nathan Marz
Pipeline
Functional Data Science
• Modularity
• Define interfaces
• Separate data from computation
• Data Lineage
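One way to read these bullets: each stage is a pure function from records to new records, the raw data is never modified, and every record carries the list of stages that produced it. The sketch below is only an illustration under those assumptions; `RAW`, `parse`, `tokenize`, and `run_pipeline` are hypothetical names, and the HTML handling is deliberately naive.

```python
from functools import reduce

# Raw data: a tuple of records that no stage modifies in place.
RAW = (
    {"id": 1, "html": "<p>Stocks rallied</p>"},
    {"id": 2, "html": "<p>Team wins game</p>"},
)


def parse(record):
    """Stage 1: strip the (toy) markup and return a new record."""
    text = record["html"].replace("<p>", "").replace("</p>", "")
    return {**record, "text": text, "lineage": record.get("lineage", ()) + ("parse",)}


def tokenize(record):
    """Stage 2: lowercase and split the text, again returning a new record."""
    return {**record, "tokens": record["text"].lower().split(),
            "lineage": record["lineage"] + ("tokenize",)}


def run_pipeline(records, stages):
    """Apply each stage in order; the result is a derived view of the raw data."""
    return reduce(lambda recs, stage: [stage(r) for r in recs], stages, list(records))


view = run_pipeline(RAW, [parse, tokenize])
print(view[0]["tokens"], view[0]["lineage"])
```

Because stages share only the record interface, a stage can be swapped or re-run against the raw data without touching the others.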
Need Robust and Flexible Pipeline!

Pipeline
Whatever you do, DO NOT cross the streams
Pipeline: Where we are
[Diagram components: NYT API, MongoDB, BeautifulSoup, NLTK, Feature Matrix, scikit-learn, Model, Deploy, yHat, Web App, Heroku, POST, Predict, Predicted Section]
Gotchas!
• Only have a static subset of articles
• Pipeline not automated for re-training
Iterate
Iteration 3:
Source: http://vninja.net/wordpress/wp-content/uploads/2013/03/KCaAutomate.png
Where we are
[Diagram components: NYT API, MongoDB, cron, NLTK, Feature Matrix, scikit-learn, Model, Deploy, yHat, Web App, Heroku, POST, Predict, Predicted Section, Amazon EC2]
How to Scale
1. Start small (data) and fast (development), with testing
2. Increase size of data set, with testing
3. Optimize and productionize
4. PROFIT! $$$

How to Scale
1. Develop locally, with testing
2. Distribute computation (run on cluster), with testing
3. Tune parameters
4. PROFIT! $$$
Can also use a streaming algorithm or single-machine, disk-based “medium data” technologies (e.g. database or memory-mapped files)
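As an illustration of the streaming option, the sketch below counts articles per section while holding only one line in memory at a time. The tab-separated layout, `stream_section_counts`, and `fake_feed` are assumptions made for the example.

```python
from collections import Counter


def stream_section_counts(lines):
    """Count section labels one line at a time, without loading all lines into memory."""
    counts = Counter()
    for line in lines:  # works for an open file or any other iterator of lines
        section, _, _headline = line.rstrip("\n").partition("\t")
        counts[section] += 1
    return counts


def fake_feed():
    """Hypothetical generator standing in for a large tab-separated file."""
    yield "Sports\tTeam wins game\n"
    yield "World\tSummit begins\n"
    yield "Sports\tCoach retires\n"


counts = stream_section_counts(fake_feed())
print(counts)
```

Passing an open file object instead of `fake_feed()` would process a file larger than memory in the same way, since Python reads files lazily line by line.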
Products
If you build it...
Source: http://nateemery.com/wp-content/uploads/2013/05/field-of-dreams.jpg
Q&A
Zipfian Academy
@ZipfianAcademy
Data Science & Data Engineering
12-week Bootcamp (May 12th & Sep 8th): http://zipfianacademy.com/apply
Weekend Workshops: http://zipfianacademy.com/workshops
Next: Interactive Visualizations w/ d3.js (June 7th)
Thank You!
Jonathan Dinu

Co-Founder, Zipfian Academy

jonathan@zipfianacademy.com

@clearspandex
@ZipfianAcademy
http://zipfianacademy.com
Appendix
Data Sources
Obtain (ranked by ease of use)
1. DaaS: Data as a Service
2. Bulk Download
3. APIs
4. Web Scraping
Data Sources
DaaS (Data as a Service)
• Time Series/Numeric: Quandl
• Financial Modeling: Quantopian
• Email Contextualization: Rapleaf
• Location and POI: Factual
Data Sources
Bulk Download (just like the good ol’ days)
• File Transfer Protocol (FTP): CDC
• Amazon Web Services: Public Datasets
• Infochimps: Data Marketplace
• Academia: UCI Machine Learning Repository
Data Sources
APIs (if it’s not RESTed, I’m not buying)
• Geographic: Foursquare
• Social: Facebook
• Audio: Rdio
• Content: Tumblr
• Realtime: Twitter
• Hidden: Yahoo Finance
Data Sources
Web Scraping (DIY for life)
1. wget and curl
2. Web Spider/Crawler
3. API scraping
4. Manual Download
Data Formats
• Delimited Values
  • TSV
  • CSV
  • WSV
• JSON
• XML
• Ad Hoc Formats (avoid these if you can)
Data Formats
• JSON is made up of hash tables and arrays
• Hash tables: { "foo" : 1, "bar" : 2, "baz" : "3" }
• Arrays: [1, 2, 3]
• Arrays of arrays: [[1, 2, 3], ["foo", "bar", "baz"]]
• Array of hashes: [{"foo": 1, "bar": 2}, {"baz": 3}]
• Hashes of hashes: {"foo": {"bar": 2, "baz": 3}}
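In Python these structures map directly onto dicts and lists. A minimal sketch using the standard-library `json` module; the sample string is made up for illustration.

```python
import json

text = '{"foo": {"bar": [1, 2, 3], "baz": "3"}}'

data = json.loads(text)            # JSON object -> dict, JSON array -> list
first = data["foo"]["bar"][0]      # nested access: hash of hashes, then array
round_trip = json.dumps(data, sort_keys=True)

print(first, round_trip)
```

Note that JSON requires double-quoted strings and keys, which is why `"3"` stays a string while `3` would be a number.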
Data Formats
{"widget": {
    "debug": "on",
    "window": {
        "title": "Sample Konfabulator Widget",
        "name": "main_window",
        "width": 500,
        "height": 500
    },
    "image": {
        "src": "Images/Sun.png",
        "name": "sun1",
        "hOffset": 250,
        "vOffset": 250,
        "alignment": "center"
    },
    "text": {
        "data": "Click Here",
        "size": 36,
        "style": "bold",
        "name": "text1",
        "hOffset": 250,
        "vOffset": 100,
        "alignment": "center",
        "onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;"
    }
}}
Data Formats
• XML is a recursive self-describing container
<container>
  <item>Foo</item>
  <item>Bar</item>
  <item>
    <container>
      <item attr="SomethingAboutBaz">Baz</item>
    </container>
  </item>
</container>
Data Formats
<widget>
  <debug>on</debug>
  <window title="Sample Konfabulator Widget">
    <name>main_window</name>
    <width>500</width>
    <height>500</height>
  </window>
  <image src="Images/Sun.png" name="sun1">
    <hOffset>250</hOffset>
    <vOffset>250</vOffset>
    <alignment>center</alignment>
  </image>
  <text data="Click Here" size="36" style="bold">
    <name>text1</name>
    <hOffset>250</hOffset>
    <vOffset>100</vOffset>
    <alignment>center</alignment>
    <onMouseUp>
      sun1.opacity = (sun1.opacity / 100) * 90;
    </onMouseUp>
  </text>
</widget>
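A short sketch of reading a document like the widget example with Python's standard-library `xml.etree.ElementTree`. The XML here is a trimmed copy of the sample, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """
<widget>
  <debug>on</debug>
  <window title="Sample Konfabulator Widget">
    <name>main_window</name>
    <width>500</width>
  </window>
</widget>
""".strip()

root = ET.fromstring(xml_text)
window = root.find("window")

title = window.get("title")               # value stored as an attribute
width = int(window.find("width").text)    # value stored as child element text
children = [child.tag for child in root]  # direct children of <widget>

print(title, width, children)
```

The same value can live in an attribute or in a child element, so a parser for a given XML feed has to know which convention that feed uses.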
Data Formats
• Ad hoc data formats
  • Fixed-width (Census data)
  • Graph Edgelists
  • Voting records
  • etc.
Data Formats
• 7-5-5 format (fixed-width columns of 7, 5, and 5 characters)
Sam    foo  bar
Roger  baz  6
Jane   314  99
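A fixed-width format like this has no delimiter, so a parser slices each line by column width. A minimal sketch, assuming the 7-5-5 layout means columns of 7, 5, and 5 characters padded with spaces; `parse_fixed_width` is a made-up helper name.

```python
def parse_fixed_width(line, widths=(7, 5, 5)):
    """Slice a line into fields of the given widths and strip the padding."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields


lines = [
    "Sam    foo  bar",
    "Roger  baz  6",
    "Jane   314  99",
]
rows = [parse_fixed_width(line) for line in lines]
print(rows)
```

Every field comes back as a string; converting `"314"` or `"6"` to numbers is a separate decision that depends on what each column means.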
Data Formats
• Directed Graph Format (edge list: one "source target" pair per line)
1 2
1 3
1 4
2 3
4 4
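An edge list like this can be loaded into an adjacency mapping in a few lines. The sketch assumes each line is a whitespace-separated "source target" pair; `read_edgelist` is an illustrative name.

```python
from collections import defaultdict


def read_edgelist(lines):
    """Build {source: [targets...]} from 'source target' lines."""
    graph = defaultdict(list)
    for line in lines:
        if not line.strip():   # skip blank lines
            continue
        source, target = line.split()
        graph[source].append(target)
    return dict(graph)


graph = read_edgelist(["1 2", "1 3", "1 4", "2 3", "4 4"])
print(graph)
```

Node 3 has no outgoing edges, so it appears only as a target; the last line, `4 4`, is a self-loop.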
Data Formats
Programming languages like Python, Ruby, and R have built-in parsers for data formats such as JSON and CSV. For other esoteric formats you will probably have to write your own.
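For delimited values in Python, the standard-library `csv` module is the built-in parser, and its `delimiter` argument covers TSV as well. A minimal sketch with made-up sample data:

```python
import csv
import io

tsv_text = "name\tsection\nSam\tSports\nJane\tWorld\n"

# DictReader uses the first row as field names; delimiter="\t" switches CSV to TSV
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
sections = [row["section"] for row in rows]

print(sections)
```

`io.StringIO` stands in for an open file here; with a real file, pass the file object opened with `newline=""` as the `csv` documentation recommends.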

Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014