Data Science Toolchain
presented by Jie-Han Chen
slide: https://goo.gl/1hXBGk
Language & Software
Python
R
Java
Matlab
Octave
Jupyter Notebook
Python
Open Source Community
Package
Web Service
Good Readability
Machine Learning
R
Open Source Community
Built-in Statistics Package
Standalone computing & data analysis
Slower than Python
Java
High Performance
Big Data
Poor Visualization, Modeling
Matlab & Octave
Powerful built-in math functions
Simple Data Visualization tool
Prototyping
Jupyter Notebook
Supports 40+ programming languages
e.g. Python, R, Scala...
Excellent for sharing your experiments
Markdown, LaTeX
example1
example2
Data Science Roadmap
Data Science Toolchains
Data Collection
Data Visualization
Data Storage
Algorithm & Modeling
Data Collection
Using API: Facebook, Wikipedia
Web Scraper
HTTP request + HTML Parser
HTTP: python-requests
Better than built-in urllib
Sessions with Cookie Persistence
Thread-safety
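The cookie persistence listed above can be sketched offline: a `requests.Session` keeps a cookie jar and attaches it to each request it prepares. The URL and the cookie name and value are placeholders, not taken from the talk.

```python
import requests

# A Session keeps cookies across the requests made through it.
session = requests.Session()

# Normally a server sets this through a Set-Cookie header. It is set by hand
# here so the example runs without network access (made-up name and value).
session.cookies.set("session_id", "abc123")

# Build a request and let the session prepare it; nothing is sent yet.
request = requests.Request("GET", "https://example.com/profile")
prepared = session.prepare_request(request)

# The stored cookie has been merged into the outgoing headers.
cookie_header = prepared.headers.get("Cookie")
print(cookie_header)

# To actually send it: response = session.send(prepared, timeout=10)
```

In a real scraper the same session object would be reused for the login request and every page fetched afterwards.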
Web page
Parser!
Regular Expression?
BeautifulSoup
HTML/XML parser
BeautifulSoup
Ptt
HTML parser
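A minimal sketch of parsing with BeautifulSoup, assuming a small hand-written page. The HTML structure and class names are invented for illustration and are not the real Ptt markup.

```python
from bs4 import BeautifulSoup

# Stand-in for a downloaded page (hypothetical structure).
html = """
<html><body>
  <div class="post"><a href="/post/1">First post</a></div>
  <div class="post"><a href="/post/2">Second post</a></div>
</body></html>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

posts = []
for div in soup.find_all("div", class_="post"):
    link = div.find("a")
    # Collect the link text and its href attribute.
    posts.append((link.get_text(strip=True), link["href"]))

print(posts)
```

Compared with a regular expression, the parser follows the tag tree, so small changes in whitespace or attribute order do not break the extraction.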
More Powerful Tool?
Scrapy
An open source and collaborative framework for
extracting the data you need from websites. In a
fast, simple, yet extensible way.
Scrapy
$ scrapy startproject tutorial
Scrapy
path: /scrapy/dmoz.py
crawler name: dmoz
Scrapy
$ scrapy crawl dmoz
Scrapy
robots.txt
youtube.com/robots.txt
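Before crawling, a scraper can check robots.txt with the standard library. This is a sketch under assumed rules: the robots.txt content, the crawler name and the URLs are invented, not YouTube's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# https://<site>/robots.txt instead.
rules = """User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# "my-crawler" is a made-up user agent name.
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data.html")
allowed = parser.can_fetch("my-crawler", "https://example.com/index.html")

print(blocked, allowed)
```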
"I believe that visualization is one of the most
powerful means of achieving personal goals."
Harvey Mackay
Data Visualization
Matplotlib, ggplot2
D3.js
Bokeh
Tableau
PlotDB
Leaflet
Matplotlib
ggplot2
D3.js
Data Visualization Project
Interactive
Web frontend
example1
example2
Bokeh
Python, R, Scala, Julia
Interactive
Jupyter Notebook
Tableau
code
Data Source
Data Visualization
Code
Programming
Using GeoJSON with Leaflet
Configurable
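A sketch of producing GeoJSON from Python so that a Leaflet map can load it. The place names and coordinates are approximate illustrations, and `make_point_feature` is a hypothetical helper, not part of any library.

```python
import json


def make_point_feature(name, lon, lat):
    """Return a GeoJSON Feature for a single point."""
    return {
        "type": "Feature",
        # GeoJSON stores coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": name},
    }


# Illustrative data only.
places = [
    ("Taipei 101", 121.5645, 25.0339),
    ("Tainan Station", 120.2129, 22.9971),
]

collection = {
    "type": "FeatureCollection",
    "features": [make_point_feature(*place) for place in places],
}

geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

On the Leaflet side the resulting object can be passed to `L.geoJSON(data).addTo(map)`.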
S3
1. Key-value
2. Permission
3. Data Visualization
4. Big Data (Spark)
Algorithm & Modeling
python-numpy + python-pandas + scikit-learn
libsvm
Spark MLlib
Weka
Deep Learning
Numpy + Pandas + Scikit-learn
Numpy
Implemented in C
Numpy - data structure
ndarray (n-dim array)
ndim
size
shape
dtype
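The four attributes above can be inspected on any array. A small example with made-up values:

```python
import numpy as np

# A 2 x 3 array of 64-bit floats.
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)

print(a.ndim)   # number of dimensions: 2
print(a.size)   # total number of elements: 6
print(a.shape)  # length of each dimension: (2, 3)
print(a.dtype)  # element type: float64
```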
Numpy
generate matrix
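The original slides showed these calls as images. The sketch below lists a few common ways to generate a matrix, which may differ from the ones used in the talk.

```python
import numpy as np

zeros = np.zeros((2, 3))            # 2 x 3 matrix filled with 0.0
ones = np.ones((2, 3))              # 2 x 3 matrix filled with 1.0
identity = np.eye(3)                # 3 x 3 identity matrix
grid = np.arange(6).reshape(2, 3)   # 0..5 arranged in 2 rows, 3 columns
steps = np.linspace(0.0, 1.0, 5)    # 5 evenly spaced values from 0 to 1

# Random values in [0, 1); the seed is fixed only to make runs repeatable.
rng = np.random.default_rng(seed=0)
noise = rng.random((2, 3))

print(grid)
print(steps)
```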
Numpy - linalg
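A short illustration of `numpy.linalg`, assuming a small 2 x 2 system chosen for this example:

```python
import numpy as np

# Solve the linear system
#   2x +  y = 3
#    x + 3y = 5
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)    # solution of A @ x = b
det = np.linalg.det(A)       # determinant: 2*3 - 1*1 = 5
A_inv = np.linalg.inv(A)     # matrix inverse

print(x)
print(det)
```

`solve` is usually preferred over multiplying by `inv(A)`, since it avoids forming the inverse explicitly.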
Pandas
Built on Numpy
Data structures: Series, DataFrame
File formats: csv, json ...
Missing values: NaN
Series
DataFrame - Series
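A sketch of how the two structures relate: a Series is a labeled one-dimensional array, and each column of a DataFrame is a Series. Names and values are made up.

```python
import pandas as pd

index = ["amy", "bob", "cat"]

# Two Series sharing the same index labels.
ages = pd.Series([25, 32, 47], index=index, name="age")
cities = pd.Series(["Taipei", "Tainan", "Hsinchu"], index=index, name="city")

# A DataFrame built from the Series; each one becomes a column.
people = pd.DataFrame({"age": ages, "city": cities})

print(ages["bob"])                 # label-based access on a Series
print(people.loc["cat", "city"])   # row label, column label
print(type(people["age"]))         # selecting a column gives back a Series
```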
Pandas - import
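Importing data might look like the following. In-memory text stands in for files here so the sketch is self-contained; with real data, a file path would be passed instead. The column names are invented.

```python
import io

import pandas as pd

# CSV input (normally: pd.read_csv("scores.csv"), a hypothetical file name).
csv_text = "name,score\namy,90\nbob,75\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON input as a list of records.
json_text = '[{"name": "amy", "score": 90}, {"name": "bob", "score": 75}]'
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```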
Pandas - NaN
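Three common ways to deal with NaN, shown on a tiny made-up table. Which one is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0],
    "weight": [60.0, 72.0, np.nan],
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# Option 1: drop every row that contains a NaN.
complete_rows = df.dropna()

# Option 2: replace NaN with the column mean.
filled = df.fillna(df.mean())

print(missing_per_column)
print(filled)
```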
Pandas - operation
Merge
Grouping
Reshaping
. . .
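The three operations above, sketched on two small invented tables:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer": ["amy", "bob", "amy", "cat"],
    "item": ["pen", "ink", "ink", "pen"],
    "amount": [3, 1, 2, 5],
})
customers = pd.DataFrame({
    "customer": ["amy", "bob", "cat"],
    "city": ["Taipei", "Tainan", "Taipei"],
})

# Merge: attach each customer's city to their orders (like a SQL left join).
merged = orders.merge(customers, on="customer", how="left")

# Grouping: total amount ordered per city.
per_city = merged.groupby("city")["amount"].sum()

# Reshaping: one row per customer, one column per item.
table = merged.pivot_table(index="customer", columns="item",
                           values="amount", aggfunc="sum", fill_value=0)

print(per_city)
print(table)
```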
Dataset
Feature Engineering
Modeling
Evaluation
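The four stages can be sketched end to end with scikit-learn. The dataset (Iris), the scaler and the SVM classifier are choices made for this example, not necessarily those used in the talk.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Dataset: 150 flowers, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Feature engineering + modeling: standardize the features, then fit an SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)

# Evaluation: accuracy on data the model has not seen.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no information from the test split leaks into the model.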
LIBSVM
C
Easy to use
Supports many programming languages
Dataset
LIBSVM - install
$ git clone
$ make
LIBSVM - workflow
LIBSVM - data format
Each line: label index:value index:value ...
label: class of the sample
index: attribute number
value: attribute value
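A sketch of reading and writing one line of the LIBSVM text format. `parse_libsvm_line` and `format_libsvm_line` are hypothetical helpers written for this example and are not part of the LIBSVM distribution.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


def format_libsvm_line(label, dense_row):
    """Write a dense row in sparse form: indices start at 1, zeros are skipped."""
    pairs = [f"{i}:{v:g}" for i, v in enumerate(dense_row, start=1) if v != 0]
    return " ".join([f"{label:g}"] + pairs)


label, features = parse_libsvm_line("+1 1:0.7 3:1 5:-0.2")
line = format_libsvm_line(-1, [0.0, 2.5, 0.0, 1.0])

print(label, features)
print(line)
```

Attributes that are zero are simply left out of the line, which keeps sparse datasets compact.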
LIBSVM - toy
MLlib
Hadoop
Java, Scala, R, Python
Classification: logistic regression, naive Bayes,...
Regression: generalized linear regression, survival regression,...
Decision trees, random forests, and gradient-boosted trees
Recommendation: alternating least squares (ALS)
Clustering: K-means, Gaussian mixtures (GMMs),...
Topic modeling: latent Dirichlet allocation (LDA)
Frequent itemsets, association rules, and sequential pattern mining
MLlib
Feature transformations: standardization,
normalization, hashing,...
ML Pipeline construction
Model evaluation and hyper-parameter tuning
ML persistence: saving and loading models and
Pipelines
Weka
Java library
Big Data
Support GUI
Deep Learning
Theano
Pylearn2
Keras
Tensorflow
Caffe
Deeplearning4J
...
Theano
Based on Numpy
Implemented with Cython
Dynamic C code generation
GPU & CUDA
tensor, math expression
A CPU and GPU Math Compiler in Python
Theano tutorial:
http://www.slideshare.net/SergiiGavrylov/theano-tutorial
Keras
Theano, Tensorflow
Support GPU
prototype
High-level neural networks library
Tool ?
Homework
Github repo Data science
Database, Social Network Analytics, ML library, Deep
Learning Platform ...
README.md: Repo
Demo Code
email: ita3051@gmail.com
Google
https://goo.gl/forms/PQPz8u2glyunQvfM2
