Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib
Jimmy Lai
r97922028 [at] ntu.edu.tw
http://tw.linkedin.com/pub/jimmy-lai/27/4a/536
2013/02/17
Critical Technologies for Big Data Analysis
• Data sources: User Generated Content and Machine Generated Data
• Pipeline stages and typical technologies:
  – Collecting / Storage / Computing: infrastructure (C/Java)
  – Analysis: Python/R
  – Visualization: Javascript
• Please refer to http://www.slideshare.net/jimmy_lai/when-big-data-meet-python for more detail.
Fast prototyping - IPython Notebook
• Write Python code in the browser:
  – Exploit remote server resources
  – View graphical results in the web page
  – Sketch code pieces as blocks
  – Refer to http://www.slideshare.net/jimmy_lai/fast-data-mining-flow-prototyping-using-ipython-notebook for a more detailed introduction.




Demo Code
• Demo Code:
  ipython_demo/text_classification_demo.ipynb
  in https://bitbucket.org/noahsark/slideshare
• IPython Notebook:
  – Install:
    $ pip install ipython
  – Execution (under the ipython_demo dir):
    $ ipython notebook --pylab=inline
  – Open the notebook in a browser, e.g.
    http://127.0.0.1:8888

Machine learning classification
•   X_i = [x_1, x_2, ..., x_n], each x_j ∈ R (feature vector of the i-th article)
•   y_i ∈ N (class label of the i-th article)
•   dataset = (X, Y)
•   classifier f: y_i = f(X_i)
•   E.g. X_i is the token-count vector of one article and y_i is the index of its newsgroup.




Text classification
• Processing flow:
  1. Feature Generation
  2. Feature Selection
  3. Classification Model Training
  4. Model Parameter Tuning (feeding back into the steps above)
Dataset: 20 newsgroups dataset
• Example article – the header fields are structured data, the body is text:

From: zyeh@caspian.usc.edu (zhenghao yeh)
Subject: Re: Newsgroup Split
Organization: University of Southern California, Los Angeles, CA
Lines: 18
Distribution: world
NNTP-Posting-Host: caspian.usc.edu

In article <1quvdoINN3e7@srvr1.engin.umich.edu>, tdawson@engin.umich.edu
(Chris Herringshaw) writes:
|> Concerning the proposed newsgroup split, I personally am not in favor of
|> doing this. I learn an awful lot about all aspects of graphics by reading
|> this group, from code to hardware to algorithms. I just think making 5
|> different groups out of this is a wate, and will only result in a few posts
|> a week per group. I kind of like the convenience of having one big forum
|> for discussing all aspects of graphics. Anyone else feel this way?
|> Just curious.
|>
|>
|> Daemon
|>

I agree with you. Of cause I'll try to be a daemon :-)

Yeh
USC
Dataset in sklearn
• sklearn.datasets
  – Toy datasets
  – Download data from http://mldata.org repository
• Data format of classification problem
  – Dataset
     • data: [raw_data or numerical]
     • target: [int]
     • target_names: [str]
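
• A minimal sketch of loading the 20 newsgroups data with sklearn.datasets (fetch_20newsgroups and the data/target/target_names fields are scikit-learn's API; the subset choice here is an illustrative assumption):

  from sklearn.datasets import fetch_20newsgroups

  # Download (or load from the local cache) the training split of 20 newsgroups.
  newsgroups = fetch_20newsgroups(subset='train')

  print(len(newsgroups.data))         # raw article texts
  print(newsgroups.target[:10])       # int class labels
  print(newsgroups.target_names[:5])  # str class names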


Feature extraction from structured data (1/2)
• Count the frequency of each keyword and select these keywords as features (see the counting sketch below):
  ['From', 'Subject', 'Organization', 'Distribution', 'Lines']
• Keyword counts over the dataset:

  Keyword        Count
  Distribution    2549
  Summary          397
  Disclaimer       125
  File             257
  Expires          116
  Subject        11612
  From           11398
  Keywords         943
  Originator       291
  Organization   10872
  Lines          11317
  Internet         140
  To               106

• E.g.
  From: lerxst@wam.umd.edu (where's my thing)
  Subject: WHAT car is this!?
  Organization: University of Maryland, College Park
  Distribution: None
  Lines: 15
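
• A minimal sketch of counting the header keywords across the articles; the keyword list comes from the slide, while the line-splitting logic is an assumption about how the counts above were produced:

  from collections import Counter
  from sklearn.datasets import fetch_20newsgroups

  keywords = ['From', 'Subject', 'Organization', 'Distribution', 'Lines']
  # fetch_20newsgroups keeps the header lines in each raw article by default.
  articles = fetch_20newsgroups(subset='train').data

  counts = Counter()
  for article in articles:
      for line in article.splitlines():
          key = line.split(':', 1)[0]
          if key in keywords:
              counts[key] += 1
  print(counts)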
Feature extraction from structured data (2/2)
• Separate structured data and text data
  – The text data starts after the "Lines:" header
• Transform the token matrix into a numerical matrix with sklearn.feature_extraction.DictVectorizer (see the sketch below)
• E.g. [{'a': 1, 'b': 1}, {'c': 1}] => [[1, 1, 0], [0, 0, 1]]
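
• A minimal sketch of DictVectorizer on the toy example above (the class lives in sklearn.feature_extraction; sparse=False is used only so the dense matrix can be printed):

  from sklearn.feature_extraction import DictVectorizer

  vec = DictVectorizer(sparse=False)
  # Each dict maps a token to its count within one article.
  X = vec.fit_transform([{'a': 1, 'b': 1}, {'c': 1}])
  print(X)                        # [[1. 1. 0.], [0. 0. 1.]]
  print(vec.get_feature_names())  # ['a', 'b', 'c']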
Text Feature extraction in sklearn
• sklearn.feature_extraction.text
• CountVectorizer
  – Transform articles into token-count matrix
• TfidfVectorizer
  – Transform articles into token-TFIDF matrix
• Usage (see the sketch below):
  – fit(): construct token dictionary given dataset
  – transform(): generate numerical matrix
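
• A minimal sketch of the fit()/transform() usage (the classes and methods are scikit-learn's; the two toy documents are illustrative):

  from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

  docs = ["the car is red", "the graphics group may split"]

  count_vec = CountVectorizer()
  count_vec.fit(docs)                       # construct the token dictionary
  X_counts = count_vec.transform(docs)      # sparse token-count matrix

  tfidf_vec = TfidfVectorizer()
  X_tfidf = tfidf_vec.fit_transform(docs)   # sparse token-TFIDF matrix
  print(X_counts.shape, X_tfidf.shape)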

Text Feature extraction
• Analyzer
  – Preprocessor: str -> str
     • Default: lowercase
     • Extra: strip_accents – handle unicode chars
  – Tokenizer: str -> [str]
      • Default: re.findall(ur"(?u)\b\w\w+\b", string)
  – Analyzer: str -> [str]
     1. Call preprocessor and tokenizer
     2. Filter stopwords
     3. Generate n-gram tokens
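
• A minimal sketch of configuring the analyzer steps through CountVectorizer parameters (the parameter names are scikit-learn's; the concrete settings and the sample string are illustrative assumptions):

  from sklearn.feature_extraction.text import CountVectorizer

  vec = CountVectorizer(
      lowercase=True,           # preprocessor: lowercase the raw string
      strip_accents='unicode',  # preprocessor: normalize accented characters
      stop_words='english',     # analyzer step 2: filter stopwords
      ngram_range=(1, 2),       # analyzer step 3: generate unigrams and bigrams
  )
  print(vec.build_analyzer()("Résumé of the Graphics Group"))
  # e.g. ['resume', 'graphics', 'group', 'resume graphics', 'graphics group']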

Feature Selection
• Decrease the number of features:
  – Reduce the resource usage for faster learning
  – Remove the most common tokens and the rarest tokens (words carrying less information):
     • Parameters for the Vectorizer (see the sketch below):
        – max_df
        – min_df
        – max_features
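
• A minimal sketch of pruning the vocabulary with these parameters (the threshold values are illustrative assumptions):

  from sklearn.feature_extraction.text import TfidfVectorizer

  vec = TfidfVectorizer(
      max_df=0.5,          # drop tokens that appear in more than 50% of articles
      min_df=2,            # drop tokens that appear in fewer than 2 articles
      max_features=10000,  # keep at most the 10000 most frequent tokens
  )
  # X = vec.fit_transform(articles)  # 'articles' is a list of raw texts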




Classification Model Training
• Common classifiers in sklearn:
  – sklearn.linear_model
  – sklearn.svm
• Usage (see the sketch below):
  – fit(X, Y): train the model
  – predict(X): get predicted Y
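
• A minimal sketch of fit()/predict() with LinearSVC from sklearn.svm (the toy documents and labels are illustrative assumptions):

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.svm import LinearSVC

  docs = ["cheap graphics card for sale", "vote on the newsgroup split",
          "new graphics card benchmark", "proposal to split the group"]
  labels = [0, 1, 0, 1]  # made-up classes: 0 = hardware, 1 = meta discussion

  X = TfidfVectorizer().fit_transform(docs)
  clf = LinearSVC()
  clf.fit(X, labels)     # train the model
  print(clf.predict(X))  # predicted Y, here on the training articles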




Cross Validation
• When tuning model parameters, let each article serve alternately as training and testing data, so that the parameters are not fitted to some specific articles (see the sketch below).
  – from sklearn.cross_validation import KFold
  – for train_index, test_index in KFold(10, 2):
     • train_index = [5 6 7 8 9]
     • test_index = [0 1 2 3 4]
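
• A minimal sketch of iterating over the folds (sklearn.cross_validation.KFold is the API of that scikit-learn generation; newer versions move it to sklearn.model_selection with a different call signature):

  from sklearn.cross_validation import KFold

  n_articles = 10
  for train_index, test_index in KFold(n_articles, 2):
      # e.g. first fold: train [5 6 7 8 9], test [0 1 2 3 4]
      print("train:", train_index, "test:", test_index)
      # X_train, X_test = X[train_index], X[test_index]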


Performance Evaluation
• precision = tp / (tp + fp)
• recall = tp / (tp + fn)
• f1score = 2 × (precision × recall) / (precision + recall)
• sklearn.metrics (see the sketch below)
  – precision_score
  – recall_score
  – f1_score
Source: http://en.wikipedia.org/wiki/Precision_and_recall
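
• A minimal sketch of computing the three scores with sklearn.metrics (the label arrays are illustrative):

  from sklearn.metrics import precision_score, recall_score, f1_score

  y_true = [0, 1, 1, 0, 1]
  y_pred = [0, 1, 0, 0, 1]

  print(precision_score(y_true, y_pred))  # tp / (tp + fp)
  print(recall_score(y_true, y_pred))     # tp / (tp + fn)
  print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall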
Visualization
1. Matplotlib
2. plot() function of Series, DataFrame
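
• A minimal sketch of plotting scores with the pandas plot() wrapper over matplotlib (the score values and classifier names are made up for illustration):

  import matplotlib.pyplot as plt
  import pandas as pd

  # Hypothetical F1 scores for a few classifiers.
  scores = pd.Series([0.81, 0.85, 0.88],
                     index=['NaiveBayes', 'LogisticRegression', 'LinearSVC'])
  scores.plot(kind='bar')   # Series.plot() draws with matplotlib
  plt.ylabel('F1 score')
  plt.show()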




Experiment Result




• Future work:
  – Feature selection by statistics or dimensionality reduction
  – Parameter tuning
  – Ensemble models

