Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib
Jimmy Lai
r97922028 [at] ntu.edu.tw
http://tw.linkedin.com/pub/jimmy-lai/27/4a/536
2013/02/17
Critical Technologies for Big Data Analysis
• Data sources: User Generated Content and Machine Generated Data
• Pipeline stages and typical technologies:
  – Collecting / Storage / Computing: infrastructure (C/Java)
  – Analysis: Python/R
  – Visualization: Javascript
• Please refer to http://www.slideshare.net/jimmy_lai/when-big-data-meet-python for more detail.
Fast prototyping - IPython Notebook
• Write Python code in the browser:
  – Exploit remote server resources
  – View graphical results in the web page
  – Sketch code pieces as blocks
  – Refer to http://www.slideshare.net/jimmy_lai/fast-data-mining-flow-prototyping-using-ipython-notebook for a more detailed introduction.




Demo Code
• Demo Code:
  ipython_demo/text_classification_demo.ipynb
  in https://bitbucket.org/noahsark/slideshare
• IPython Notebook:
  – Install:
    $ pip install ipython
  – Execution (under the ipython_demo dir):
    $ ipython notebook --pylab=inline
  – Open the notebook in a browser, e.g.
    http://127.0.0.1:8888

Machine learning classification
•   X_i = [x_1, x_2, ..., x_n], each x_j ∈ R (feature vector of the i-th article)
•   y_i ∈ N (class label of the i-th article)
•   dataset = (X, Y)
•   classifier f: y_i = f(X_i)
•   E.g. X_i is the token-count vector of one article and y_i is the index of its newsgroup.




Text classification
• Processing flow:
  1. Feature Generation
  2. Feature Selection
  3. Classification Model Training
  4. Model Parameter Tuning (feeding back into the steps above)
Dataset: 20 newsgroups dataset
• Example article – the header fields are structured data, the body is text:

From: zyeh@caspian.usc.edu (zhenghao yeh)
Subject: Re: Newsgroup Split
Organization: University of Southern California, Los Angeles, CA
Lines: 18
Distribution: world
NNTP-Posting-Host: caspian.usc.edu

In article <1quvdoINN3e7@srvr1.engin.umich.edu>, tdawson@engin.umich.edu
(Chris Herringshaw) writes:
|> Concerning the proposed newsgroup split, I personally am not in favor of
|> doing this. I learn an awful lot about all aspects of graphics by reading
|> this group, from code to hardware to algorithms. I just think making 5
|> different groups out of this is a wate, and will only result in a few posts
|> a week per group. I kind of like the convenience of having one big forum
|> for discussing all aspects of graphics. Anyone else feel this way?
|> Just curious.
|>
|>
|> Daemon
|>

I agree with you. Of cause I'll try to be a daemon :-)

Yeh
USC
Dataset in sklearn
• sklearn.datasets
  – Toy datasets
  – Download data from http://mldata.org repository
• Data format of classification problem
  – Dataset
     • data: [raw_data or numerical]
     • target: [int]
     • target_names: [str]
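
• A minimal sketch of loading the 20 newsgroups data with sklearn.datasets (fetch_20newsgroups and the data/target/target_names fields are scikit-learn's API; the subset choice here is an illustrative assumption):

  from sklearn.datasets import fetch_20newsgroups

  # Download (or load from the local cache) the training split of 20 newsgroups.
  newsgroups = fetch_20newsgroups(subset='train')

  print(len(newsgroups.data))         # raw article texts
  print(newsgroups.target[:10])       # int class labels
  print(newsgroups.target_names[:5])  # str class names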


Feature extraction from structured data (1/2)
• Count the frequency of each keyword and select these keywords as features (see the counting sketch below):
  ['From', 'Subject', 'Organization', 'Distribution', 'Lines']
• Keyword counts over the dataset:

  Keyword        Count
  Distribution    2549
  Summary          397
  Disclaimer       125
  File             257
  Expires          116
  Subject        11612
  From           11398
  Keywords         943
  Originator       291
  Organization   10872
  Lines          11317
  Internet         140
  To               106

• E.g.
  From: lerxst@wam.umd.edu (where's my thing)
  Subject: WHAT car is this!?
  Organization: University of Maryland, College Park
  Distribution: None
  Lines: 15
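
• A minimal sketch of counting the header keywords across the articles; the keyword list comes from the slide, while the line-splitting logic is an assumption about how the counts above were produced:

  from collections import Counter
  from sklearn.datasets import fetch_20newsgroups

  keywords = ['From', 'Subject', 'Organization', 'Distribution', 'Lines']
  # fetch_20newsgroups keeps the header lines in each raw article by default.
  articles = fetch_20newsgroups(subset='train').data

  counts = Counter()
  for article in articles:
      for line in article.splitlines():
          key = line.split(':', 1)[0]
          if key in keywords:
              counts[key] += 1
  print(counts)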
Feature extraction from structured data (2/2)
• Separate structured data and text data
  – The text data starts after the "Lines:" header
• Transform the token matrix into a numerical matrix with sklearn.feature_extraction.DictVectorizer (see the sketch below)
• E.g. [{'a': 1, 'b': 1}, {'c': 1}] => [[1, 1, 0], [0, 0, 1]]
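
• A minimal sketch of DictVectorizer on the toy example above (the class lives in sklearn.feature_extraction; sparse=False is used only so the dense matrix can be printed):

  from sklearn.feature_extraction import DictVectorizer

  vec = DictVectorizer(sparse=False)
  # Each dict maps a token to its count within one article.
  X = vec.fit_transform([{'a': 1, 'b': 1}, {'c': 1}])
  print(X)                        # [[1. 1. 0.], [0. 0. 1.]]
  print(vec.get_feature_names())  # ['a', 'b', 'c']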
Text Feature extraction in sklearn
• sklearn.feature_extraction.text
• CountVectorizer
  – Transform articles into token-count matrix
• TfidfVectorizer
  – Transform articles into token-TFIDF matrix
• Usage (see the sketch below):
  – fit(): construct token dictionary given dataset
  – transform(): generate numerical matrix
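
• A minimal sketch of the fit()/transform() usage (the classes and methods are scikit-learn's; the two toy documents are illustrative):

  from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

  docs = ["the car is red", "the graphics group may split"]

  count_vec = CountVectorizer()
  count_vec.fit(docs)                       # construct the token dictionary
  X_counts = count_vec.transform(docs)      # sparse token-count matrix

  tfidf_vec = TfidfVectorizer()
  X_tfidf = tfidf_vec.fit_transform(docs)   # sparse token-TFIDF matrix
  print(X_counts.shape, X_tfidf.shape)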

Text Feature extraction
• Analyzer
  – Preprocessor: str -> str
     • Default: lowercase
     • Extra: strip_accents – handle unicode chars
  – Tokenizer: str -> [str]
      • Default: re.findall(ur"(?u)\b\w\w+\b", string)
  – Analyzer: str -> [str]
     1. Call preprocessor and tokenizer
     2. Filter stopwords
     3. Generate n-gram tokens
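
• A minimal sketch of configuring the analyzer steps through CountVectorizer parameters (the parameter names are scikit-learn's; the concrete settings and the sample string are illustrative assumptions):

  from sklearn.feature_extraction.text import CountVectorizer

  vec = CountVectorizer(
      lowercase=True,           # preprocessor: lowercase the raw string
      strip_accents='unicode',  # preprocessor: normalize accented characters
      stop_words='english',     # analyzer step 2: filter stopwords
      ngram_range=(1, 2),       # analyzer step 3: generate unigrams and bigrams
  )
  print(vec.build_analyzer()("Résumé of the Graphics Group"))
  # e.g. ['resume', 'graphics', 'group', 'resume graphics', 'graphics group']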

Feature Selection
• Decrease the number of features:
  – Reduce the resource usage for faster learning
  – Remove the most common tokens and the rarest tokens (words carrying less information):
     • Parameters for the Vectorizer (see the sketch below):
        – max_df
        – min_df
        – max_features
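
• A minimal sketch of pruning the vocabulary with these parameters (the threshold values are illustrative assumptions):

  from sklearn.feature_extraction.text import TfidfVectorizer

  vec = TfidfVectorizer(
      max_df=0.5,          # drop tokens that appear in more than 50% of articles
      min_df=2,            # drop tokens that appear in fewer than 2 articles
      max_features=10000,  # keep at most the 10000 most frequent tokens
  )
  # X = vec.fit_transform(articles)  # 'articles' is a list of raw texts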




Classification Model Training
• Common classifiers in sklearn:
  – sklearn.linear_model
  – sklearn.svm
• Usage (see the sketch below):
  – fit(X, Y): train the model
  – predict(X): get predicted Y
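
• A minimal sketch of fit()/predict() with LinearSVC from sklearn.svm (the toy documents and labels are illustrative assumptions):

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.svm import LinearSVC

  docs = ["cheap graphics card for sale", "vote on the newsgroup split",
          "new graphics card benchmark", "proposal to split the group"]
  labels = [0, 1, 0, 1]  # made-up classes: 0 = hardware, 1 = meta discussion

  X = TfidfVectorizer().fit_transform(docs)
  clf = LinearSVC()
  clf.fit(X, labels)     # train the model
  print(clf.predict(X))  # predicted Y, here on the training articles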




Cross Validation
• When tuning model parameters, let each article serve alternately as training and testing data, so that the parameters are not fitted to some specific articles (see the sketch below).
  – from sklearn.cross_validation import KFold
  – for train_index, test_index in KFold(10, 2):
     • train_index = [5 6 7 8 9]
     • test_index = [0 1 2 3 4]
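
• A minimal sketch of iterating over the folds (sklearn.cross_validation.KFold is the API of that scikit-learn generation; newer versions move it to sklearn.model_selection with a different call signature):

  from sklearn.cross_validation import KFold

  n_articles = 10
  for train_index, test_index in KFold(n_articles, 2):
      # e.g. first fold: train [5 6 7 8 9], test [0 1 2 3 4]
      print("train:", train_index, "test:", test_index)
      # X_train, X_test = X[train_index], X[test_index]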


Performance Evaluation
• precision = tp / (tp + fp)
• recall = tp / (tp + fn)
• f1score = 2 × (precision × recall) / (precision + recall)
• sklearn.metrics (see the sketch below)
  – precision_score
  – recall_score
  – f1_score
Source: http://en.wikipedia.org/wiki/Precision_and_recall
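
• A minimal sketch of computing the three scores with sklearn.metrics (the label arrays are illustrative):

  from sklearn.metrics import precision_score, recall_score, f1_score

  y_true = [0, 1, 1, 0, 1]
  y_pred = [0, 1, 0, 0, 1]

  print(precision_score(y_true, y_pred))  # tp / (tp + fp)
  print(recall_score(y_true, y_pred))     # tp / (tp + fn)
  print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall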
Visualization
1. Matplotlib
2. plot() function of Series, DataFrame
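
• A minimal sketch of plotting scores with the pandas plot() wrapper over matplotlib (the score values and classifier names are made up for illustration):

  import matplotlib.pyplot as plt
  import pandas as pd

  # Hypothetical F1 scores for a few classifiers.
  scores = pd.Series([0.81, 0.85, 0.88],
                     index=['NaiveBayes', 'LogisticRegression', 'LinearSVC'])
  scores.plot(kind='bar')   # Series.plot() draws with matplotlib
  plt.ylabel('F1 score')
  plt.show()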




Experiment Result




• Future work:
  – Feature selection by statistics or dimensionality reduction
  – Parameter tuning
  – Ensemble models

