Fast data mining flow prototyping
     using IPython Notebook
            2013/01/31
             Jimmy Lai
      r97922028 [at] ntu.edu.tw
Outline
1.   Workflow for data mining
2.   What IPython Notebook provides
3.   Exemplified by text classification
4.   Demo code and Notebook usage




                       IPython Notebook   2
Workflow for data mining
• Traditional programming workflow:
  – Edit -> Compile -> Run
• Data Mining workflow:
  – Execute -> Explore
  – Consists of many data processing stages; at each
    stage we may try several different methods.
  – Stages: data parsing, feature extraction, feature
    selection, model training, prediction,
    post-processing, etc.
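The staged workflow above can be sketched as a chain of small functions, one per stage. This is only an illustration: the function names, the tab-separated input format, and the sample data are assumptions and are not taken from the demo code.

```python
# Hypothetical stage functions; names and data are illustrative only.
from collections import Counter

def parse(raw_lines):
    """Data parsing: split each 'label<TAB>text' line into a (label, text) pair."""
    return [tuple(line.split("\t", 1)) for line in raw_lines]

def extract_features(text):
    """Feature extraction: lowercase word counts."""
    return Counter(text.lower().split())

def select_features(features, min_count=1):
    """Feature selection: keep words seen at least min_count times."""
    return {w: c for w, c in features.items() if c >= min_count}

raw = ["sport\tthe team won the game", "tech\tthe new chip is fast"]

# In a notebook, each step below would sit in its own cell,
# so one stage can be changed and re-run without repeating the others.
records = parse(raw)
features = [extract_features(text) for _, text in records]
selected = [select_features(f, min_count=2) for f in features]
print(selected)
```

Changing `min_count`, or swapping one stage for another method, only requires re-running that stage and the ones after it.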
What IPython Notebook provides
• Interactive Web IDE
  – Display rich data such as matplotlib plots and
    LaTeX math symbols
  – Code cells for sketching
  – Execute pieces of code in arbitrary order
  – Browser interface for programming remotely
  – Easy to present code and execution results as HTML
    or PDF.
• IPython Notebook makes it easy to sketch a data
  analysis.
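Rich display works through special methods that an object may define, such as `_repr_html_` and `_repr_latex_`. The sketch below shows the idea on a toy class; the `Ratio` class is hypothetical and exists only for this example.

```python
class Ratio:
    """Toy value that knows how to render itself in a notebook (illustrative)."""

    def __init__(self, num, den):
        self.num, self.den = num, den

    def __repr__(self):
        # Plain-text form, used by ordinary consoles
        return "Ratio(%d, %d)" % (self.num, self.den)

    def _repr_latex_(self):
        # LaTeX form, which the notebook front end can typeset as math
        return r"$\frac{%d}{%d}$" % (self.num, self.den)

    def _repr_html_(self):
        # HTML form, which the notebook front end can render in the browser
        return "<b>%d</b> / <b>%d</b>" % (self.num, self.den)

r = Ratio(3, 4)
r  # as the last expression of a cell, the notebook displays a rich form
```

Outside a notebook the object simply falls back to its `__repr__` text.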

Demo code and Notebook usage
• Demo Code: ipython_demo directory in
  https://bitbucket.org/noahsark/slideshare
• IPython Notebook:
  – Install
  $ pip install ipython
  – Execution (under ipython_demo dir)
  $ ipython notebook --pylab=inline
  – Open notebook with browser, e.g.
    http://127.0.0.1:8888

IPython Notebook Interface




Exemplified by text classification
• Text classification on a newsgroup dataset.
• Dataset:
  – Built into sklearn.datasets
  – Each article belongs to one of 20 newsgroups
• Goal: classify each article into one of the
  newsgroup names.
• Experiment: feature generation using different
  n-gram parameters.
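To make "different n-gram parameters" concrete, here is a minimal pure-Python sketch of word n-gram counting. It is a stand-in and not the demo's actual code: the `ngram_counts` helper and the sample sentence are assumptions. The `ngram_range` argument mimics the parameter of the same name on scikit-learn's `CountVectorizer`.

```python
from collections import Counter

def ngram_counts(text, ngram_range=(1, 1)):
    """Count word n-grams for every n in the inclusive range (min_n, max_n)."""
    tokens = text.lower().split()
    min_n, max_n = ngram_range
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Made-up sentence standing in for a newsgroup article
article = "data mining with python makes data mining fun"

# Each setup is one experiment configuration: unigrams, unigrams + bigrams, bigrams
for setup in [(1, 1), (1, 2), (2, 2)]:
    feats = ngram_counts(article, ngram_range=setup)
    print(setup, len(feats), "distinct features")
```

Wider ranges produce more features from the same text, which is the trade-off the experiment explores.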
Example article
• Newsgroup: talk.politics.mideast




Sample result of feature extraction




Table of experiment setups




Experiment Result




Observation from plots



