TEXT CLASSIFICATION
using
SUPPORT VECTOR MACHINES in R
by
Sai Srinivas Kotni
[14BM60083]
Under the guidance of
Prof. Susmita Mukhopadhyay
Problem Statement

Automated text classification is widely regarded as a vital method for managing and processing the vast and continuously growing volume of documents in digital form. In general, text classification plays an important role in information extraction and summarization, text retrieval, and question answering.

Objective:
• Create an efficient support vector machine (SVM) model for text classification/categorization
• Measure its performance
Introduction

Text classification (text categorization): assign documents to one or more predefined categories.

Applications of text classification:
• Organize web pages into hierarchies
• Domain-specific information extraction
• Sort email into different folders
• Find interests of users

Common methods:
• Manual classification
• Automatic document classification: supervised learning of a document-label assignment function, e.g.
  • Naive Bayes (simple, common method)
  • k-Nearest Neighbors (simple, powerful)
  • Support vector machines (newer, more powerful) and many more

[Figure: documents sorted into example categories such as Sport, Science, Theory, Art]
Examples

Labels may be domain-specific binary:
• e.g., "interesting-to-me" : "not-interesting-to-me"
• e.g., "spam" : "not-spam"
• e.g., "contains adult language" : "doesn't"

LABELS = TOPICS
• "finance" / "sports" / "asia"

LABELS = OPINION
• "like" / "hate" / "neutral"

LABELS = AUTHOR
• "Shakespeare" / "Marlowe" / "Ben Jonson"

Labels may be genres:
• e.g., "editorials" / "movie-reviews" / "news"

Given:
• A description of an instance, x ∈ X, where X is the instance language or instance space (e.g., how to represent text documents).
• A fixed set of categories C = {c1, c2, …, cn}.

Determine:
• The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

The task: assign labels to each document or web page.
Decision Tree model

Decision Tree (DT):
• A tree where the root and each internal node are labeled with a question.
• The arcs represent each possible answer to the associated question.
• Each leaf node represents a prediction of a solution to the problem.
• A popular technique for classification; the leaf node indicates the class to which the corresponding tuple belongs.

A Decision Tree model is a computational model consisting of three parts:
• the decision tree,
• an algorithm to create the tree, and
• an algorithm that applies the tree to data.

Creation of the tree is the most difficult part. Processing is basically a search similar to that in a binary search tree (although a DT need not be binary). The decision tree approach to classification divides the search space into rectangular regions; a tuple is classified based on the region into which it falls.
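As a concrete illustration, here is a minimal decision-tree sketch in R using the rpart package (listed later among this project's packages); the built-in iris data set stands in for a document-feature table and is purely illustrative.

# Minimal decision-tree classification sketch (rpart, illustrative data)
library(rpart)

set.seed(1)
idx   <- sample(nrow(iris), 100)   # disjoint training and test subsets
train <- iris[idx, ]
test  <- iris[-idx, ]

fit  <- rpart(Species ~ ., data = train, method = "class")  # create the tree
pred <- predict(fit, test, type = "class")                  # apply the tree to data
mean(pred == test$Species)                                  # classification accuracy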
Naive Bayes Algorithm

Formula

The Naive Bayes algorithm works on conditional probability:

P(Ck | x) = P(Ck) · P(x | Ck) / P(x),  where k ∈ {positive, negative}

where
• P(Ck | x) – the probability that the tweet has positive/negative sentiment, given its words
• P(Ck) – the prior probability of the negative/positive dataframe (class)
• P(x | Ck) – the probability of the words in the tweet given the class, under the bag-of-words assumption: P(x | Ck) = P(x1 | Ck) · P(x2 | Ck) · P(x3 | Ck) · P(x4 | Ck) · …

The sentiment with the highest probability value is selected.
Logic behind the Model

• Suppose we have trained the model using an Excel file containing 10 tweets: 3 positive and 7 negative.
• Then P(Positive) = 0.3 and P(Negative) = 0.7.
• Suppose our tweet is "I had an awesome experience", with its words represented by x1, x2, x3, x4, x5.

P(Pos | x1 x2 x3 x4 x5) = P(Pos) · P(x1|pos) · P(x2|pos) · P(x3|pos) · P(x4|pos) · P(x5|pos) ..... (1)
P(Neg | x1 x2 x3 x4 x5) = P(Neg) · P(x1|neg) · P(x2|neg) · P(x3|neg) · P(x4|neg) · P(x5|neg) ..... (2)

If (1) > (2), the tweet is classified as positive; otherwise, it is classified as negative.

Each word probability is estimated with Laplace smoothing:

P(xi | Ck) = (Nk + 1) / (N + D)

where
• Nk – number of times xi appears in the positive (or negative) dataframe repository
• N – total number of words in that repository, including repetitions
• D – total number of distinct words across the positive and negative repositories
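The scoring above can be sketched in R as follows; the training tweets are hypothetical illustration data, and the helper functions are made up for this sketch, not part of any package.

# Hand-rolled Naive Bayes sentiment scoring with Laplace smoothing (illustrative)
train <- data.frame(
  text  = c("great product", "awesome experience", "love it",
            "terrible service", "worst ever", "very bad", "hate this",
            "awful quality", "never again", "so disappointing"),
  label = c(rep("pos", 3), rep("neg", 7)),
  stringsAsFactors = FALSE
)

tokenize <- function(s) unlist(strsplit(tolower(s), "\\s+"))

# Word counts per class; D = distinct words across both classes
words_by_class <- lapply(split(train$text, train$label),
                         function(txts) table(unlist(lapply(txts, tokenize))))
D <- length(unique(unlist(lapply(train$text, tokenize))))

# Laplace-smoothed P(word | class) = (Nk + 1) / (N + D), as in the formula above
word_prob <- function(word, class) {
  counts <- words_by_class[[class]]
  Nk <- if (word %in% names(counts)) counts[[word]] else 0
  (Nk + 1) / (sum(counts) + D)
}

# Score each class: P(class) * product over words of P(word | class)
classify <- function(tweet) {
  priors <- table(train$label) / nrow(train)
  scores <- sapply(names(priors), function(cl)
    priors[[cl]] * prod(sapply(tokenize(tweet), word_prob, class = cl)))
  names(which.max(scores))
}

classify("I had an awesome experience")   # "pos" for this toy data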
Support Vector Machines

Main idea of SVMs:
• Find the linear separating hyperplane that maximizes the margin, i.e., the optimal separating hyperplane (OSH).
• Supervised learning.

Support vector machines are based on the Structural Risk Minimization principle from computational learning theory. The idea of structural risk minimization is to find a hypothesis h for which we can guarantee the lowest true error.

Why should SVMs work well for text categorization?
• High-dimensional input space
• Document vectors are sparse
• Few irrelevant features
• Most text categorization problems are linearly separable
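A minimal sketch of the idea in R with the e1071 package (consistent with the package list later in this deck): a linear-kernel SVM fit on toy, linearly separable 2-D data invented for illustration; the support vectors are the training points that determine the maximum-margin hyperplane.

# Linear SVM on toy linearly separable 2-D data (e1071)
library(e1071)

set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # class A cloud
           matrix(rnorm(40, mean = 4), ncol = 2))   # class B cloud
y <- factor(rep(c("A", "B"), each = 20))

fit <- svm(x, y, kernel = "linear", cost = 10, scale = FALSE)
fit$index                  # rows of x that are support vectors (they define the margin)
table(predict(fit, x), y)  # training predictions vs. true labels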
Methodology

Pipeline: Documents → Preprocessing → Indexing and feature selection → Applying the SVM classification algorithm → Performance measurement
Preprocessing: transform documents into a representation suitable for the classification task.
• Remove HTML or other tags
• Remove stopwords
• Perform word stemming

Indexing by different weighting schemes:
• Boolean weighting
• Word frequency weighting

Feature selection: remove non-informative terms from documents, to
• improve classification effectiveness
• reduce computational complexity
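A sketch of these preprocessing and indexing steps using the tm package; the two-document corpus is hypothetical illustration data (stemming additionally requires the SnowballC package).

# Preprocessing and indexing sketch with tm (illustrative corpus)
library(tm)

docs <- VCorpus(VectorSource(c("The match was a great win for the team",
                               "Stock markets fell sharply on Monday")))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)                  # strip punctuation
docs <- tm_map(docs, removeWords, stopwords("english"))  # remove stopwords
docs <- tm_map(docs, stemDocument)                       # word stemming (SnowballC)

# Word-frequency weighting; weightBin would give Boolean weighting instead
dtm <- DocumentTermMatrix(docs, control = list(weighting = weightTf))
inspect(dtm)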
Classification algorithms to compare:
• k-Nearest-Neighbor (KNN)
• Decision Tree (DT)
• Naive Bayes (NB)
• Support Vector Machine (SVM)

Performance of each algorithm is measured by:
• Training time
• Testing time
• Classification accuracy
Process

• Each document is a vector, with one component for each term (= word).
• Normalize each vector to unit length.
• High-dimensional vector space:
  • Terms are axes
  • 10,000+ dimensions, or even 100,000+
  • Docs are vectors in this space
• Each training doc is a point (vector) labeled by its topic (= class).
• Hypothesis: docs of the same class form a contiguous region of space.
• We define surfaces to delineate the classes in this space.
• The set of records available for developing classification methods is divided into two disjoint subsets: a training set and a test set.
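For instance, unit-length (Euclidean) normalization of document vectors can be sketched in R as follows, on a toy 2-documents-by-3-terms count matrix invented for illustration.

# Normalizing document vectors to unit length (toy counts)
m <- matrix(c(2, 0, 1,
              0, 3, 4), nrow = 2, byrow = TRUE)  # 2 docs x 3 terms
m_unit <- m / sqrt(rowSums(m^2))                 # divide each row by its norm
rowSums(m_unit^2)                                # both rows now have length 1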
SVM model implementation in R

Things to perform:
• Prepare the algorithm to classify the text documents
• Train and test the model
• Measure the performance of the SVM model

Packages to be used in R:
• RTextTools
• e1071 (SVM), rpart
• tm, stringr, plyr
• arules
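As a sketch of how these steps fit together, here is an end-to-end example with RTextTools (which trains its SVM via e1071); the six labeled sentences are hypothetical illustration data, and numeric class codes are used because RTextTools expects numeric labels.

# End-to-end SVM text classification sketch with RTextTools
library(RTextTools)

texts  <- c("the team won the final match",  "stocks fell on weak earnings",
            "a stunning goal in extra time", "the central bank raised rates",
            "the striker scored twice",      "shares rallied after the report")
labels <- c(1, 2, 1, 2, 1, 2)   # 1 = sport, 2 = finance

# Preprocess and index the documents
dtm <- create_matrix(texts, language = "english",
                     removeStopwords = TRUE, stemWords = TRUE)

# Train/test split: first four documents train, last two test
container <- create_container(dtm, labels, trainSize = 1:4,
                              testSize = 5:6, virgin = FALSE)

model   <- train_model(container, "SVM")      # fits an SVM via e1071
results <- classify_model(container, model)   # predicted labels + confidence
results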
LITERATURE REVIEW

1. "Text Categorization with Support Vector Machines: Learning with Many Relevant Features". Thorsten Joachims, Universität Dortmund, Informatik LS8, Baroper Str. 301, 44221 Dortmund, Germany. Learnings: the particular properties of learning with text data, and why SVMs are appropriate for it. [PDF]

2. "Automatic Text Categorization and Its Application to Text Retrieval". Wai Lam, Miguel Ruiz, and Padmini Srinivasan, November/December 1999. Learnings: the application of automatic categorization to text retrieval. [PDF]

3. "SVM Tutorial". Alexandre Kowalczyk, 23 November 2015. Learnings: how to classify text in R. [PDF]
Thank You