
I am trying to do some classification on customer emails.

  1. Is the email happy or sad (sentiment analysis)
  2. Is the email related to billing or not.

I am using Python 3 and think I have to use NLTK and scikit-learn. NLTK will help read and understand the text; I believe scikit-learn will do the classification (happy or sad, and billing or not).

Training data set 1: a few phrases, anywhere from one word to a sentence of 5 to 6 words, each labeled 1 (happy) or 0 (not happy). A few examples below:

  • Appreciate the help..1
  • great job..1
  • Awesome..1
  • terrible..0
  • confusing...0
  • slow down...0

Training data set 2: a few phrases indicating billing related question..(few examples below)

  • question on my bill
  • billing fee
  • my bill is too high
  • payment rejected

This seems straightforward from a conceptual standpoint. Where can I find some basic code that will show me:

  1. how I can use my own training data
  2. how I can load the email text as input and get back an answer: happy or sad, and billing or not.

1 Answer


Given your data sets, your approach is close to lexicon-based, since each item contains very few words.

For billing, a lexicon-based approach should work well. You should also give extra weight to the subject lines of the emails.
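A minimal sketch of that idea, assuming a small hand-built keyword set taken from the example phrases in the question. The names `BILLING_KEYWORDS`, `SUBJECT_WEIGHT`, `billing_score` and `is_billing`, as well as the weight and threshold values, are hypothetical choices for illustration and not part of any library:

```python
import re

# Hypothetical billing lexicon, seeded from the example phrases in the question.
BILLING_KEYWORDS = {"bill", "billing", "fee", "payment"}

# Assumption: a keyword in the subject line counts twice as much as one in the body.
SUBJECT_WEIGHT = 2


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def billing_score(subject, body):
    """Count keyword hits, weighting subject hits more heavily than body hits."""
    subject_hits = sum(1 for tok in tokenize(subject) if tok in BILLING_KEYWORDS)
    body_hits = sum(1 for tok in tokenize(body) if tok in BILLING_KEYWORDS)
    return SUBJECT_WEIGHT * subject_hits + body_hits


def is_billing(subject, body, threshold=1):
    """Flag the email as billing-related when the score reaches the threshold."""
    return billing_score(subject, body) >= threshold


print(is_billing("Question on my bill", "Hi, my bill is too high this month."))
print(is_billing("Great job", "Thanks for the quick help"))
```

The regex tokenizer keeps the sketch dependency-free; `nltk.word_tokenize()` could be swapped in, and the keyword set would need to grow as you see more real emails.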

For sentiment analysis you have two options:

  • Machine learning: In this case you should use a bigger data set (in my view, each item should be a full email). You can implement a Naive Bayes classifier following this tutorial.

  • Lexicon-based approach: There are several lexicons for sentiment analysis, e.g. SentiWordNet (downloadable via nltk.download()), MPQA, SentiStrength, WordNet-Affect via WNAffect, ... Preprocessing steps: tokenization (nltk.word_tokenize()) and POS tagging (nltk.pos_tag(), applied to the token list). You should also handle negation (polarity shifting is a good way to deal with it).
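To make the lexicon-based option concrete, here is a rough sketch of word-level scoring with simple polarity shifting. The tiny `LEXICON` below is a made-up stand-in built from the question's examples, not SentiWordNet or any of the other resources; the three-token negation window is also an assumption:

```python
import re

# Hypothetical toy lexicon (word -> polarity); a real system would load
# a published lexicon instead of this hand-written dictionary.
LEXICON = {
    "appreciate": 1, "great": 1, "awesome": 1,
    "terrible": -1, "confusing": -1, "slow": -1,
}
NEGATIONS = {"not", "no", "never"}


def sentiment_score(text, window=3):
    """Sum word polarities, flipping the sign of words that follow a negation.

    A negation word flips the polarity of the next `window` tokens
    (an assumed, simplistic form of polarity shifting).
    """
    score = 0
    flips_left = 0
    for tok in re.findall(r"[a-z']+", text.lower()):
        if tok in NEGATIONS or tok.endswith("n't"):
            flips_left = window
            continue
        polarity = LEXICON.get(tok, 0)
        if flips_left > 0:
            polarity = -polarity
            flips_left -= 1
        score += polarity
    return score


def label(text):
    """Map the numeric score to a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "happy"
    if score < 0:
        return "sad"
    return "neutral"


for sample in ["Great job, awesome help", "This is not great", "question on my bill"]:
    print(sample, "->", label(sample))
```

Text with no lexicon words comes back as "neutral", which is one of the limits of a pure lexicon approach on short phrases.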

Machine learning generally gives the best results, so if you have enough annotated emails it is the better choice.
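For the machine learning route, a minimal sketch with scikit-learn could look like the following. It trains a bag-of-words Naive Bayes model on the six example phrases from the question, purely for illustration; this is not the tutorial's code, the `classify` helper is a hypothetical name, and a data set this small is far too little for real use:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Your own training data: texts and their labels (1 = happy, 0 = not happy).
train_texts = ["Appreciate the help", "great job", "Awesome",
               "terrible", "confusing", "slow down"]
train_labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)


def classify(email_text):
    """Return 'happy' or 'sad' for one piece of text."""
    return "happy" if model.predict([email_text])[0] == 1 else "sad"


print(classify("great help"))
print(classify("terrible and slow"))
```

The same pipeline can be trained a second time with billing / not-billing labels to get the second classifier; only `train_texts` and `train_labels` change.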


3 Comments

Thanks Clemtoy! A follow-up question on the lexicon-based approach (billing). I'm going to use NLTK to derive meaningful data from my text (remove stop words, etc.). Then do I simply compare words against my own training data (the billing phrases)? #1: compare single words with single words in my training data. #2: compare bigrams with two-word phrases from my data. #3: compare trigrams with three-word phrases, then four-word phrases, and so on. I'm thinking seven-word phrases are the max I have for now, e.g. "I have a question on my bill". So I guess I look at and compare n-grams?
Yes, you can try that!
By the way, emails are going to be a small portion of my data; the majority is going to be phone calls transcribed to text. Will keep my fingers crossed!
