Handling Text Data
INAFU6513 Lecture 7b
Lab 7: your 5-7 things
Get familiar with text processing
Get familiar with text data
Read text data
Classify text data
Analyse text data
Text processing
● Information retrieval
○ Search
○ Named entity recognition
● Learning
○ Classification
○ Clustering
○ Topic identification/ topic following
○ Sentiment analysis
○ Network analysis (words, people etc)
Reading Text Data
Text Data Sources
● Messages (tweets, emails, sms messages...)
● Document text (reports, blogposts, website text…)
● Audio (via speech-to-text processing)
● Images (via OCR)
Get your raw text data
fsipa = open('sipatext.txt', 'r')
sipatext = fsipa.read()     # read the whole file into a single string
fsipa.close()
print(sipatext)
Counting: Bags of Words
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
word_counts = count_vect.fit_transform([sipatext])   # sparse matrix of word counts
print('{}'.format(word_counts))
print('{}'.format(count_vect.vocabulary_))           # mapping from each word to its column index
Counting sets of words: N-Grams
● Pairs (or triples, 4s etc) of words
● Also: pairs etc. of characters, e.g. ['mor', 'ore', 're ', 'e t', ' th', 'tha', 'han'] (the character trigrams of 'more than')
● Know your Ns:
○ ‘Unigram’ == 1-gram
○ ‘Bigram’ == 2-gram
○ ‘Trigram’ == 3-gram
count_vectn = CountVectorizer(ngram_range=(2, 2))   # count word pairs (bigrams) instead of single words
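A minimal sketch of character n-grams, assuming the same sipatext string as above: setting analyzer='char' makes CountVectorizer count character triples instead of words.
count_vectc = CountVectorizer(analyzer='char', ngram_range=(3, 3))   # character trigrams
char_counts = count_vectc.fit_transform([sipatext])
print('{}'.format(count_vectc.vocabulary_))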
Stopwords
count_vect2 = CountVectorizer(stop_words='english')
word_counts2 = count_vect2.fit_transform([sipatext])
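A quick check of what the stopword list does, assuming both vectorisers above have already been fitted on sipatext: the vocabulary should shrink once the common English words are dropped.
print(len(count_vect.vocabulary_))    # vocabulary size with stopwords kept
print(len(count_vect2.vocabulary_))   # smaller vocabulary once common English words are removed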
Term Frequencies
● TF: Term Frequency:
○ word count / (number of words in this document)
○ “How important (0 to 1) is this word to this document”?
● IDF: Inverse Document Frequency
○ 1 / (number of documents this word appears in), usually log-scaled in practice
○ “How rare, and therefore how distinctive, is this word in this corpus”?
● TFIDF:
○ TF * IDF
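A minimal sketch of these formulas on a toy two-document corpus (scikit-learn's TfidfTransformer, used later, applies a smoothed, log-scaled IDF, so its scores will differ slightly).
docs = [['data', 'science', 'class'], ['data', 'analysis']]
word = 'science'
tf = docs[0].count(word) / float(len(docs[0]))   # 1/3 of the words in document 0
df = sum(1 for doc in docs if word in doc)       # 'science' appears in 1 of the 2 documents
idf = 1.0 / df
print('TFIDF of {!r} in doc 0: {}'.format(word, tf * idf))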
Machine Learning with Text Data
Classifying Text
Words are a valid input to machine learning algorithms
In this example, we’re using:
● Newsgroup emails as samples (‘rows’ in our input)
● Words in each email as features (‘columns’)
● Newsgroup ids as targets
The 20newsgroups dataset
from sklearn.datasets import fetch_20newsgroups
cats = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups( subset='train', categories=cats)
twenty_test = fetch_20newsgroups(subset='test', categories=cats)
Example email
Convert words to TFIDF scores
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)     # raw word counts per email
tfidf_transformer = TfidfTransformer(use_idf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)  # rescale the counts to TFIDF scores
Fit your model to the data
from sklearn.naive_bayes import MultinomialNB
nb_classifier = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
Test your model
docs_test = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_test)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = nb_classifier.predict(X_new_tfidf)
for doc, category in zip(docs_test, predicted):
    print('{} => {}'.format(doc, twenty_train.target_names[category]))
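The two hand-written sentences above are only a smoke test; a sketch of scoring the classifier on the held-out twenty_test split loaded earlier (accuracy is just the fraction of correct predictions):
import numpy as np
X_test_counts = count_vect.transform(twenty_test.data)    # reuse the vectoriser fitted on the training data
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
test_predicted = nb_classifier.predict(X_test_tfidf)
print('Accuracy: {}'.format(np.mean(test_predicted == twenty_test.target)))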
Text Clustering
We can also ‘cluster’ documents
● The ‘distance’ function is based on the words they have in common
Common machine learning algorithms for text clustering include:
● Latent Semantic Analysis
● Latent Dirichlet Allocation
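A minimal sketch of both, reusing the X_train_tfidf and X_train_counts matrices built in the classification example: TruncatedSVD on TF-IDF vectors is the usual scikit-learn route to Latent Semantic Analysis, and LatentDirichletAllocation works on raw counts. The choice of 10 components/topics is arbitrary here, and older scikit-learn versions call the LDA argument n_topics rather than n_components.
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
lsa = TruncatedSVD(n_components=10)               # Latent Semantic Analysis: 10 'concepts'
X_lsa = lsa.fit_transform(X_train_tfidf)
lda = LatentDirichletAllocation(n_components=10)  # Latent Dirichlet Allocation: 10 topics
X_topics = lda.fit_transform(X_train_counts)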
Text Analysis
Word collocation
● Create a graph (network visualisation) of words that appear together in
documents
● Use network analysis (later session) to show which pairs of words are
important in your documents
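A minimal sketch of finding strongly co-occurring word pairs with NLTK, assuming a token list like the sipawords built in the NLTK section below; the resulting pairs could then become edges in a network graph (e.g. with the networkx library).
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(sipawords)
finder.apply_freq_filter(2)                    # ignore pairs seen only once
print(finder.nbest(bigram_measures.pmi, 10))   # the 10 most strongly associated word pairs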
Sentiment analysis
● Mark documents (e.g. tweets) as having positive or negative sentiment
● Using machine learning
○ Training set: sentences, with ‘positive’/’negative’ for each sentence
● Using a sentiment dictionary
○ Positive or negative ‘score’ for each emotive word
○ Sentiment dictionaries can also be used as ‘seeds’ for machine learning algorithms
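A minimal sketch of the dictionary approach using NLTK's bundled VADER lexicon (an assumption here, not from the slides; it needs nltk.download('vader_lexicon') first):
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()
for tweet in ['I love this class', 'This queue is terrible']:
    print('{} {}'.format(tweet, analyser.polarity_scores(tweet)))   # 'compound' runs from -1 (negative) to +1 (positive)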
Named Entity Recognition
● Find the names of people, organisations, locations etc in text
● Can use these to create social graphs (networks showing how people etc
connect to each other) and find ‘hubs’, ‘connectors’ etc
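A minimal sketch using NLTK's built-in named entity chunker (it needs the 'punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker' and 'words' downloads; the sentence is made up for illustration):
import nltk
sentence = 'Kofi Annan worked for the United Nations in New York'
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))   # the chunker needs part-of-speech tags
tree = nltk.ne_chunk(tagged)
print(tree)                                           # PERSON, ORGANIZATION, GPE subtrees mark the entities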
Natural Language Processing
● Understanding the grammar and meaning of text
● Useful for, e.g. translation between languages
● Python library: NLTK
Getting started with NLTK
import nltk
nltk.download()   # opens NLTK's downloader; fetch at least 'punkt' (tokeniser) and 'book' (example texts)
Get text ready for NLTK processing
from nltk import word_tokenize
from nltk.text import Text
fsipa = open('example_data/sipatext.txt', 'r')
sipatext = fsipa.read()
fsipa.close()
sipawords = word_tokenize(sipatext)   # split the raw string into a list of word tokens
textlist = Text(sipawords)            # NLTK Text object, used by concordance() etc. below
NLTK: concordance
textlist.concordance('school')                     # every occurrence of 'school', with surrounding context
textlist.similar('school')                         # words that appear in similar contexts to 'school'
textlist.common_contexts(['school', 'university']) # contexts shared by both words
NLTK: word dispersion plots
from nltk.book import *   # loads the NLTK example texts; text2 is Sense and Sensibility
text2.dispersion_plot(['Elinor', 'Willoughby', 'Sophia'])   # where in the book each name appears (needs matplotlib)
NLTK: Word Meanings
from nltk.corpus import wordnet as wn
word = 'class'
synset = wn.synsets(word)
print('Synset: {}\n'.format(synset))
for i in range(len(synset)):
    print('Meaning {}: {} {}'.format(i, synset[i].lemma_names(), synset[i].definition()))
NLTK: Synsets
NLTK: converting words into logic
from nltk import load_parser
parser = load_parser('grammars/book_grammars/simple-sem.fcfg', trace=0)   # small teaching grammar bundled with NLTK
sentence = 'Angus gives a bone to every dog'
tokens = sentence.split()
for tree in parser.parse(tokens):
    print(tree.label()['SEM'])   # the first-order-logic representation attached to the parse
Exercises
Try the code in the 7.x series notebooks


Editor's Notes

  • #4 Topic following: includes tracking things like hate speech (iHub Nairobi has done a lot of work on this topic) Verification: the Pheme project (http://www.pheme.eu/) is working on automatically tracking the veracity of stories.
  • #6 For speech recognition in Python, try https://pypi.python.org/pypi/SpeechRecognition/ or the recipe at http://code.activestate.com/recipes/579115-recognizing-speech-speech-to-text-with-the-python-/. We’re looking at two pieces of data today: the Wikipedia entry for SIPA, and a set of tweets about the #migrantcrisis, grabbed from the Twitter API by using notebook 3.1.
  • #8 Scikit-learn has some powerful text processing functions, including this one to separate text into words
  • #9 word n-grams; character n-grams
  • #10 Stopwords are common words (“the”, “a”, “and”) that don’t add to meaning, and might confuse outputs
  • #11 From http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/: If a word appears frequently in a document, it's important. Give the word a high score. But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
  • #24 Aka computational linguistics
  • #32 More than you ever wanted to know about parsing sentences: http://www.nltk.org/howto/featgram.html Simple_sem is a simple grammar, just for teaching: its whole specification is at https://github.com/nltk/nltk_teach/blob/master/examples/grammars/book_grammars/simple-sem.fcfg