From the course: Deep Learning with Python and Keras: Build a Model for Sentiment Analysis

Word vector encodings and word embeddings

- [Instructor] When you think about machine learning models, you know that ML models only process numeric data, and this is true of neural networks as well. So the big question when you're working with neural networks for sentiment analysis is, how do we represent text in numeric form? To put it another way, how do we express words as numbers so that ML models can understand them?

Let's consider a document or text that we want to feed into a machine learning model for sentiment analysis. Here is a review of a restaurant: "The restaurant we visited last night was great, great food, very good ambience." Now the first thing you do is preprocess the document, so all of the stop words are removed and the document is converted entirely to lowercase. You might get rid of all of the punctuation as well. You then tokenize this document, extracting tokens, which could be entire words or portions of words. For simplicity, let's assume that every word in this document becomes a separate token after tokenization. Once you have these individual tokens from the document, the next step is to figure out how to represent each token numerically. Every bit of text or every document that you feed into your machine learning model is represented using a tensor of tokens, where a tensor is just a multidimensional array.

The next question to ask yourself is, what is the numeric representation of a token? How do you get these tokens or words that you have extracted from the input documents to be represented using numbers? It turns out that there are several techniques that can achieve this. We'll only talk about a few of them here; the three techniques that we discuss are count vector encoding, TF-IDF encoding, and word embeddings.

Let's start by understanding the first of these techniques, count vector encoding. If you're familiar with one-hot encoding, count vector encoding is just a modification of that. In count vector encoding, every document or piece of text is represented as an array of numbers. In order to build up this array, you first need to create a vocabulary of tokens. Now, once again, I sometimes use the term words instead of tokens, but tokens need not be complete words, just something to keep in mind. Within the vocabulary of tokens, every token has a unique index position, which identifies that particular token, and every document is now represented by a tensor whose length is equal to the size of the vocabulary. So if you have 10,000 words in the vocabulary, the size of the tensor representing each document will be 10,000. The elements of this tensor correspond to how many times the token at that particular index occurs in the document. So if a word occurs twice, it'll have a count of two. If it occurs three times, it'll have a count of three. If it doesn't occur at all, the count will be zero, so words that are not present in a document are represented by zero.
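To make that concrete, here is a minimal sketch of count vector encoding using the TextVectorization layer from TensorFlow's Keras API. The tiny corpus and the review text are purely illustrative, not taken from the course exercises.

```python
import tensorflow as tf

# A tiny illustrative corpus; in practice you would adapt on your own training data.
corpus = [
    "restaurant visited last night great great food good ambience",
    "terrible service cold food",
    "good food friendly staff",
]

# TextVectorization builds the vocabulary and turns each document into a tensor
# whose length equals the vocabulary size; output_mode="count" stores how many
# times the token at each index occurs in the document.
count_vectorizer = tf.keras.layers.TextVectorization(output_mode="count")
count_vectorizer.adapt(corpus)

print(count_vectorizer.get_vocabulary())                # the index position of every token
print(count_vectorizer(["great food great ambience"]))  # e.g. a 2 at the index for "great"
```

The same layer also accepts output_mode="tf_idf", which produces the TF-IDF encoding discussed next while keeping the same tensor shape.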
Another technique that you can use to represent text in numeric form is TF-IDF encoding of the individual tokens. Once again, you'd create a vocabulary of tokens, and every document is represented by a tensor whose length is equal to the vocabulary size. So far, this is exactly like count vector encoding. The difference here is that the elements of the tensor representing each document are TF-IDF scores for the individual tokens; every token in the vocabulary has a TF-IDF score that's computed. So what's TF-IDF? It stands for term frequency-inverse document frequency, and the TF-IDF score is a combination of these two components. Words that occur more often within a single document tend to have a high term frequency score. For example, let's say the word awesome occurs multiple times in a single document; that word will be weighted more heavily due to the term frequency component. The vocabulary is built up using the corpus of documents that we use to train the model, and words that occur less often across the corpus have a high IDF score. The idea is that words that occur often across the document corpus tend to be common words and contain less information, while words that are infrequent contain more information.

Both count vector and TF-IDF encodings suffer from one significant drawback: they do not capture the meaning and semantic relationships that exist between words. Also, the feature vectors used to represent the individual documents in the corpus tend to be very large, the size of the vocabulary, and this is why these simple encodings are not used to train models in the real world. Instead, we use word embeddings.

Now, what exactly are embeddings? Embeddings represent words or tokens as vectors, and the numbers in those vectors capture the semantic similarity between words. The word embeddings for two words that are similar in meaning and context will be very close to one another. For example, if you were to generate embeddings for orange and apple, well, both of them are fruits, so their dense vector embeddings will be numerically close to each other. Word embeddings thus allow machine learning models to consider the context in which words are used and provide a deeper understanding of word meanings, which is crucial for accurate sentiment analysis. Word embeddings are usually learned during the model training process: you can train an embedding layer to learn word embeddings for your input tokens. However, there are also models that make pre-trained word embeddings available to you, which you can use when training your own model as well. And overall, using word embeddings tends to improve sentiment analysis accuracy.
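As a rough illustration of training an embedding layer as part of a sentiment model, here is a minimal Keras sketch. The vocabulary size, embedding dimension, and sequence length are assumed values for illustration, not the course's actual settings.

```python
import tensorflow as tf

vocab_size = 10000    # illustrative: matches the 10,000-word vocabulary example above
embedding_dim = 16    # illustrative embedding dimension
max_length = 100      # illustrative padded review length

# Each review arrives as a sequence of integer token indices. The Embedding layer
# maps every index to a dense vector that is learned during training, so tokens
# used in similar contexts can end up with similar vectors.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(max_length,), dtype="int32"),
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # positive vs. negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

During training, the embedding weights are updated along with the rest of the model, so the learned vectors reflect how tokens are actually used in the sentiment data.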

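For the pre-trained option mentioned above, one common pattern is to copy pre-trained vectors into an Embedding layer's weights. This is a hedged sketch only: the GloVe file name, the tiny vocabulary, and the 100-dimension size are assumptions for illustration, not part of the course material.

```python
import numpy as np
import tensorflow as tf

# Illustrative vocabulary; in practice this would come from your tokenizer or vectorizer.
vocab = ["[UNK]", "great", "food", "good", "ambience", "terrible"]
embedding_dim = 100   # must match the dimensionality of the pre-trained vectors

# Hypothetical GloVe-style text file: each line is "word value value ... value".
pretrained = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()
        pretrained[word] = np.asarray(values, dtype="float32")

# One row per vocabulary token; tokens missing from the pre-trained file stay at zero.
embedding_matrix = np.zeros((len(vocab), embedding_dim))
for index, token in enumerate(vocab):
    if token in pretrained:
        embedding_matrix[index] = pretrained[token]

# Build the layer, then place the pre-trained matrix into its weights.
embedding_layer = tf.keras.layers.Embedding(
    input_dim=len(vocab), output_dim=embedding_dim, trainable=False
)
embedding_layer.build((1,))
embedding_layer.set_weights([embedding_matrix])
```

Setting trainable=False keeps the pre-trained vectors fixed during training; switching it to True fine-tunes them on your sentiment data.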