I'm trying to get a list of every word, 2-word, and 3-word phrase used in a large set of product reviews (200K+ reviews). The reviews are provided to me as JSON objects. I have attempted to keep as little data in memory as possible by using generators, but I'm still running out of memory and don't quite know where to go next. I reviewed the use of generators/iterators and a very similar problem here: repeated phrases in the text Python, but I still can't get it to work for a large dataset (my code works fine on a subset of the reviews).

The format (or at least the intended format) of my code is as follows:

- Read in the text file containing JSON objects line by line
- Parse the current line to a JSON object and pull out the review text (there is other data in the dict which I do not need)
- Break the review into component words, clean the words, and then add them to my master list, or increment the counter of that word/phrase if it already exists

Any assistance would be greatly appreciated!

import json
import nltk
import collections

#define set of "stopwords", those that are removed
s_words=set(nltk.corpus.stopwords.words('english')).union(set(["it's", "us", " "]))

#load tokenizer, which will split text into words, and stemmer - which stems words
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
stemmer = nltk.SnowballStemmer('english')
master_wordlist = collections.defaultdict(int)
#open the raw data and read it in by line
allReviews = open('sample_reviews.json')
lines = allReviews.readlines()
allReviews.close()


#Get all of the words, 2 and 3 word phrases, in one review
def getAllWords(jsonObject):
    all_words = []
    phrase2 = []
    phrase3 = []

    sentences=tokenizer.tokenize(jsonObject['text'])
    for sentence in sentences:
        #split up the words and clean each word
        words = sentence.split()

        for word in words:
            adj_word = str(word).translate(None, '"""#$&*@.,!()-+?/[]1234567890\'').lower()
            #filter out stop words
            if adj_word not in s_words:

                all_words.append(str(stemmer.stem(adj_word)))

                #add all 2 word combos to list
                phrase2.append(str(word))
                if len(phrase2) > 2:
                    phrase2.remove(phrase2[0])
                if len(phrase2) == 2:
                    all_words.append(tuple(phrase2))

                #add all 3 word combos to list
                phrase3.append(str(word))
                if len(phrase3) > 3:
                    phrase3.remove(phrase3[0])
                if len(phrase3) == 3:
                    all_words.append(tuple(phrase3))

    return all_words
#end of getAllWords

#parse each line from the txt file to a json object
for c in lines:
    review = json.loads(c)
    #count instances of each unique word/phrase in the wordlist
    for phrase in getAllWords(review):
        master_wordlist[phrase] += 1

1 Answer

I believe calling readlines loads the whole file into memory; there should be less overhead if you just iterate over the file object line by line:

#parse each line from the txt file to a json object
with open('sample_reviews.json') as f:
    for line in f:
        review = json.loads(line)
        #count instances of each unique word/phrase in the wordlist
        for phrase in getAllWords(review):
            master_wordlist[phrase] += 1

3 Comments

Thanks for the reply. I will rewrite to remove the readlines call. My assumption was that, since the code doesn't bomb out until it has run for a while, I had some runaway memory issue after that point. I'll try fixing that first.
@flyingmeatball Where are your generators? Can getAllWords yield a result instead of building and returning a list?
When I try to implement yield for getAllWords, I get an 'unhashable type: list' error on the line master_wordlist[phrase] += 1. If you have any suggestions I'd be happy to listen.
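On the 'unhashable type: list' error in the last comment: dictionary keys must be hashable, so a generator version has to yield tuples (which the list version already builds with tuple(phrase2)) rather than the phrase2/phrase3 lists themselves. Below is a minimal sketch of such a generator, assuming the same tokenizer, stemmer, and s_words globals as in the question; genAllWords is a hypothetical name, not something from the original code.

#Generator version of getAllWords: yields each stemmed word and each
#2/3-word phrase one at a time instead of building a per-review list.
#Phrases are yielded as tuples because dict keys must be hashable;
#yielding the phrase2/phrase3 lists directly raises 'unhashable type: list'.
def genAllWords(jsonObject):
    phrase2 = []
    phrase3 = []
    for sentence in tokenizer.tokenize(jsonObject['text']):
        for word in sentence.split():
            adj_word = str(word).translate(None, '"#$&*@.,!()-+?/[]1234567890\'').lower()
            if adj_word not in s_words:
                yield str(stemmer.stem(adj_word))
                #sliding window of the last 2 raw words
                phrase2.append(str(word))
                if len(phrase2) > 2:
                    del phrase2[0]
                if len(phrase2) == 2:
                    yield tuple(phrase2)
                #sliding window of the last 3 raw words
                phrase3.append(str(word))
                if len(phrase3) > 3:
                    del phrase3[0]
                if len(phrase3) == 3:
                    yield tuple(phrase3)

The counting loop stays the same (for phrase in genAllWords(review): master_wordlist[phrase] += 1); the gain is that no per-review list is ever materialized. Note that master_wordlist itself still grows with every distinct word and phrase seen, so this alone may not eliminate the memory pressure.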
