
I'm currently working with a dataset containing raw text, which I need to pre-process:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
lemma = WordNetLemmatizer()

from autocorrect import spell

for df in [train_df, test_df]:
    # Tokenize the raw comment text.
    df['comment_text'] = df['comment_text'].apply(lambda x: word_tokenize(str(x)))
    # Spell-correct each token, then lemmatize it.
    df['comment_text'] = df['comment_text'].apply(lambda x: [lemma.lemmatize(spell(word)) for word in x])
    # Re-join the tokens into a single string.
    df['comment_text'] = df['comment_text'].apply(lambda x: ' '.join(x))

Including the spell function, however, raises the memory usage to the point that I get a "MemoryError". This doesn't happen without that function. I'm wondering if there is a way to optimize this process while keeping the spell function (the dataset has lots of misspelled words).


2 Answers


I haven't got access to your dataframe so this is a bit speculative, but here goes...

apply will run the lambda function on the whole column at once, so it is probably holding all of that intermediate progress in memory. Instead, you could convert the lambda into a pre-defined function and use map, which applies the function element-wise.

def spellcheck_string(tokens):
    # The column holds token lists at this point, so spell-check and lemmatize each token.
    return [lemma.lemmatize(spell(word)) for word in tokens]

for df in [train_df, test_df]:
    # ...
    df['comment_text'] = df['comment_text'].map(spellcheck_string)
    # ...

Could you give this a try and see if it helps?
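If memory is still tight after that, one further idea worth trying (my own addition, so treat it as a rough sketch rather than a tested fix) is to memoize the spell lookups: comments tend to repeat the same words many times, so each distinct word then only gets corrected once. Something along these lines, reusing the imports from the question:

from functools import lru_cache

from autocorrect import spell
from nltk.stem.wordnet import WordNetLemmatizer

lemma = WordNetLemmatizer()

@lru_cache(maxsize=100000)
def correct(word):
    # Cache corrections so each distinct word is only spell-checked once.
    return spell(word)

def spellcheck_string(tokens):
    # tokens is the list produced by word_tokenize in the earlier step.
    return [lemma.lemmatize(correct(word)) for word in tokens]

The cache trades a bounded amount of extra memory for far fewer calls into autocorrect; whether it also resolves the MemoryError depends on where the memory is actually being allocated.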


7 Comments

Oh great, didn't know about this! I'll try it out! Thank you.
The memory is reaching 11 GB again. 13 now. I tried once and my PC froze.
How big is your dataset? Approximate number of items in column, and typical size of the ["comment_text"] field? I would love to help based on some dummy data. It may be that operating on the data in batches is the way forwards.
I'm playing with this dataset: kaggle.com/c/jigsaw-toxic-comment-classification-challenge. Thank you for your kindness. I'm running the code another time and I can check the lines without interrupting it.
*Can't check the number of lines. My approach now is to parallelize it.

Anyway, I would work with dask: you can divide your dataframe into chunks (partitions), then retrieve each part and work with it separately.

https://dask.pydata.org/en/latest/dataframe.html
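A minimal sketch of what that could look like here (the partition count and the meta hint are guesses you would tune for your data, and spellcheck_string is the per-row function from the other answer, operating on the already-tokenized column):

import dask.dataframe as dd

# Split the pandas DataFrame into partitions that dask processes chunk by chunk.
ddf = dd.from_pandas(train_df, npartitions=16)

# Apply the per-row cleanup; meta tells dask the name and dtype of the output column.
ddf['comment_text'] = ddf['comment_text'].apply(
    spellcheck_string,
    meta=('comment_text', 'object'),
)

train_df = ddf.compute()

The work stays lazy until compute(), and doing it partition by partition should keep peak usage lower than transforming the whole column in one go; dask can also be pointed at a multiprocessing scheduler if you want to parallelize it, as mentioned in the comments above.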

