
I'm currently working with a dataset containing raw text, which I need to pre-process:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
lemma = WordNetLemmatizer()

from autocorrect import spell

for df in [train_df, test_df]:
    # Tokenize the raw comment text.
    df['comment_text'] = df['comment_text'].apply(lambda x: word_tokenize(str(x)))
    # Spell-correct each token, then lemmatize it.
    df['comment_text'] = df['comment_text'].apply(lambda x: [lemma.lemmatize(spell(word)) for word in x])
    # Re-join the tokens into a single string.
    df['comment_text'] = df['comment_text'].apply(lambda x: ' '.join(x))

Including the spell function, however, raises the memory usage to the point that I get a "MemoryError". This doesn't happen without that function. I'm wondering if there is a way to optimize this process while keeping the spell function (the dataset has lots of misspelled words).


2 Answers


I haven't got access to your dataframe so this is a bit speculative, but here goes...

apply will run the lambda function on the whole column at once, so it is probably holding all of that intermediate progress in memory. Instead, you could convert the lambda into a pre-defined function and use map, which applies the function element-wise.

def spellcheck_string(tokens):
    # The column holds token lists at this point, so spell-check and lemmatize each token.
    return [lemma.lemmatize(spell(word)) for word in tokens]

for df in [train_df, test_df]:
    # ...
    df['comment_text'] = df['comment_text'].map(spellcheck_string)
    # ...

Could you give this a try and see if it helps?
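If memory is still tight after that, one further idea worth trying (my own addition, so treat it as a rough sketch rather than a tested fix) is to memoize the spell lookups: comments tend to repeat the same words many times, so each distinct word then only gets corrected once. Something along these lines, reusing the imports from the question:

from functools import lru_cache

from autocorrect import spell
from nltk.stem.wordnet import WordNetLemmatizer

lemma = WordNetLemmatizer()

@lru_cache(maxsize=100000)
def correct(word):
    # Cache corrections so each distinct word is only spell-checked once.
    return spell(word)

def spellcheck_string(tokens):
    # tokens is the list produced by word_tokenize in the earlier step.
    return [lemma.lemmatize(correct(word)) for word in tokens]

The cache trades a bounded amount of extra memory for far fewer calls into autocorrect; whether it also resolves the MemoryError depends on where the memory is actually being allocated.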


7 Comments

Oh great, didn't know about this! I'll try it out! Thank you.
The memory is reaching 11 GB again. 13 now. I tried once and my PC froze.
How big is your dataset? Approximate number of items in column, and typical size of the ["comment_text"] field? I would love to help based on some dummy data. It may be that operating on the data in batches is the way forwards.
I'm playing with this dataset: kaggle.com/c/jigsaw-toxic-comment-classification-challenge. Thank you for your kindness. I'm running the code another time and I can check the lines without interrupting it.
*Can't check the number of lines. My approach now is to parallelize it.

Anyway, I would work with dask: you can divide your dataframe into chunks (partitions), then retrieve each part and work with it separately.

https://dask.pydata.org/en/latest/dataframe.html
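A minimal sketch of what that could look like here (the partition count and the meta hint are guesses you would tune for your data, and spellcheck_string is the per-row function from the other answer, operating on the already-tokenized column):

import dask.dataframe as dd

# Split the pandas DataFrame into partitions that dask processes chunk by chunk.
ddf = dd.from_pandas(train_df, npartitions=16)

# Apply the per-row cleanup; meta tells dask the name and dtype of the output column.
ddf['comment_text'] = ddf['comment_text'].apply(
    spellcheck_string,
    meta=('comment_text', 'object'),
)

train_df = ddf.compute()

The work stays lazy until compute(), and doing it partition by partition should keep peak usage lower than transforming the whole column in one go; dask can also be pointed at a multiprocessing scheduler if you want to parallelize it, as mentioned in the comments above.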

