I'm currently working with a data set containing raw text that I need to pre-process:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from autocorrect import spell

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
lemma = WordNetLemmatizer()

for df in [train_df, test_df]:
    # tokenize, spell-correct and lemmatize each comment, then re-join into a string
    df['comment_text'] = df['comment_text'].apply(lambda x: word_tokenize(str(x)))
    df['comment_text'] = df['comment_text'].apply(lambda x: [lemma.lemmatize(spell(word)) for word in x])
    df['comment_text'] = df['comment_text'].apply(lambda x: ' '.join(x))
Including the spell function, however, raises the memory usage to the point where I get a MemoryError. This doesn't happen without that function. I'm wondering if there is a way to optimize this process while keeping the spell function (the data set has lots of misspelled words).
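
For reference, one direction I've been considering (not sure if it actually helps with the memory issue) is caching the correction so spell is only called once per distinct token instead of once per occurrence. A minimal sketch of that idea, assuming the same autocorrect spell API I use above and a small hypothetical DataFrame standing in for train_df / test_df:

from functools import lru_cache

import pandas as pd
from autocorrect import spell
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemma = WordNetLemmatizer()

@lru_cache(maxsize=None)
def correct_and_lemmatize(word):
    # spell() is expensive; repeated tokens hit the cache instead
    return lemma.lemmatize(spell(word))

# hypothetical example frame, only for illustration
df = pd.DataFrame({'comment_text': ['Thiss is a badd comment', 'Anothr badd one']})

df['comment_text'] = df['comment_text'].apply(
    lambda x: ' '.join(correct_and_lemmatize(w) for w in word_tokenize(str(x)))
)

Would something like this be a reasonable way to go, or is there a better way to keep spell correction without running out of memory?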
