
I have a text file that is more than 200 MB in size. I want to read it and then select the 30 most frequently used words. When I run my code, it gives me an error. The code is as follows:

    import sys, string
    import codecs
    import re
    from collections import Counter
    import collections
    import unicodedata

    with open('E:\\Book\\1800.txt', "r", encoding='utf-8') as File_1800:
        for line in File_1800:
            sepFile_1800 = line.lower()
            words_1800 = re.findall(r'\w+', sepFile_1800)
    for wrd_1800 in [words_1800]:
        long_1800 = [w for w in wrd_1800 if len(w) > 3]
        common_words_1800 = dict(Counter(long_1800).most_common(30))
    print(common_words_1800)


    Traceback (most recent call last):
      File "C:\Python34\CommonWords.py", line 14, in <module>
        for line in File_1800:
      File "C:\Python34\lib\codecs.py", line 313, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 3784: invalid start byte
1 Comment
Not sure if it is the same in your actual code, but your indentation is off.

2 Answers


The file does not contain UTF-8 encoded data. Find the correct encoding and update the line:

    with open('E:\\Book\\1800.txt', "r", encoding='correct_encoding')
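
If you do not have an editor handy, you can also guess the encoding programmatically. Here is a minimal sketch using the third-party chardet package (an assumption on my part, not something mentioned above; it must be installed separately, and its guess is not guaranteed either):

    import chardet  # third-party: pip install chardet

    # Read a chunk of raw bytes and let chardet guess the encoding.
    with open('E:\\Book\\1800.txt', 'rb') as f:
        raw = f.read(100000)

    print(chardet.detect(raw))
    # illustrative output: {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}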


4 Comments

Can you tell me how to find the correct encoding? Actually, I am new to Python.
You can use the Notepad++ editor to determine the encoding. It usually gets it right, although it's not 100% reliable. You can also try some popular options like `ISO-8859-1`.
I tried "ISO-8859-1" it gives me this result.... "{'subscribe': 1, 'email': 1, 'ebooks': 1, 'newsletter': 1, 'hear': 1, 'about': 1}"... This file contain more than 90000000 words. I tried Notepad++ Open the file in Notepad++, click on "Encoding" and it shows "Encoded in ANSI".
Well, the "Encoded in ANSI" part suggests it is in an ANSI-encoded format, also referred to as Windows-1252 or `CP-1252` (which you can try using).

Try `encoding='latin1'` instead of `utf-8`.

Also, in these lines:

    for line in File_1800:
        sepFile_1800 = line.lower()
        words_1800 = re.findall(r'\w+', sepFile_1800)
    for wrd_1800 in [words_1800]:
        ...

The script is re-assigning the matches of re.findall to the words_1800 variable for every line. So when you get to for wrd_1800 in [words_1800], the words_1800 variable only has matches from the very last line.

If you want to make minimal changes, initialize an empty list before iterating through the file:

    words_1800 = []

And then add the matches for each line to the list, rather than replacing the list:

    words_1800.extend(re.findall(r'\w+', sepFile_1800))

Then you can do (without the second for loop):

    long_1800 = [w for w in words_1800 if len(w) > 3]
    common_words_1800 = dict(Counter(long_1800).most_common(30))
    print(common_words_1800)
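
Putting the pieces together, the whole script would look something like this (a sketch that assumes latin1 turns out to be the right encoding for your file):

    import re
    from collections import Counter

    words_1800 = []

    with open('E:\\Book\\1800.txt', 'r', encoding='latin1') as File_1800:
        for line in File_1800:
            # Accumulate matches from every line instead of overwriting them.
            words_1800.extend(re.findall(r'\w+', line.lower()))

    long_1800 = [w for w in words_1800 if len(w) > 3]
    common_words_1800 = dict(Counter(long_1800).most_common(30))
    print(common_words_1800)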

4 Comments

The result is {'ebooks': 1, 'hear': 1, 'subscribe': 1, 'email': 1, 'newsletter': 1, 'about': 1}. This file contains more than 90000000 words.
Oh I just meant to fix the UnicodeDecodeError - I updated the answer with some comments on your code.
Thanks, it worked but not fully. For a 60 MB file it worked, but for the other file (300 MB) it gives me this error: Traceback (most recent call last): File "C:\Python34\CommonWords.py", line 17, in <module> words_1800.extend(re.findall(r'\w+', sepFile_1800)) MemoryError
There are a few changes you can make for your code to be more efficient, for example counting line by line as sketched below. You could post a new question for that, since it's a different topic.
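
For reference, one common way around the MemoryError (a sketch, not part of the original answer) is to update a Counter line by line, so only the tallies stay in memory rather than one giant list of words:

    import re
    from collections import Counter

    counts = Counter()

    with open('E:\\Book\\1800.txt', 'r', encoding='latin1') as File_1800:
        for line in File_1800:
            # Tally the words of each line; the full word list is never built.
            counts.update(w for w in re.findall(r'\w+', line.lower()) if len(w) > 3)

    print(dict(counts.most_common(30)))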
