0

I am writing a program that counts the approximate number of words in a file, and I am getting an error stating that the 'ascii' codec can't decode a byte.

How can I eliminate this error?

Below is the traceback for the error:

Traceback (most recent call last):
  File "/Users/NikolaMac/Desktop/alice.py", line 23, in <module>
    contents = f_obj.read()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

Here is my code:

filename = 'alice.txt'

try:
    with open(filename) as f_obj:
        contents = f_obj.read()

except FileNotFoundError:
    msg = "Sorry, the file " + filename + " does not exist."
    print(msg)

else:
    # Count the approximate number of words in the file.
    words = contents.split()
    num_words = len(words)
    print("The file " + filename + " has about " + str(num_words) + " words.")
1
  • @MylesHollowed The error message shows Python 3.6. Commented Sep 12, 2018 at 1:19

3 Answers

2

You need to pass an encoding when you open the file. The example below uses io.open, which in Python 3 is the same function as the built-in open, so either one works.

Try this:

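import io

filename = 'alice.txt'

# Decode the file as UTF-8 instead of the platform default.
with io.open(filename, encoding='utf-8') as f_obj:
    contents = f_obj.read()

# split() with no argument splits on any run of whitespace.
print('Words: %d' % len(contents.split()))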
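
One detail in the traceback may matter here: byte 0xef at position 0 is the first byte of a UTF-8 byte order mark (EF BB BF), so the file probably starts with one. That is an inference from the error message, not something confirmed in the question. If so, decoding with 'utf-8' leaves a '\ufeff' character stuck to the first word, while the standard 'utf-8-sig' codec drops it. A minimal sketch, where read_text and the temporary demo file are made-up names for illustration:

```python
import codecs
import os
import tempfile


def read_text(path, encoding="utf-8-sig"):
    """Read a whole text file; 'utf-8-sig' skips a leading UTF-8 BOM if present."""
    with open(path, encoding=encoding) as f_obj:
        return f_obj.read()


# Build a throwaway file that starts with a BOM (hypothetical stand-in for alice.txt).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "wb") as raw:
        raw.write(codecs.BOM_UTF8 + "Alice was beginning to get very tired".encode("utf-8"))

    with_bom = read_text(demo_path, encoding="utf-8")  # BOM kept as '\ufeff'
    without_bom = read_text(demo_path)                 # BOM removed

print(repr(with_bom[0]))
print(without_bom.split()[0])
```

The word count is the same either way, because the BOM just attaches to the first word, but 'utf-8-sig' keeps that word clean if you later inspect the words themselves.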

2 Comments

Can I do anything in Terminal to make this encoding work without additional code?
What are you trying to do?
1

The error message says that Python is trying to decode the file as ASCII. You may need to specify a different encoding.

The only part of your program I can see where encoding can come in is the open call. According to the docs, if you don't pass in an encoding explicitly,

The default encoding is platform dependent (whatever locale.getpreferredencoding() returns)

Try passing in encoding='utf-8' to the open call.
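
To make that concrete, here is a rough sketch that prints the default encoding open() would fall back to on your machine and then counts words with an explicit encoding. The count_words helper and the temporary sample file are hypothetical names used only for this illustration:

```python
import locale
import os
import tempfile


def count_words(path, encoding="utf-8"):
    """Count whitespace-separated words, decoding with an explicit encoding."""
    with open(path, encoding=encoding) as f_obj:
        return len(f_obj.read().split())


# This is the encoding open() uses when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)
print("Default encoding for open():", default_encoding)

# Write a small UTF-8 sample containing non-ASCII characters, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f_obj:
        f_obj.write("Curiouser and curiouser, cried Alice, na\u00efve caf\u00e9")
    num_words = count_words(sample_path)

print("Words:", num_words)
```

If the first line printed is an ASCII variant (for example 'US-ASCII' or 'ANSI_X3.4-1968'), that would explain why the original script fails on any byte above 0x7f.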

Comments

0

I believe most common encodings (ASCII, UTF-8, Latin-1 and other ASCII-compatible ones) encode the space character as the byte 0x20 (from experience, not solid evidence). If all you need to do is count words, you can skip decoding: count the 0x20 bytes in the file and add 1. This simple method gives you an approximation.

With that method, consider subtracting any spaces at the very beginning or end of the file, since no word surrounds them. UTF-16 encodes a space as two bytes (0x20 0x00 in little-endian order), so there may be a null byte at the beginning or end of the file if the document starts or ends with a space. Some encodings also put a byte order mark at the beginning of the file, in which case the text doesn't start at the first byte.

You can't use text-based regex with this method, so it will not work if you want to parse documents in non-Latin-based languages.
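
A rough sketch of the byte-level idea, assuming an ASCII-compatible encoding such as UTF-8 (it will not give sensible results for UTF-16). Both function names are invented for this example. The first is the "0x20 bytes plus one" estimate described above; the second uses bytes.split(), which splits on runs of ASCII whitespace and ignores leading and trailing whitespace:

```python
def count_space_bytes(data):
    """Estimate from the text above: number of 0x20 bytes plus one."""
    return data.count(b" ") + 1


def count_whitespace_runs(data):
    """Count chunks separated by runs of ASCII whitespace, without decoding."""
    return len(data.split())


# In a real script the bytes would come from open(filename, "rb").read().
sample = "  The Cheshire  Cat\tgrinned\nat Alice ".encode("utf-8")

space_estimate = count_space_bytes(sample)
run_estimate = count_whitespace_runs(sample)

print("0x20 bytes + 1:", space_estimate)
print("whitespace runs:", run_estimate)
```

On this sample the first estimate is thrown off by the leading, trailing and doubled spaces and misses the tab and newline separators, while the second matches the six actual words.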

2 Comments

I don't think this is a very good approach... what about consecutive spaces or other whitespace tokens?
@robinsax Good point. Consecutive spaces can be taken into account. I don't know what whitespace tokens are though.
