How to solve "UnicodeDecodeError: 'ascii' codec can't decode byte"

Question

I am writing a program that counts the approximate number of words in a file, and I get an error stating that the 'ascii' codec can't decode a byte.

How can I eliminate this error?

Here is the traceback:

Traceback (most recent call last):
  File "/Users/NikolaMac/Desktop/alice.py", line 23, in <module>
    contents = f_obj.read()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

Here is my code:

filename = 'alice.txt'

try:
    with open(filename) as f_obj:
        contents = f_obj.read()

except FileNotFoundError:
    msg = "Sorry, the file " + filename + " does not exist."
    print(msg)

else:
    # Count the approximate number of words in the file.
    words = contents.split()
    num_words = len(words)
    print("The file " + filename + " has about " + str(num_words) + " words.")

@MylesHollowed The error message shows Python 3.6.

– leewz, Commented Sep 12, 2018 at 1:19

robinsax · Accepted Answer · 2018-09-12 05:05:35Z

2

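You need to pass an explicit encoding when you open the file. On Python 3 the built-in open accepts an encoding argument, and io.open is the same function, so importing io is only required on Python 2.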

Try this:

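import io

filename = 'alice.txt'

# On Python 3, io.open is the same function as the built-in open.
with io.open(filename, encoding='utf-8') as f_obj:
    contents = f_obj.read()

# split() with no argument splits on any run of whitespace.
print('Words: %d' % len(contents.split()))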
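
The byte 0xef at position 0 in the traceback is the first byte of a UTF-8 byte order mark (EF BB BF), so the file may well start with a BOM, although that is an inference and not something the question confirms. If it does, a minimal sketch using the standard 'utf-8-sig' codec, which decodes UTF-8 and drops a leading BOM, could look like this. The count_words and demo names and the sample bytes are made up for illustration.

```python
import os
import tempfile


def count_words(path, encoding="utf-8-sig"):
    """Return the number of whitespace-separated words in a text file."""
    # 'utf-8-sig' decodes UTF-8 and skips a leading byte order mark if present.
    with open(path, encoding=encoding) as f_obj:
        contents = f_obj.read()
    return len(contents.split())


def demo():
    # Hypothetical sample file: a UTF-8 BOM followed by four words.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
        tmp.write(b"\xef\xbb\xbfAlice  was beginning\nto\n")
        path = tmp.name
    try:
        return count_words(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

With plain 'utf-8' the BOM is decoded as the character U+FEFF and stays attached to the first word, which usually does not change the count but does show up if you print or compare the words.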

answered Sep 12, 2018 at 5:05

2 Comments

John988 Over a year ago

Can I do anything in Terminal to make this encoding work without additional code?

robinsax Over a year ago

What are you trying to do?

leewz · Accepted Answer · 2018-09-12 01:25:45Z

1

The error message says that Python is trying to decode the file as ASCII. You may need to specify a different encoding.

The only part of your program I can see where encoding can come in is the open call. According to the docs, if you don't pass in an encoding explicitly:

"The default encoding is platform dependent (whatever locale.getpreferredencoding() returns)"

Try passing in encoding='utf-8' to the open call.
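
To check what the docs quote above means on your machine, something like the following sketch may help. It prints the locale's preferred encoding and shows that bytes starting with 0xef fail under ASCII but decode under UTF-8. The helper names and the sample bytes are assumptions for illustration, and newer Python versions may pick the default differently from the 3.6 in the traceback.

```python
import locale


def default_text_encoding():
    """Encoding reported by the locale, which the quoted docs say open() uses."""
    return locale.getpreferredencoding(False)


def decodes_as(data, encoding):
    """Return True if the bytes decode cleanly with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical sample: a UTF-8 byte order mark followed by ASCII text.
    sample = b"\xef\xbb\xbfAlice"
    print("default encoding:", default_text_encoding())
    print("decodes as ascii:", decodes_as(sample, "ascii"))
    print("decodes as utf-8:", decodes_as(sample, "utf-8"))
```

If the first line prints an ASCII-family name such as ANSI_X3.4-1968 or US-ASCII, that would explain why open(filename) without an encoding fails on this file.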

answered Sep 12, 2018 at 1:25

MakotoE · Accepted Answer · 2018-09-12 03:12:05Z

0

I believe most encodings in common use (ASCII, UTF-8, Latin-1 and the like) encode the space character as the byte 0x20; this is from experience, not solid evidence, and EBCDIC is a known exception. If all you need to do is count words, you can skip the decoding process by counting the 0x20 bytes in the file and adding 1. This simple method will give you an approximate count.

With that method, you might have to subtract the spaces at the very beginning or end of the file, since there is no word on one side of such a space. UTF-16 encodes a space as 0x20 0x00 (little-endian) or 0x00 0x20 (big-endian), so there might be a null byte at the beginning or end of the file if the document starts or ends with a space. Some encodings also put a byte order mark at the beginning of the file, in which case the text doesn't start at the first byte.

Because the bytes are never decoded, you can't run text regexes over the content, so this method will not work if you need to parse documents in non-Latin-based languages.
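
A rough sketch of the byte-counting idea described above, assuming an encoding in which a space is the single byte 0x20 (so not UTF-16). The approx_word_count name is made up for illustration, and a real file would be read with open(path, 'rb').

```python
def approx_word_count(data):
    """Approximate a word count from raw bytes by counting 0x20 bytes."""
    # Drop spaces at the very start and end, since no word borders them.
    trimmed = data.strip(b" ")
    if not trimmed:
        return 0
    # Each remaining space is assumed to separate two words.
    return trimmed.count(b" ") + 1


if __name__ == "__main__":
    print(approx_word_count(b"one two three"))  # 3
    print(approx_word_count(b" one two "))      # 2
    print(approx_word_count(b"one  two"))       # 3: a double space is over-counted
    print(approx_word_count(b"one\ntwo"))       # 1: a newline is not counted
```

The last two calls show where the approximation drifts: consecutive spaces inflate the count, and words separated only by newlines or tabs are merged.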

edited Sep 12, 2018 at 3:12

answered Sep 12, 2018 at 2:53

2 Comments

robinsax Over a year ago

I don't think this is a very good approach... what about consecutive spaces or other whitespace tokens?

MakotoE Over a year ago

@robinsax Good point. Consecutive spaces can be taken into account. I don't know what whitespace tokens are though.
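
For readers wondering about the "other whitespace tokens" in this exchange: tabs and newlines also separate words. A small sketch of the difference between splitting on a single space and splitting on any whitespace, using a made-up sample string and hypothetical helper names:

```python
def count_on_single_space(text):
    """Split on the literal space character only."""
    return len(text.split(" "))


def count_on_any_whitespace(text):
    """Split on any run of spaces, tabs, or newlines."""
    return len(text.split())


if __name__ == "__main__":
    # Hypothetical sample with a double space, a tab, and a newline.
    sample = "one  two\tthree\nfour"
    print(sample.split(" "))                # ['one', '', 'two\tthree\nfour']
    print(count_on_single_space(sample))    # 3
    print(count_on_any_whitespace(sample))  # 4
```

Splitting on a single space yields an empty string for the double space and never breaks at the tab or newline, while split() with no argument treats every run of whitespace as one separator.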
