3

I have a binary file and I want to extract all ascii characters while ignoring non-ascii ones. Currently I have:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text))
   file.close()

However, I get an error when writing to the file: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128). How would I get Python to ignore non-ASCII characters?

  • Are you sure that the file does not have unicode characters within? Commented May 8, 2015 at 13:19
  • It looks like your input file is encoded as utf-16-le, so you should specify that encoding when you open the file. In Python 2 you need to use codecs.open, but in Python 3 you can use the normal built-in open Commented May 8, 2015 at 13:35

2 Answers

5

Use the built-in ASCII codec and tell it to ignore any errors, like:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text.encode('ascii', 'ignore')))
   file.close()

You can test & play around with this in the Python interpreter:

>>> s = u'hello \u00a0 there'
>>> s
u'hello \xa0 there'

Just trying to convert to a string throws an exception.

>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...as does just trying to encode that unicode string to ASCII:

>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...but telling the codec to ignore the characters it can't handle works okay:

>>> s.encode('ascii', 'ignore')
'hello  there'
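
For anyone on Python 3, where str is already Unicode and the built-in open accepts encoding and errors arguments, the same idea might look roughly like the sketch below. The function name extract_ascii and its parameters are placeholders I made up for illustration, and the input is assumed to be valid UTF-16-LE as in the question.

```python
def extract_ascii(src_path, dst_path, src_encoding="utf-16-le"):
    """Copy src_path to dst_path, silently dropping non-ASCII characters."""
    # Decode the input with its real encoding (assumed UTF-16-LE here).
    # newline="" turns off newline translation so the text is copied as-is.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()

    # errors="ignore" makes the ASCII codec skip whatever it cannot encode.
    with open(dst_path, "w", encoding="ascii", errors="ignore", newline="") as dst:
        dst.write(text)

    # Return the filtered text as well, which is handy for quick checks.
    return text.encode("ascii", "ignore").decode("ascii")
```

Note that this still keeps ASCII control characters (codes 0 to 31 and 127), since those are valid ASCII.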

3 Comments

Is there a predetermined range for what Python considers ASCII? The output is still picking up characters such as SOH and ACK (not sure what these are, I'm just typing them as they appear in Sublime Text).
@VeraWang SOH and ACK are ASCII. The range is 0 to 127 and those are 1 and 6.
@VeraWang -- ASCII characters 0..31 are non-printable (including those two, see the charts on this wikipedia page about ASCII - en.wikipedia.org/wiki/ASCII#ASCII_printable_code_chart). Maybe more information on the actual problem you're trying to solve would be useful if this isn't giving you what you need...
3

Basically, the ASCII table takes values in the range [0, 2^7) and associates them with (printable or non-printable) characters. So, to ignore non-ASCII characters, you just have to ignore every character whose code is not in [0, 2^7), i.e. whose code is greater than 127.

In Python, there is a built-in function called ord, whose docstring reads:

Return the integer ordinal of a one-character string.

In other words, it gives you the code of a character. Now, you must ignore all characters that, passed to ord, return a value of 128 or greater. This can be done by:

with open(filename, 'rb') as fobj:
    text = fobj.read().decode('utf-16-le')
    out_file = open("text.txt", "w")

    # Check every single character of `text`
    for character in text:
        # If it's an ascii character
        if ord(character) < 128:
            out_file.write(character)

    out_file.close()
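
As a quick sanity check, here is a small Python 3 sketch (an assumption on my part, since the code above is Python 2) showing that the ord test gives the same result as letting the ASCII codec ignore errors. The helper name keep_ascii and the sample string are made up for illustration.

```python
def keep_ascii(text):
    """Return text with every character whose code is 128 or more removed."""
    return "".join(ch for ch in text if ord(ch) < 128)


# U+00A0 (no-break space) is non-ASCII; \x01 (SOH) is an ASCII control character.
sample = "hello \u00a0 there\x01"
filtered = keep_ascii(sample)

# Same result as letting the ASCII codec drop what it cannot encode.
assert filtered == sample.encode("ascii", "ignore").decode("ascii")
print(repr(filtered))  # 'hello  there\x01'
```

The SOH character survives because its code (1) is below 128, which is why control characters can still show up in the output.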

Now, if you just want to keep printable characters, note that all of them (in the ASCII table at least) lie between 32 (space) and 126 (tilde), so you simply change the test to:

if 32 <= ord(character) <= 126:
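
Wrapped up as a function, the printable-only test could look something like this in Python 3. The name keep_printable_ascii is hypothetical, and the sample string is just there to exercise the boundaries.

```python
def keep_printable_ascii(text):
    """Keep only characters with codes 32 (space) through 126 (tilde)."""
    return "".join(ch for ch in text if 32 <= ord(ch) <= 126)


# \x01 (SOH), \x7f (DEL), U+00A0 and the newline are all dropped.
print(repr(keep_printable_ascii("a\x01b\x7fc\u00a0d~ \n")))  # 'abcd~ '
```

Be aware that this also strips tabs and newlines (codes 9 and 10), so you may want to let those through explicitly if the line structure of the text matters.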

7 Comments

So if I only wanted ASCII printable characters [32, 127] it's a simple ord(char) < 128 and ord(char) > 31?
@VeraWang Almost (127 isn't printable), although 31 < ord(char) < 127 is simpler.
@VeraWang That's almost it! You've forgotten that 127 is the DELETE character, not printable, so the interval is now the closed [32, 126]: ord(character) <= 126 and ord(character) >= 32
Or change to 32 <= ord(character) <= 126, as that's apparently what she wants anyway. That should be enough change then.
You keep doing that as if ord(character) >= 32 and ord(character) <= 126... why?