Python convert binary file into string while ignoring non-ascii characters

Question

I have a binary file and I want to extract all ascii characters while ignoring non-ascii ones. Currently I have:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text))
   file.close

However, writing to the file raises UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128). How do I get Python to ignore non-ASCII characters?

Are you sure that the file does not have unicode characters within?
– dawg, Commented May 8, 2015 at 13:19
It looks like your input file is encoded as utf-16-le, so you should specify that encoding when you open the file. In Python 2 you need to use codecs.open, but in Python 3 you can use the normal built-in open.
– PM 2Ring, Commented May 8, 2015 at 13:35
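
As a rough sketch of what the comment above describes, assuming Python 3: the built-in open accepts `encoding` and `errors` arguments, so both the decoding and the dropping of non-ASCII characters can be left to the file objects. The function name `utf16_to_ascii_file` and its parameters are hypothetical, chosen only for illustration.

```python
def utf16_to_ascii_file(src_path, dst_path):
    """Copy a UTF-16-LE text file to dst_path, dropping non-ASCII characters.

    Hypothetical helper for illustration only; assumes Python 3.
    """
    # Text mode with an explicit encoding decodes the bytes while reading.
    # newline="" switches off newline translation so the text is unchanged.
    with open(src_path, "r", encoding="utf-16-le", newline="") as src:
        text = src.read()

    # errors="ignore" makes the ASCII encoder skip anything it cannot encode.
    with open(dst_path, "w", encoding="ascii", errors="ignore", newline="") as dst:
        dst.write(text)
```

This only works if the input really is UTF-16-LE text throughout; arbitrary binary data may fail to decode at all.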

bgporter · Accepted Answer · answered 2015-05-08, last edited by MagTun 2018-12-09 23:15:57Z

5

Use the built-in ASCII codec and tell it to ignore any errors, like:

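with open(filename, 'rb') as fobj:
    # Decode the raw bytes as UTF-16 little-endian
    text = fobj.read().decode('utf-16-le')

# encode() returns a byte string, so open the output file in binary mode
with open("text.txt", "wb") as out_file:
    # 'ignore' silently drops every character that ASCII cannot represent
    out_file.write(text.encode('ascii', 'ignore'))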

You can test and play around with this in the Python 2 interpreter:

>>> s = u'hello \u00a0 there'
>>> s
u'hello \xa0 there'

Just trying to convert it to a str raises an exception, because Python 2 implicitly encodes with the ASCII codec:

>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...as does just trying to encode that unicode string to ASCII:

>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...but telling the codec to ignore the characters it can't handle works okay:

>>> s.encode('ascii', 'ignore')
'hello  there'
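
The session above is from Python 2, where encode returns a str. As a hedged sketch assuming Python 3, where encode returns bytes instead, the same idea could be wrapped in a small helper that decodes the result back to text. The name `to_ascii` is hypothetical and not part of the answer.

```python
def to_ascii(text):
    """Return `text` with every non-ASCII character removed (Python 3 str)."""
    # encode() yields bytes in Python 3; decode them again to get a str back.
    return text.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    s = "hello \u00a0 there"
    # The no-break space between the two words is dropped.
    print(repr(to_ascii(s)))
```

Control characters such as SOH or ACK are still ASCII, so this helper keeps them.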

edited Dec 9, 2018 at 23:15

MagTun

answered May 8, 2015 at 13:19

bgporter

3 Comments

Helen Che Over a year ago

Is there a predetermined range for what Python considers ASCII? The output is still picking up characters such as SOH and ACK (not sure what these are, I'm just typing them as they appear in Sublime Text).

Stefan Pochmann Over a year ago

@VeraWang SOH and ACK are ASCII. The range is 0 to 127 and those are 1 and 6.

bgporter Over a year ago

@VeraWang -- ASCII characters 0..31 are non-printable (including those two; see the charts on the Wikipedia page about ASCII: en.wikipedia.org/wiki/ASCII#ASCII_printable_code_chart). Maybe more information on the actual problem you're trying to solve would be useful if this isn't giving you what you need...

Spirine · Accepted Answer · 2015-05-08 13:48:08Z

3

Basically, the ASCII table takes values in the range [0, 2⁷) and associates each of them with a character (printable or not). So, to ignore non-ASCII characters, you just have to skip every character whose code is not in [0, 2⁷), i.e. keep only those whose code is less than or equal to 127.

Python has a built-in function called ord which, according to its docstring, will:

Return the integer ordinal of a one-character string.

In other words, it gives you the code of a character. Now, you must ignore all characters for which ord returns a value greater than 127. This can be done by:

with open(filename, 'rb') as fobj:
    text = fobj.read().decode('utf-16-le')

# The with statement closes the output file even if an error occurs
with open("text.txt", "w") as out_file:
    # Check every single character of `text`
    for character in text:
        # Keep it only if it is an ASCII character (code 0..127)
        if ord(character) < 128:
            out_file.write(character)

Now, if you only want to keep printable characters, note that all of them (in the ASCII table at least) lie between 32 (space) and 126 (tilde), so the test simply becomes:

if 32 <= ord(character) <= 126:
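
Put together, the printable-only test could look like the sketch below. It assumes Python 3 and an already decoded string; `keep_printable_ascii` is a hypothetical name used only for illustration.

```python
def keep_printable_ascii(text):
    """Return only the characters of `text` whose code is in [32, 126]."""
    # str.join builds the result in one pass instead of writing char by char.
    return "".join(ch for ch in text if 32 <= ord(ch) <= 126)


if __name__ == "__main__":
    # SOH (1), ACK (6), DEL (127) and the no-break space (160) are all dropped.
    print(repr(keep_printable_ascii("a\x01b\x06c\x7f\u00a0 ~")))
```

Note that this range also drops tabs (9) and newlines (10), so line breaks in the input are lost unless those codes are allowed explicitly.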

edited May 8, 2015 at 13:48

answered May 8, 2015 at 13:25

Spirine

7 Comments

Helen Che Over a year ago

So if I only wanted ASCII printable characters [32, 127] it's a simple ord(char) < 128 and ord(char) > 31?

Stefan Pochmann Over a year ago

@VeraWang Almost (127 isn't printable), although 31 < ord(char) < 127 is simpler.

Spirine Over a year ago

@VeraWang That's almost it! You've forgotten that 127 is the DELETE character, not printable, so the interval is now the closed [32, 126]: ord(character) <= 126 and ord(character) >= 32

Stefan Pochmann Over a year ago

Or change to 32 <= ord(character) <= 126, as that's apparently what she wants anyway. That should be enough change then.

Stefan Pochmann Over a year ago

You keep doing that as if ord(character) >= 32 and ord(character) <= 126... why?
