1

I am reading in words from a text file and comparing them to a set of words just to see how many times they appear in the sample document. For example, I have a text file and I want to see how many times the word "engineer" occurs.

The problem is that with my sample data, string comparison isn't working. I see that the program is indeed reading in what appears to be a word of <type 'str'> that looks like "engineer"; however, there is no match. When printing out the ASCII for each character in the word using ord(character), there appear to be 0's in between each character. The output for the string "engineer" then looks like the following:

0 101 0 110 0 103 0 105 0 110 0 101 0 101 0 114 0

Using strip() removes the beginning and the end 0's, but not the middle ones. Any thoughts on what format these strings are in and how I can fix it?

I am using Python 2.7.

Comments:
  • Sounds like UTF-16. Commented Jun 11, 2018 at 17:13
  • Not sure why you're getting 0's in between each character, but maybe you could split on them and reassemble your string without the 0s? Commented Jun 11, 2018 at 17:13
  • Show your code and file. Commented Jun 11, 2018 at 17:14

2 Answers

2

This is UTF-16-BE encoding for the string engineer. [1]

UTF-16 uses two bytes for BMP characters (including ASCII characters), so, for example, the character e, which is Unicode (and ASCII) character number 101 (0x65 hex), shows up as the 16-bit code unit 101. In big-endian (that's what the -BE part means), the first byte is 0, and the second byte is 101. So, if your text is pure ASCII, your UTF-16 ends up looking like ASCII with an extra \0 byte before each character.
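To see this byte layout directly, here is a small sketch that encodes the word as UTF-16-BE and lists the byte values. The variable names are only illustrative; the result should match the 0-interleaved numbers in the question, minus the trailing 0.

```python
# Encode a pure-ASCII word as UTF-16-BE and inspect the raw bytes.
word = u"engineer"
encoded = word.encode("utf-16-be")

# bytearray yields integers on both Python 2.7 and 3.x.
byte_values = list(bytearray(encoded))
print(byte_values)
# [0, 101, 0, 110, 0, 103, 0, 105, 0, 110, 0, 101, 0, 101, 0, 114]

# Decoding with the same codec recovers the original text.
decoded = encoded.decode("utf-16-be")
print(decoded == word)  # True
```

Each character becomes a zero byte followed by its ASCII value, which is why a byte-for-byte comparison against a plain "engineer" never matches.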


The cleanest way to solve this is to open the file as a Unicode file. As a general rule, if you decode everything to unicode as part of reading it, encode back to bytes only at the very end as part of writing it, and do all the work in the middle with unicode, everything is simpler.

In Python 2.7, there are two ways to do this: codecs.open or io.open. Using codecs makes your code a bit easier to port to Python 2.5, while using io makes it a bit easier to port to 3.x. Otherwise, it doesn't make a difference in simple cases like this.

Notice that your line strings will now be unicode instead of str, so ideally you'll want your set of search strings to also be unicode values.

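import io

# Search terms are unicode, so they compare equal to the decoded lines.
d = {u'engineer': 0, u'conductor': 0, u'transit cop': 0}
# path is the location of your UTF-16-BE text file.
with io.open(path, encoding='utf-16-be') as f:
    for line in f:
        try:
            # strip() removes the newline and surrounding whitespace.
            d[line.strip()] += 1
        except KeyError:
            pass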
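For comparison, the codecs.open route mentioned above might look like the following sketch. The sample words and the temporary file are made up so the example runs on its own; with real data you would skip that part and pass your own path.

```python
import codecs
import os
import tempfile

# Hypothetical sample data, written as UTF-16-BE to a temporary file.
sample = u"engineer\nconductor\nengineer\npilot\n"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(sample.encode("utf-16-be"))

counts = {u"engineer": 0, u"conductor": 0, u"transit cop": 0}

# codecs.open decodes while reading, so each line is unicode text.
with codecs.open(path, encoding="utf-16-be") as f:
    for line in f:
        word = line.strip()
        if word in counts:
            counts[word] += 1

os.remove(path)
print(counts)
```

With this sample, "engineer" is counted twice, "conductor" once, and "transit cop" stays at zero.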

Another alternative is to read the file as binary UTF-16-BE, and make your search strings UTF-16-BE-encoded str values:

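d = {u'engineer': 0, u'conductor': 0, u'transit cop': 0}
# Encode the keys so they can be compared against the raw bytes.
d = {key.encode('utf-16-be'): count for key, count in d.items()}
# 'rb' reads raw bytes, with no newline translation on any platform.
with open(path, 'rb') as f:
    for line in f:
        try:
            # Strip only the trailing \0\n; keep the leading \0 byte.
            d[line.rstrip('\n\0')] += 1
        except KeyError:
            pass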

Notice that I had to be careful with stripping, to make sure to remove the whole two-byte \0\n at the end instead of just the \n byte, and to not strip off the \0 byte at the start. This is just one of many ways that dealing with encoded bytes is more of a pain than dealing with Unicode. And if your final output is going to involve, say, printing these strings to your console or writing them out to a UTF-8 file, it'll get even more painful. If the final output is going to be another UTF-16-BE file, and if saving a bit of CPU is really important, it might be worth doing it this way. But otherwise, I'd go with the first.
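The stripping pitfall is easy to reproduce without a file. This is a small sketch with illustrative names, using bytes literals so that it behaves the same on Python 2.7 and 3.x:

```python
# One raw line as it would be read from a UTF-16-BE file in binary mode.
raw_line = u"engineer\n".encode("utf-16-be")
target = u"engineer".encode("utf-16-be")

# Stripping both ends also eats the leading \0 of the first character.
both_ends = raw_line.strip(b"\n\0")
# Stripping only the right side removes just the encoded newline.
right_only = raw_line.rstrip(b"\n\0")

print(both_ends == target)   # False
print(right_only == target)  # True

# Decoding first avoids the problem: a plain strip() is enough.
print(raw_line.decode("utf-16-be").strip() == u"engineer")  # True
```

Note that rstrip here only works because the text is pure ASCII; a character whose low byte happens to be 0 or 10 at the end of a line would be damaged, which is another reason to prefer decoding first.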


1. Actually, you've got an extra \0 at the end. But presumably in your real data, that's actually the first byte of the next character—maybe a \n, which, in UTF-16-BE, of course looks like \0\n.


-1

Looks like a job for the regex library https://docs.python.org/3/library/re.html. Match on a suitable regex to get the number of hits per line, then add them all up to get the file-level count:

import re

pattern = re.compile("engine")
# findall returns every non-overlapping match in the string.
print(len(pattern.findall("engine engineers love engineering")))  # prints 3
