python regex not working on string encoding

Question

I have a string (g) on which I run a simple regex to find a number. The problem is that somehow the regex doesn't work on that string type (encoding?). A "normal" string works however. What am I missing? Please see below the steps as seen on the repl:

(example is a summary of an online poker tournament)

NOT WORKING:

>>> g
'F\x00u\x00l\x00l\x00 \x00T\x00i\x00l\x00t\x00 \x00P\x00o\x00k\x00e\x00r\x00 \x00T\x00o\x00u\x00r\x00n\x00a\x00m\x00e\x00n\x00t\x00 \x00S\x00u\x00m\x00m\x00a\x00r\x00y\x00 \x00$\x002\x00.\x002\x005\x00 \x00H\x00e\x00a\x00d\x00s\x00-\x00U\x00p\x00 \x00S\x00i\x00t\x00 \x00&\x00 \x00G\x00o\x00 \x00(\x002\x005\x000\x005\x005\x005\x009\x001\x004\x00)\x00 \x002\x00-\x007\x00 \x00T\x00r\x00i\x00p\x00l\x00e\x00 \x00D\x00r\x00a\x00w\x00 \x00L\x00i\x00m\x00i\x00t\x00 \x00(\x00T\x00u\x00r\x00b\x00o\x00,\x00 \x00H\x00e\x00a\x00d\x00s\x00 \x00U\x00p\x00)\x00\n\x00B\x00u\x00y\x00-\x00I\x00n\x00:\x00 \x00$\x002\x00.\x001\x002\x00 \x00+\x00 \x00$\x000\x00.\x001\x003\x00\n\x00B\x00u\x00y\x00-\x00I\x00n\x00 \x00C\x00h\x00i\x00p\x00s\x00:\x00 \x001\x005\x000\x000\x00\n\x002\x00 \x00E\x00n\x00t\x00r\x00i\x00e\x00s\x00\n\x00T\x00o\x00t\x00a\x00l\x00 \x00P\x00r\x00i\x00z\x00e\x00 \x00P\x00o\x00o\x00l\x00:\x00 \x00$\x004\x00.\x002\x004\x00\n\x00T\x00o\x00u\x00r\x00n\x00a\x00m\x00e\x00n\x00t\x00 \x00s\x00t\x00a\x00r\x00t\x00e\x00d\x00:\x00 \x002\x000\x001\x003\x00/\x000\x003\x00/\x000\x008\x00 \x000\x006\x00:\x000\x000\x00:\x002\x007\x00 \x00E\x00T\x00\n\x00T\x00o\x00u\x00r\x00n\x00a\x00m\x00e\x00n\x00t\x00 \x00f\x00i\x00n\x00i\x00s\x00h\x00e\x00d\x00:\x00 \x002\x000\x001\x003\x00/\x000\x003\x00/\x000\x008\x00 \x000\x006\x00:\x001\x004\x00:\x003\x000\x00 \x00E\x00T\x00\n\x00\n\x001\x00:\x00 \x00A\x00n\x00d\x00r\x00e\x00y\x003\x003\x001\x000\x00,\x00 \x00$\x004\x00.\x002\x004\x00\n\x002\x00:\x00 \x00s\x00y\x00n\x00t\x00h\x00e\x00s\x00i\x00i\x00s\x00\n\x00s\x00y\x00n\x00t\x00h\x00e\x00s\x00i\x00i\x00s\x00 \x00f\x00i\x00n\x00i\x00s\x00h\x00e\x00d\x00 \x00i\x00n\x00 \x002\x00n\x00d\x00 \x00p\x00l\x00a\x00c\x00e'
>>> myre = re.compile(u"""\(([0-9]+)\)""",re.UNICODE)
>>> m = myre.search(g)
>>> m.groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'

WORKING

>>> g="Full Tilt Poker Tournament Summary $2.25 Heads-Up Sit & Go (250555914) 2-7 Triple Draw Limit (Turbo, Heads Up)"
>>> m = myre.search(g)
>>> m.groups()
('250555914',)

Martijn Pieters · Accepted Answer · 2013-03-11 17:59:22Z

5

You have UTF-16 encoded data, albeit without a BOM (Byte Order Mark). Decode to Unicode first before attempting to match the regex:

>>> g[:-1].decode('utf-16-le')
u'Full Tilt Poker Tournament Summary $2.25 Heads-Up Sit & Go (250555914) 2-7 Triple Draw Limit (Turbo, Heads Up)\nBuy-In: $2.12 + $0.13\nBuy-In Chips: 1500\n2 Entries\nTotal Prize Pool: $4.24\nTournament started: 2013/03/08 06:00:27 ET\nTournament finished: 2013/03/08 06:14:30 ET\n\n1: Andrey3310, $4.24\n2: synthesiis\nsynthesiis finished in 2nd plac'
>>> myre.search(g[:-1].decode('utf-16-le')).groups()
(u'250555914',)

I had to remove the last byte to make that decode though, a null byte at the end was missing. If you are missing data from the end, you are most likely also missing the data from the start, where the BOM would be located. The BOM tells the decoder what variant of UTF-16 was used to encode (little or big endian), without it we need to explicitly tell Python to decode this as little endian.

If you decode the full data, including the BOM, you could just use .decode('utf-16') instead.

If you are reading this from a file, use codecs.open() instead and have Python decode it to Unicode for you:

import codecs

for line in codecs.open('filename.txt', 'r', encoding='utf16'):
    # handle line

because otherwise things like .readlines() splits newlines at the byte level, which are encoded to two bytes just like everything else in UTF-16.

edited Mar 11, 2013 at 17:59

answered Mar 11, 2013 at 15:12

Martijn Pieters

1.1m326 gold badges4.2k silver badges3.4k bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

jule64 Over a year ago

Thank you @Martijn Pieters, I still don't understand why I get this format when I open(file).readlines() instead of the normal utf-16 format. Would you be able to help with that point?

Martijn Pieters Over a year ago

Don't use readlines() on a UTF-16-encoded file! Newlines are encoded as two bytes too, and you are splitting the file now. Use codecs.open() and read Unicode data.

Collectives™ on Stack Overflow

python regex not working on string encoding

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related