Can't parse simple json with python

Question

I have a very simple JSON string that I can't parse with the simplejson module. Reproduction:

import simplejson as json
json.loads(r'{"translatedatt1":"Vari\351es"}')

Result:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.5/simplejson/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/pymodules/python2.5/simplejson/decoder.py", line 335, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/pymodules/python2.5/simplejson/decoder.py", line 351, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 1 column 23 (char 23)

Does anyone have an idea what's wrong and how to parse the JSON above correctly?

The string that is encoded there is: Variées

P.S. I use Python 2.5.

Thanks a lot!

Martijn Pieters · Accepted Answer · 2013-02-03 15:44:40Z

8

The error is quite correct: Vari\351es contains an invalid escape. The JSON standard does not allow a \ followed by just digits.

Whatever produced that data should be fixed. If that is impossible, you'll need to use a regular expression to either remove those escapes or replace them with valid escapes.

If we interpret 351 as an octal number, it points to the Unicode code point U+00E9, the é character (LATIN SMALL LETTER E WITH ACUTE). You can 'repair' your JSON input with:

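import re

# A backslash followed by octal digits: up to 6 digits for codepoints up to FFFF
invalid_escape = re.compile(r'\\[0-7]{1,6}')

def replace_with_codepoint(match):
    # Drop the leading backslash and read the remaining digits as octal.
    return unichr(int(match.group(0)[1:], 8))

def repair(brokenjson):
    # Swap every invalid escape for the character it stands for.
    return invalid_escape.sub(replace_with_codepoint, brokenjson)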

Using repair() your example can be loaded:

>>> json.loads(repair(r'{"translatedatt1":"Vari\351es"}'))
{u'translatedatt1': u'Vari\xe9es'}
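
As a variation, the invalid escapes can be rewritten as valid JSON \uXXXX escapes instead of being replaced by the decoded character, which keeps the repaired text pure ASCII. This is only a sketch under the same octal assumption; the names octal_escape, to_unicode_escape and repair_with_escapes are made up for illustration, and the stdlib json module is used as a fallback on the assumption that it behaves like simplejson for this input.

```python
import re

try:
    import simplejson as json  # what the question uses
except ImportError:
    import json  # assumed equivalent for this example

# A backslash followed by 1-6 octal digits (same pattern as in the answer).
octal_escape = re.compile(r'\\[0-7]{1,6}')

def to_unicode_escape(match):
    # Assumption: the digits are octal. Emit a JSON \uXXXX escape for them.
    codepoint = int(match.group(0)[1:], 8)
    if codepoint > 0xFFFF:
        # A single \uXXXX escape cannot express this value.
        raise ValueError('codepoint out of range: %r' % match.group(0))
    return '\\u%04x' % codepoint

def repair_with_escapes(brokenjson):
    return octal_escape.sub(to_unicode_escape, brokenjson)

repaired = repair_with_escapes(r'{"translatedatt1":"Vari\351es"}')
data = json.loads(repaired)
```

Here repaired contains Vari\u00e9es, which any JSON parser accepts. Like the pattern in the answer, this one cannot tell an invalid escape apart from a validly escaped backslash that happens to be followed by digits, so it should be tested against real data first.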

You may need to adjust the interpretation of the codepoints; I chose octal (because Variées is an actual word), but you need to test this more with other codepoints.
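
To see why the choice of base matters, here is a small sketch that reads the same digits as octal, decimal and hexadecimal. The helper name interpretations is hypothetical, and the unichr fallback is only there so the snippet also runs on Python 3.

```python
try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3: chr() already returns a Unicode character

def interpretations(digits):
    # Map each candidate base to the character the digits would stand for.
    return dict((base, unichr(int(digits, base))) for base in (8, 10, 16))

for base, char in sorted(interpretations('351').items()):
    print('base %2d -> U+%04X' % (base, ord(char)))
```

Only the octal reading gives U+00E9 (é). Decimal gives U+015F (ş), which is the "Varişes" result mentioned in the comments, and hexadecimal gives U+0351.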

edited Feb 3, 2013 at 15:44

answered Feb 3, 2013 at 15:34

Martijn Pieters


3 Comments

diemacht Over a year ago

This data was produced for/by the Venda platform. Unfortunately, I can't change this behavior. BTW, what would be a valid escape?

diemacht Over a year ago

Thanks, but the result is not what it's supposed to be: after the repair function we get "Varişes", while it should have been "Variées".

Martijn Pieters Over a year ago

@diemacht: Updated already :-) Since you didn't specify what you expected, I had to guess, then updated my guess.

bikeshedder · Accepted Answer · 2013-02-03 15:42:33Z

4

You probably did not intend to use a raw string, but a unicode string?

>>> import simplejson as json
>>> json.loads(u'{"translatedatt1":"Vari\351es"}')
{u'translatedatt1': u'Vari\xe9es'}
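
The two literals differ before the JSON parser ever sees them, which a short sketch can make visible. The variable names are illustrative only.

```python
raw = r'Vari\351es'     # raw literal: backslash, '3', '5', '1' stay as four characters
cooked = u'Vari\351es'  # normal literal: Python itself reads \351 as an octal escape

print(len(raw))     # 10
print(len(cooked))  # 7
print(cooked == u'Vari\xe9es')  # True
```

In the non-raw version the é is already in the string, so the parser never encounters a backslash. Data read from a file or a network response behaves like the raw version.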

If you want to escape the character inside the JSON string, you need to use \uNNNN:

>>> json.loads(r'{"translatedatt1":"Vari\u351es"}')
{'translatedatt1': u'Vari\u351es'}

Please note that the resulting dict is slightly different in this case. When parsing a unicode string simplejson uses unicode strings for the keys. Otherwise it uses byte string keys.

If your JSON data does in fact use \351e, then it is simply broken and not valid JSON.
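
For comparison, this is a sketch of what a valid escape for é looks like: four hex digits after \u, so U+00E9 becomes \u00e9. The stdlib json fallback is an assumption made so the snippet runs where simplejson is not installed.

```python
try:
    import simplejson as json  # what the question uses
except ImportError:
    import json  # assumed equivalent for this example

# Valid JSON: é written as a \u escape with four hex digits.
valid = r'{"translatedatt1":"Vari\u00e9es"}'
data = json.loads(valid)

# Encoding goes the other way: by default non-ASCII characters are escaped.
encoded = json.dumps(u'Vari\xe9es')
```

After this runs, data['translatedatt1'] is the seven-character string Variées, and encoded is the JSON text "Vari\u00e9es" including the surrounding double quotes.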

edited Feb 3, 2013 at 15:42

answered Feb 3, 2013 at 15:34

bikeshedder


4 Comments

diemacht Over a year ago

Can I do it if the string is in some variable, for example: s=r'{"translatedatt1":"Vari\351es"}' ? Thanks!!!

bikeshedder Over a year ago

Just don't create the string that way. Get rid of the r prefix and use u instead if you want to create a string containing unicode data. If you really want to use quoting inside the JSON data you need to use \u351e.

Martijn Pieters Over a year ago

@bikeshedder: I think the OP means that the server sent that data. The r'' makes it easier to show us the raw data sent. Yes, that's broken JSON data...

Martijn Pieters Over a year ago

U+351E is in CJK Ideograph Extension A. I'm not certain that that is the correct interpretation either; I think you were closer with octal.
