
I have a file that contains a single line:

[[1],[2,3]]

I think this is a valid JSON file, and I want to read it in Spark, so I tried:

df = spark.read.json('file:/home/spark/testSparkJson.json')
df.head()
Row(_corrupt_record=u'[[1],[2,3]]')

It seems that Spark failed to parse this file. I want Spark to read it as an array of arrays of longs in a single column, so that I get

df.head()
Row(sequence=[[1], [2, 3]])
df.printSchema()
root
 |-- sequence: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: long (containsNull = true)

How can I do this?

I'm using PySpark on Spark 2.1.0; solutions based on other languages or previous versions are also welcome.

1 Answer


Spark's JSON reader expects each line of the file to contain a single JSON object, but your line contains a top-level array. If you change the file content to:

{"sequence": [[1],[2,3]]}

then Spark will infer the schema you wanted:

>>> spark.read.json("/tmp/sample.json").printSchema()
root
 |-- sequence: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: long (containsNull = true)
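
If changing the file isn't an option, one possible workaround (a sketch, not a definitive solution) is to read the line as plain text, parse it with Python's json module, and apply an explicit schema when building the DataFrame. The path and the "sequence" column name are taken from the question; the textFile/json.loads approach itself is an assumption about what fits your setup:

import json

from pyspark.sql.types import ArrayType, LongType, StructField, StructType

# Explicit schema: one column holding an array of arrays of longs
schema = StructType([
    StructField("sequence", ArrayType(ArrayType(LongType())), True)
])

# Read each line as raw text, parse the JSON array, and wrap it in a
# one-element tuple so it maps onto the single "sequence" column
rdd = spark.sparkContext.textFile("file:/home/spark/testSparkJson.json") \
    .map(lambda line: (json.loads(line),))

df = spark.createDataFrame(rdd, schema)
df.printSchema()
df.head()

This should give the same schema as above and Row(sequence=[[1], [2, 3]]), assuming the file stays a single line as in the question.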