I have a PySpark DataFrame in which one column is of string type, but the string itself holds a 2D array/list that needs to be exploded into rows. However, because the column is not a StructType/ArrayType, explode cannot be applied to it directly.
This can be seen in the example below:
a = [('Bob', 562, "Food", "[[29,June,2018],[12,May,2018]]"), ('Bob', 880, "Food", "[[01,June,2018]]"), ('Bob', 380, "Household", "[[16,June,2018]]")]
df = spark.createDataFrame(a, ["Person", "Amount", "Budget", "Date"])
df.printSchema()
Output:
root
|-- Person: string (nullable = true)
|-- Amount: long (nullable = true)
|-- Budget: string (nullable = true)
|-- Date: string (nullable = true)
The output I am looking for is shown below. I need a way to convert the string into a struct/array so that I can explode it:
+------+------+---------+---+-----+-----+
|Person|Amount|Budget |Day|Month|Year |
+------+------+---------+---+-----+-----+
|Bob |562 |Food |29 |June |2018 |
|Bob |562 |Food |12 |May |2018 |
|Bob |880 |Food |01 |June |2018 |
|Bob |380 |Household|16 |June |2018 |
+------+------+---------+---+-----+-----+
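One direction I can think of is plain string manipulation, sketched below (a minimal sketch, assuming the Date string always follows the [[Day,Month,Year],...] pattern with no extra commas inside an entry): strip the outer brackets, split on "],[" and explode, then split each entry on commas. I am not sure whether this is the idiomatic way or whether there is a cleaner conversion to an ArrayType:

from pyspark.sql import functions as F

# Drop the leading "[[" and trailing "]]", split the remainder on "],[" so each
# [Day,Month,Year] group becomes one array element, and explode into rows.
result = (
    df.withColumn("Date", F.regexp_replace("Date", r"^\[\[|\]\]$", ""))
      .withColumn("Date", F.explode(F.split("Date", r"\],\[")))
      .withColumn("Date", F.split("Date", ","))   # "29,June,2018" -> ["29", "June", "2018"]
      .select(
          "Person", "Amount", "Budget",
          F.col("Date")[0].alias("Day"),
          F.col("Date")[1].alias("Month"),
          F.col("Date")[2].alias("Year"),
      )
)
result.show(truncate=False)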