
I have a PySpark DataFrame where one column is of string type, but the string actually contains a 2D array/list that needs to be exploded into rows. However, since the column is not an ArrayType/StructType, explode cannot be applied to it directly.

This can be seen in the example below:

a = [('Bob', 562,"Food", "[[29,June,2018],[12,May,2018]]"), ('Bob',880,"Food","[[01,June,2018]]"), ('Bob',380,'Household',"[[16,June,2018]]")]
df = spark.createDataFrame(a, ["Person", "Amount","Budget", "Date"])

df.printSchema()

Output:

root
 |-- Person: string (nullable = true)
 |-- Amount: long (nullable = true)
 |-- Budget: string (nullable = true)
 |-- Date: string (nullable = true)
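To illustrate the problem (a minimal check; the exact exception message varies across Spark versions), calling explode on the string column fails with a type mismatch:

from pyspark.sql import functions as F

# explode only accepts array or map columns; on the string column it
# raises an AnalysisException (data type mismatch)
try:
    df.select(F.explode("Date")).show()
except Exception as e:
    print(type(e).__name__)  # AnalysisException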

The output I am looking for is shown below. I need to be able to convert the string into a struct/array type so that I can explode it:

+------+------+---------+---+-----+-----+
|Person|Amount|Budget   |Day|Month|Year |
+------+------+---------+---+-----+-----+
|Bob   |562   |Food     |29 |June |2018 |
|Bob   |562   |Food     |12 |May  |2018 |
|Bob   |880   |Food     |01 |June |2018 |
|Bob   |380   |Household|16 |June |2018 |
+------+------+---------+---+-----+-----+

2 Answers


Start by removing the outer [[ and ]] and splitting the string on every ],[. After the split the data is in an array, which makes it possible to use the explode function. What remains is to format the exploded data into the desired output with another split and getItem.

It can be done as follows:

from pyspark.sql import functions as F

(df
 # strip the outer [[ and ]], then split on ],[ so each date is one element
 .withColumn('date_arr', F.split(F.regexp_replace('Date', r'\[\[|\]\]', ''), r'\],\['))
 # one row per date
 .withColumn('date_arr', F.explode('date_arr'))
 # split each date string into [day, month, year]
 .withColumn('date_arr', F.split('date_arr', ','))
 .select('Person',
         'Amount',
         'Budget',
         F.col('date_arr').getItem(0).alias('Day'),
         F.col('date_arr').getItem(1).alias('Month'),
         F.col('date_arr').getItem(2).alias('Year')))
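For what it's worth, here is an alternative sketch (not part of the original answer; it assumes the tokens never contain brackets or commas, and assumes Spark 2.1+ for from_json): quoting each token turns the string into valid JSON, after which from_json yields a real array<array<string>> column that explode accepts directly.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# quote every token (e.g. 29 -> "29") so the column becomes valid JSON;
# the pattern is an assumption about the data format
quoted = F.regexp_replace('Date', r'([^\[\],]+)', '"$1"')

(df.withColumn('d', F.explode(F.from_json(quoted, ArrayType(ArrayType(StringType())))))
   .select('Person',
           'Amount',
           'Budget',
           F.col('d').getItem(0).alias('Day'),
           F.col('d').getItem(1).alias('Month'),
           F.col('d').getItem(2).alias('Year'))
   .show())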



If the string is always in the format you showed us, you need to perform these steps:

  1. Split the column on ], and turn the resulting array into rows with explode
  2. Remove the remaining [ and ]
  3. Split each date on the comma ,
  4. Assign each member of the new array to its own column.

Writing that out as code:

from pyspark.sql import functions as F

df.select("Person",
          "Amount",
          "Budget",
          F.explode(F.split("Date", '],')).alias("date_array")
         ).select("Person",
                  "Amount",
                  "Budget",
                  F.split(F.translate(F.translate("date_array", '[', ''), ']', ''), ',').alias("date_array")
                 ).select("Person",
                          "Amount",
                          "Budget",
                          F.col("date_array").getItem(0).alias("Day"),
                          F.col("date_array").getItem(1).alias("Month"),
                          F.col("date_array").getItem(2).alias("Year"),
                         ).show()

+------+------+---------+---+-----+----+
|Person|Amount|   Budget|Day|Month|Year|
+------+------+---------+---+-----+----+
|   Bob|   562|     Food| 29| June|2018|
|   Bob|   562|     Food| 12|  May|2018|
|   Bob|   880|     Food| 01| June|2018|
|   Bob|   380|Household| 16| June|2018|
+------+------+---------+---+-----+----+
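A side note on the design: F.translate performs plain per-character substitution, so it removes every [ and ] without any regex escaping, while the pattern passed to F.split is a Java regular expression; the unescaped ] in '],' happens to be treated as a literal because it sits outside a character class.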

