
I have a PySpark DataFrame where one column is of string type, but the string actually contains a 2D array/list that needs to be exploded into rows. However, since the column is not an ArrayType/StructType, explode cannot be applied to it directly.

This can be seen in the example below:

a = [('Bob', 562,"Food", "[[29,June,2018],[12,May,2018]]"), ('Bob',880,"Food","[[01,June,2018]]"), ('Bob',380,'Household',"[[16,June,2018]]")]
df = spark.createDataFrame(a, ["Person", "Amount","Budget", "Date"])

df.printSchema()

Output:

root
 |-- Person: string (nullable = true)
 |-- Amount: long (nullable = true)
 |-- Budget: string (nullable = true)
 |-- Date: string (nullable = true)
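To illustrate the problem (a minimal check; the exact exception message varies across Spark versions), calling explode on the string column fails with a type mismatch:

from pyspark.sql import functions as F

# explode only accepts array or map columns; on the string column it
# raises an AnalysisException (data type mismatch)
try:
    df.select(F.explode("Date")).show()
except Exception as e:
    print(type(e).__name__)  # AnalysisException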

The output I am looking for is shown below. I need to be able to convert the string into a struct/array type so that I can explode it:

+------+------+---------+---+-----+-----+
|Person|Amount|Budget   |Day|Month|Year |
+------+------+---------+---+-----+-----+
|Bob   |562   |Food     |29 |June |2018 |
|Bob   |562   |Food     |12 |May  |2018 |
|Bob   |880   |Food     |01 |June |2018 |
|Bob   |380   |Household|16 |June |2018 |
+------+------+---------+---+-----+-----+

2 Answers


Start by removing the outer [[ and ]] and splitting the string on every ],[. After the split the data is in an array, which makes it possible to use the explode function. What remains is to format the exploded data into the desired output with another split and getItem.

It can be done as follows:

from pyspark.sql import functions as F

(df
 # strip the outer [[ and ]], then split on ],[ so each date is one element
 .withColumn('date_arr', F.split(F.regexp_replace('Date', r'\[\[|\]\]', ''), r'\],\['))
 # one row per date
 .withColumn('date_arr', F.explode('date_arr'))
 # split each date string into [day, month, year]
 .withColumn('date_arr', F.split('date_arr', ','))
 .select('Person',
         'Amount',
         'Budget',
         F.col('date_arr').getItem(0).alias('Day'),
         F.col('date_arr').getItem(1).alias('Month'),
         F.col('date_arr').getItem(2).alias('Year')))
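For what it's worth, here is an alternative sketch (not part of the original answer; it assumes the tokens never contain brackets or commas, and assumes Spark 2.1+ for from_json): quoting each token turns the string into valid JSON, after which from_json yields a real array<array<string>> column that explode accepts directly.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# quote every token (e.g. 29 -> "29") so the column becomes valid JSON;
# the pattern is an assumption about the data format
quoted = F.regexp_replace('Date', r'([^\[\],]+)', '"$1"')

(df.withColumn('d', F.explode(F.from_json(quoted, ArrayType(ArrayType(StringType())))))
   .select('Person',
           'Amount',
           'Budget',
           F.col('d').getItem(0).alias('Day'),
           F.col('d').getItem(1).alias('Month'),
           F.col('d').getItem(2).alias('Year'))
   .show())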



If the string is always in the format you showed us, you need to perform these steps:

  1. Split the column on ], and turn the resulting array into rows with explode
  2. Remove the remaining [ and ]
  3. Split each date on the comma ,
  4. Assign each member of the new array to its own column.

Writing that out as code:

from pyspark.sql import functions as F

df.select("Person",
          "Amount",
          "Budget",
          F.explode(F.split("Date", '],')).alias("date_array")
         ).select("Person",
                  "Amount",
                  "Budget",
                  F.split(F.translate(F.translate("date_array", '[', ''), ']', ''), ',').alias("date_array")
                 ).select("Person",
                          "Amount",
                          "Budget",
                          F.col("date_array").getItem(0).alias("Day"),
                          F.col("date_array").getItem(1).alias("Month"),
                          F.col("date_array").getItem(2).alias("Year"),
                         ).show()

+------+------+---------+---+-----+----+
|Person|Amount|   Budget|Day|Month|Year|
+------+------+---------+---+-----+----+
|   Bob|   562|     Food| 29| June|2018|
|   Bob|   562|     Food| 12|  May|2018|
|   Bob|   880|     Food| 01| June|2018|
|   Bob|   380|Household| 16| June|2018|
+------+------+---------+---+-----+----+
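A side note on the design: F.translate performs plain per-character substitution, so it removes every [ and ] without any regex escaping, while the pattern passed to F.split is a Java regular expression; the unescaped ] in '],' happens to be treated as a literal because it sits outside a character class.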

