I'm trying to convert the DataFrame below into nested JSON (as a string).

Input:

+---+---+-------+------+
| id|age| name  |number|
+---+---+-------+------+
|  1| 12|  smith|  uber|
|  2| 13|    jon| lunch|
|  3| 15|jocelyn|rental|
|  3| 15|  megan|   sds|
+---+---+-------+------+

Desired output:

+---+---+-----------------------------------------------------------------------------+
|id |age|values                                                                       |
+---+---+-----------------------------------------------------------------------------+
|1  |12 |[{"number": "uber", "name": "smith"}]                                        |
|2  |13 |[{"number": "lunch", "name": "jon"}]                                         |
|3  |15 |[{"number": "rental", "name": "megan"}, {"number": "sds", "name": "jocelyn"}]|
+---+---+-----------------------------------------------------------------------------+

My code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

# Get (or create) the Spark session used below
spark = SparkSession.builder.getOrCreate()

# Sample rows
data = [
    (1, 12, "smith", "uber"),
    (2, 13, "jon", "lunch"),
    (3, 15, "jocelyn", "rental"),
    (3, 15, "megan", "sds"),
]

# Create a schema for the dataframe
schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('age', IntegerType(), True),
    StructField('number', StringType(), True),
    StructField('name', StringType(), True),
])

# Convert list to RDD
rdd = spark.sparkContext.parallelize(data)

# Create data frame
df = spark.createDataFrame(rdd, schema)

I tried using collect_list and collect_set, but was not able to get the desired output.

1 Answer

You can use collect_list and to_json to collect an array of JSON strings for each group:

import pyspark.sql.functions as F

df2 = df.groupBy(
    'id', 'age'
).agg(
    F.collect_list(
        F.to_json(
            F.struct('number', 'name')
        )
    ).alias('values')
).orderBy(
    'id', 'age'
)

df2.show(truncate=False)
+---+---+-----------------------------------------------------------------------+
|id |age|values                                                                 |
+---+---+-----------------------------------------------------------------------+
|1  |12 |[{"number":"smith","name":"uber"}]                                     |
|2  |13 |[{"number":"jon","name":"lunch"}]                                      |
|3  |15 |[{"number":"jocelyn","name":"rental"}, {"number":"megan","name":"sds"}]|
+---+---+-----------------------------------------------------------------------+
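
To sanity-check the result on a small sample without a Spark session, the same grouping can be reproduced in plain Python. This is only a sketch, not part of the Spark API: `group_to_json` is a hypothetical helper name, and it assumes the rows are tuples in the order (id, age, number, name), matching the schema in the question. With that schema "smith" lands in `number` and "uber" in `name`, which is why the output above has the labels swapped relative to the desired output; swapping the two string StructField entries in the schema should give the desired labels.

```python
import json
from collections import defaultdict


def group_to_json(rows):
    """Group (id, age, number, name) tuples by (id, age) and collect one
    compact JSON string per row, mirroring collect_list(to_json(struct(...)))."""
    groups = defaultdict(list)
    for id_, age, number, name in rows:
        # Compact separators give the same spacing as the Spark output above
        groups[(id_, age)].append(
            json.dumps({"number": number, "name": name}, separators=(",", ":"))
        )
    # Sort by (id, age), like the orderBy in the Spark version
    return [(id_, age, values) for (id_, age), values in sorted(groups.items())]


# Same sample rows as in the question
data = [
    (1, 12, "smith", "uber"),
    (2, 13, "jon", "lunch"),
    (3, 15, "jocelyn", "rental"),
    (3, 15, "megan", "sds"),
]

result = group_to_json(data)
for row in result:
    print(row)
```

Unlike this sketch, which keeps input order, Spark does not guarantee the order of elements inside collect_list after a shuffle, so compare groups as sets if the order differs. If a single JSON string per group is wanted instead of an array of strings, applying F.to_json to F.collect_list(F.struct('number', 'name')) may be worth trying in recent Spark versions.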