0

I have a pyspark DataFrame with only one column as follows:

df = spark.createDataFrame(["This is AD185E000834", "U1JG97297 And ODNO926902 etc.","DIHK2975290;HI22K2390279; DSM928HK08", "there is nothing here."], "string").toDF("col1")

I would like to extract the codes in col1 to other columns like:

df.col2 = ["AD185E000834", "U1JG97297", "DIHK2975290", None]
df.col3 = [None, "ODNO926902", "HI22K2390279", None]
df.col4 = [None, None, "DSM928HK08", None]

Does anyone know how to do this? Thank you very much.

1
  • @wwnde The logic is to extract the codes as strings of uppercase letters and numbers. It should be "ODNO926902", not "And ODNO926902" because there is " " seperating "And" and "ODNO926902". Commented Apr 5, 2022 at 14:05

2 Answers 2

1

I believe this can be shortened. Went long hand to give you my logic. Would have been easier if you laid down your logic in the question

#split string into array
df1=df.withColumn('k', split(col('col1'),'\s|\;')).withColumn('j', size('k'))

#compute maximum array length
s=df1.agg(max('j').alias('max')).distinct().collect()[0][0]


df1 =(df1.withColumn('k',expr("filter(k, x -> x rlike('^[A-Z0-9]+$'))"))#Filter only non alphanumeric characters in the array
     
      #Convert resulting array into struct to allow split
      .withColumn(
    "k",
    F.struct(*[
        F.col("k")[i].alias(f"col{i+2}") for i in range(s)
    ])
))

#Split struct column in df1 and join back to df
df.join(df1.select('col1','k.*'),how='left', on='col1').show()

+--------------------+------------+------------+----------+----+
|                col1|        col2|        col3|      col4|col5|
+--------------------+------------+------------+----------+----+
|DIHK2975290;HI22K...| DIHK2975290|HI22K2390279|DSM928HK08|null|
|This is AD185E000834|AD185E000834|        null|      null|null|
|U1JG97297 And ODN...|   U1JG97297|  ODNO926902|      null|null|
|there is nothing ...|        null|        null|      null|null|
+--------------------+------------+------------+----------+----+
Sign up to request clarification or add additional context in comments.

1 Comment

@wwnde can you please help here stackoverflow.com/questions/71827270/…
1

As you said in your comment, here we are assuming that your "codes" are strings of at least two characters composed only by uppercase letters and numbers.

That being said, as of Spark 3.1+, you can use regexp_extract_all with an expr function to create a temporary array column with all the codes, then dynamically create multiple columns for each entry of the arrays.

import pyspark.sql.functions as F

# create an array with all the identified "codes"
new_df = df.withColumn('myarray', F.expr("regexp_extract_all(col1, '([A-Z0-9]{2,})', 1)"))

# find the maximum amount of codes identified in a single string
max_array_length = new_df.withColumn('array_length', F.size('myarray')).agg({'array_length': 'max'}).collect()[0][0]
print('Max array length: {}'.format(max_array_length))

# explode the array in multiple columns
new_df.select('col1', *[new_df.myarray[i].alias('col' + str(i+2)) for i in range(max_array_length)]) \
  .show(truncate=False)



Max array length: 3
+------------------------------------+------------+------------+----------+
|col1                                |col2        |col3        |col4      |
+------------------------------------+------------+------------+----------+
|This is AD185E000834                |AD185E000834|null        |null      |
|U1JG97297 And ODNO926902 etc.       |U1JG97297   |ODNO926902  |null      |
|DIHK2975290;HI22K2390279; DSM928HK08|DIHK2975290 |HI22K2390279|DSM928HK08|
|there is nothing here.              |null        |null        |null      |
+------------------------------------+------------+------------+----------+

3 Comments

Thank you very much. Such a fast and elegant solution if I know the number of codes in col1!
@Liselotte you don't need to know in advance how many of them there are. With F.size you count them in max_array_length and then create multiple columns dynamically!
@RicS can you please help this stackoverflow.com/questions/71827270/…

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.