Extract multiple substrings from column in pyspark

Question

I have a pyspark DataFrame with only one column as follows:

df = spark.createDataFrame(["This is AD185E000834", "U1JG97297 And ODNO926902 etc.","DIHK2975290;HI22K2390279; DSM928HK08", "there is nothing here."], "string").toDF("col1")

I would like to extract the codes in col1 to other columns like:

df.col2 = ["AD185E000834", "U1JG97297", "DIHK2975290", None]
df.col3 = [None, "ODNO926902", "HI22K2390279", None]
df.col4 = [None, None, "DSM928HK08", None]

Does anyone know how to do this? Thank you very much.

@wwnde The logic is to extract the codes as strings of uppercase letters and numbers. It should be "ODNO926902", not "And ODNO926902", because there is a " " separating "And" and "ODNO926902".
– Liselotte, Commented Apr 5, 2022 at 14:05

wwnde · Accepted Answer · 2022-04-05 15:13:18Z

1

I believe this can be shortened; I wrote it out longhand to show my logic. It would have been easier if you had laid out your logic in the question.

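from pyspark.sql import functions as F

# Split each string into an array on whitespace or semicolons, and record its length
df1 = (df.withColumn('k', F.split(F.col('col1'), r'\s|;'))
         .withColumn('j', F.size('k')))

# Maximum array length across rows. It is computed before filtering, so it can
# exceed the number of codes and leave a trailing all-null column (col5 below)
s = df1.agg(F.max('j').alias('max')).collect()[0][0]

df1 = (
    # Keep only the array elements made up entirely of uppercase letters and digits
    df1.withColumn('k', F.expr("filter(k, x -> x rlike('^[A-Z0-9]+$'))"))
    # Convert the resulting array into a struct with one field per position
    .withColumn('k', F.struct(*[
        F.col('k')[i].alias(f'col{i+2}') for i in range(s)
    ]))
)

# Expand the struct into separate columns and join back to df
df.join(df1.select('col1', 'k.*'), how='left', on='col1').show()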

+--------------------+------------+------------+----------+----+
|                col1|        col2|        col3|      col4|col5|
+--------------------+------------+------------+----------+----+
|DIHK2975290;HI22K...| DIHK2975290|HI22K2390279|DSM928HK08|null|
|This is AD185E000834|AD185E000834|        null|      null|null|
|U1JG97297 And ODN...|   U1JG97297|  ODNO926902|      null|null|
|there is nothing ...|        null|        null|      null|null|
+--------------------+------------+------------+----------+----+
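
To see why the output above ends with an all-null col5, the split-then-filter logic can be mimicked in plain Python with the standard `re` module. This is only an illustrative sketch, not Spark code: the `rows` list and the names `tokens`, `codes`, `max_before` and `max_after` are assumptions introduced here, and Python's `re.split` stands in for Spark's `split`.

```python
import re

# Sample strings from the question (a plain Python list, not a DataFrame)
rows = [
    "This is AD185E000834",
    "U1JG97297 And ODNO926902 etc.",
    "DIHK2975290;HI22K2390279; DSM928HK08",
    "there is nothing here.",
]

# Step 1: split each string on whitespace or semicolons
tokens = [re.split(r"\s|;", row) for row in rows]

# Step 2: keep only tokens made up entirely of uppercase letters and digits
codes = [[t for t in toks if re.fullmatch(r"[A-Z0-9]+", t)] for toks in tokens]

# Column count taken before filtering (as in the answer) vs. after filtering
max_before = max(len(toks) for toks in tokens)
max_after = max(len(c) for c in codes)

print(max_before, max_after)  # prints: 4 3
```

Under these assumptions, taking the maximum after the filter step would give three code columns here instead of four.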

edited Apr 5, 2022 at 15:13

answered Apr 5, 2022 at 15:06

wwnde


1 Comment

Pyd Over a year ago

@wwnde can you please help here stackoverflow.com/questions/71827270/…

Ric S · Accepted Answer · 2022-04-05 15:25:15Z

1

As you said in your comment, here we are assuming that your "codes" are strings of at least two characters composed only of uppercase letters and numbers.

That being said, in Spark 3.1+ you can use regexp_extract_all inside an expr call to create a temporary array column holding all the codes, then dynamically create one column for each position in the arrays.

import pyspark.sql.functions as F

# create an array with all the identified "codes"
new_df = df.withColumn('myarray', F.expr("regexp_extract_all(col1, '([A-Z0-9]{2,})', 1)"))

# find the maximum amount of codes identified in a single string
max_array_length = new_df.withColumn('array_length', F.size('myarray')).agg({'array_length': 'max'}).collect()[0][0]
print('Max array length: {}'.format(max_array_length))

# explode the array in multiple columns
new_df.select('col1', *[new_df.myarray[i].alias('col' + str(i+2)) for i in range(max_array_length)]) \
  .show(truncate=False)



Max array length: 3
+------------------------------------+------------+------------+----------+
|col1                                |col2        |col3        |col4      |
+------------------------------------+------------+------------+----------+
|This is AD185E000834                |AD185E000834|null        |null      |
|U1JG97297 And ODNO926902 etc.       |U1JG97297   |ODNO926902  |null      |
|DIHK2975290;HI22K2390279; DSM928HK08|DIHK2975290 |HI22K2390279|DSM928HK08|
|there is nothing here.              |null        |null        |null      |
+------------------------------------+------------+------------+----------+
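
The pattern and the padding step can also be checked outside Spark. The following is a hedged plain-Python sketch, not the Spark implementation: `extract_codes` and `to_columns` are hypothetical helper names, and `re.findall` plays the role that `regexp_extract_all` plays above.

```python
import re

# Same pattern as in the answer: runs of 2+ uppercase letters or digits
CODE_PATTERN = re.compile(r"[A-Z0-9]{2,}")

def extract_codes(text):
    """Return every run of 2+ uppercase letters/digits found in text."""
    return CODE_PATTERN.findall(text)

def to_columns(texts):
    """Pad each list of codes with None up to the length of the longest list."""
    found = [extract_codes(t) for t in texts]
    width = max((len(f) for f in found), default=0)
    return [f + [None] * (width - len(f)) for f in found]

# Sample strings from the question
rows = [
    "This is AD185E000834",
    "U1JG97297 And ODNO926902 etc.",
    "DIHK2975290;HI22K2390279; DSM928HK08",
    "there is nothing here.",
]

for row, cols in zip(rows, to_columns(rows)):
    print(row, cols)
```

Note that this pattern also matches uppercase runs inside longer words: `extract_codes("OKay")` returns `['OK']`. If that matters for your data, adding word boundaries to the pattern is one possible way to tighten it.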

edited Apr 5, 2022 at 15:25

answered Apr 5, 2022 at 15:14

Ric S


3 Comments

Liselotte Over a year ago

Thank you very much. Such a fast and elegant solution if I know the number of codes in col1!

Ric S Over a year ago

@Liselotte you don't need to know in advance how many of them there are. With F.size you count them in max_array_length and then create multiple columns dynamically!

Pyd Over a year ago

@RicS can you please help this stackoverflow.com/questions/71827270/…
