pyspark extracting a string using python

Question

I have a Spark DataFrame with a column emailID, for example [email protected]. I would like to extract the string between the last "." before the "@" and the "@" itself (here, 78uy) and store it in a new column. I tried:

from pyspark.sql.functions import split, size
split_for_alias = split(rs_csv['emailID'], '[.]')
rs_csv_alias = rs_csv.withColumn('alias', split_for_alias.getItem(size(split_for_alias) - 2))

This adds 78uy@testing as the alias. I could add another column and chop off the extra characters, but is it possible to do this in a single statement?

wwnde · Accepted Answer · 2022-02-07 05:17:10Z


Extract the alphanumeric run that comes immediately after a "." and immediately before the "@".

DataFrame

data = [
    (1, "[email protected]"),
    (2, "[email protected]"),
]

df = spark.createDataFrame(data, ("id", "emailID"))

df.show()

+---+--------------------+
| id|             emailID|
+---+--------------------+
|  1|am.shyam.78uy@tes...|
|  2|    [email protected]|
+---+--------------------+

Code

from pyspark.sql.functions import regexp_extract
df.withColumn('name', regexp_extract('emailID', r'(?<=\.)(\w+)(?=\@)', 1)).show()

Outcome

+---+--------------------+----+
| id|             emailID|name|
+---+--------------------+----+
|  1|am.shyam.78uy@tes...|78uy|
|  2|    [email protected]|kilo|
+---+--------------------+----+
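The lookbehind/lookahead pattern can be sanity-checked locally with Python's re module before running it on a cluster. This is only a rough sketch: Spark evaluates the pattern with Java's regex engine, which should behave the same way for this simple pattern on ASCII input, but that is an assumption rather than something verified here. The helper name extract_alias and the sample addresses are made up for illustration.

```python
import re

# Same pattern as in the regexp_extract call, written as a raw string.
PATTERN = re.compile(r"(?<=\.)(\w+)(?=\@)")

def extract_alias(email: str) -> str:
    """Return the word-character run that directly follows a '.' and
    directly precedes '@'.

    Returns an empty string when nothing matches, which is also what
    regexp_extract does for a non-matching row.
    """
    match = PATTERN.search(email)
    return match.group(1) if match else ""

# Made-up addresses, for illustration only.
samples = [
    "first.last.78uy@example.com",  # dot before the alias: matches
    "kilo@example.com",             # no dot before '@': no match
    "no-at-sign.here",              # no '@' at all: no match
]
for s in samples:
    print(repr(s), "->", repr(extract_alias(s)))
```

Note the second sample: because the lookbehind requires a "." directly before the captured text, an address whose local part contains no dot yields an empty string.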


Kevin Kho · Accepted Answer · 2022-02-08 03:00:07Z

We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python. Fugue can then port it to Spark for you with one function call.

First, we set up a Pandas DataFrame to test:

import pandas as pd
df = pd.DataFrame({"id":[1,2],"email": ["[email protected]", "[email protected]"]})

Next, we make a native Python function. The logic is clear this way.

from typing import List, Dict, Any
def extract(df:List[Dict[str,Any]]) -> List[Dict[str,Any]]:
    for row in df:
        email = row["email"].split("@")[0].split(".")[-1]
        row["new_col"] = email
    return df
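Since the function only works on plain lists and dicts, it can also be exercised without Pandas, Fugue, or Spark. The sketch below repeats the same split logic and runs it on two made-up rows (the addresses are hypothetical, chosen only to show both cases).

```python
from typing import Any, Dict, List

def extract(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for row in df:
        # Take everything before '@', then the last dot-separated piece.
        row["new_col"] = row["email"].split("@")[0].split(".")[-1]
    return df

# Hypothetical rows, for illustration only.
rows = [
    {"id": 1, "email": "first.last.78uy@example.com"},
    {"id": 2, "email": "kilo@example.com"},
]
result = extract(rows)
print([r["new_col"] for r in result])  # ['78uy', 'kilo']
```

One behavioural detail worth knowing: when the part before "@" contains no ".", this split-based logic returns the whole local part ("kilo" here), whereas a regular expression that requires a preceding "." would find no match for that row.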

Then we can test on the Pandas engine:

from fugue import transform
transform(df, extract, schema="*, new_col:str")

Because it works, we can bring it to Spark by supplying an engine:

import fugue_spark
transform(df, extract, schema="*, new_col:str", engine="spark").show()

+---+--------------------+-------+
| id|               email|new_col|
+---+--------------------+-------+
|  1|am.shyam.78uy@tes...|   78uy|
|  2|    [email protected]|   kilo|
+---+--------------------+-------+

Note that .show() is needed because Spark evaluates lazily. The transform call accepts both Pandas and Spark DataFrames and returns a Spark DataFrame when the Spark engine is used.
