
I have a Spark dataframe with a column emailID, e.g. am.shyam.78uy@testing.com. I would like to extract the string between the last "." and the "@", i.e. 78uy, and store it in a new column. I tried:

from pyspark.sql.functions import split, size

split_for_alias = split(rs_csv['emailID'], '[.]')
rs_csv_alias = rs_csv.withColumn('alias', split_for_alias.getItem(size(split_for_alias) - 2))

This stores 78uy@testing as the alias. I could add another column and chop off the extra characters, but is it possible to do this in a single statement?

2 Answers


Extract the alphanumeric token that immediately follows the special character . and immediately precedes the special character @.

DataFrame

data = [
    (1, "am.shyam.78uy@testing.com"),
    (2, "xyz.kilo@testing.com")
]

df = spark.createDataFrame(data, ("id", "emailID"))

df.show()

+---+--------------------+
| id|             emailID|
+---+--------------------+
|  1|am.shyam.78uy@tes...|
|  2|xyz.kilo@testing.com|
+---+--------------------+

Code

from pyspark.sql.functions import regexp_extract

# Lookbehind for '.' and lookahead for '@' capture the token in between.
df.withColumn('name', regexp_extract('emailID', r'(?<=\.)(\w+)(?=@)', 1)).show()

Outcome

+---+--------------------+----+
| id|             emailID|name|
+---+--------------------+----+
|  1|am.shyam.78uy@tes...|78uy|
|  2|xyz.kilo@testing.com|kilo|
+---+--------------------+----+
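If you'd rather avoid regex, the same single-statement extraction can be sketched with the built-in substring_index (assuming the alias is always the last dot-separated token before the @):

from pyspark.sql.functions import substring_index

# Everything before the '@', then the last '.'-separated token of that.
df.withColumn('name', substring_index(substring_index('emailID', '@', 1), '.', -1)).show()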

We made the Fugue project to port native Python or Pandas code to Spark or Dask. It lets you keep the logic very readable by expressing it in native Python; Fugue can then port it to Spark for you with one function call.

First we setup a Pandas DataFrame to test:

import pandas as pd

df = pd.DataFrame({"id": [1, 2],
                   "email": ["am.shyam.78uy@testing.com", "xyz.kilo@testing.com"]})

Next, we write the logic as a native Python function, which keeps it easy to follow:

from typing import Any, Dict, List

def extract(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for row in df:
        # Keep everything before the '@', then take the last '.'-separated token.
        row["new_col"] = row["email"].split("@")[0].split(".")[-1]
    return df
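As a quick sanity check, the function runs on plain Python dicts with no engine involved (the sample row here is just for illustration):

rows = [{"id": 1, "email": "am.shyam.78uy@testing.com"}]
print(extract(rows))
# [{'id': 1, 'email': 'am.shyam.78uy@testing.com', 'new_col': '78uy'}]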

Then we can test on the Pandas engine:

from fugue import transform
transform(df, extract, schema="*, new_col:str")
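On the Pandas engine, transform returns a Pandas DataFrame; for the sample rows above, new_col comes back as 78uy and kilo.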

Because it works, we can bring it to Spark by supplying an engine:

import fugue_spark
transform(df, extract, schema="*, new_col:str", engine="spark").show()
+---+--------------------+-------+
| id|               email|new_col|
+---+--------------------+-------+
|  1|am.shyam.78uy@tes...|   78uy|
|  2|xyz.kilo@testing.com|   kilo|
+---+--------------------+-------+

Note .show() is needed because Spark evaluates lazily. This transform can take in both Pandas and Spark DataFrames and will output a Spark DataFrame if using the Spark engine.
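For example, a minimal sketch of feeding a Spark DataFrame straight in (assuming a live SparkSession named spark; Fugue also accepts the session object as the engine):

sdf = spark.createDataFrame(df)  # lift the Pandas DataFrame into Spark
transform(sdf, extract, schema="*, new_col:str", engine=spark).show()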
