I want to filter a Spark DataFrame so that only rows whose Email column looks like a real address remain. Here's what I tried:

df.filter($"Email" match {case ".*@.*".r => true case _ => false})

But this doesn't work. What is the right way to do it?

1 Answer

To expand on @TomTom101's comment, the code you're looking for is:

df.filter($"Email" rlike ".*@.*")

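The main reason the match doesn't work is that DataFrame has two filter functions, which take either a String (a SQL expression) or a Column. The match expression pattern-matches the Column object itself rather than the value in each row, and it yields a plain Boolean, which neither filter accepts. This is unlike RDD, whose single filter takes a function from T to Boolean.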
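To make the two filter variants concrete, here is a minimal sketch. It assumes Spark 2.x or later (it uses SparkSession, which is not mentioned in the original post), and the object name, helper names, and sample data are made up for illustration. Note that the pattern only checks for an "@" somewhere in the value, so it is a rough test rather than real email validation.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical object name, used only for this illustration.
object EmailFilterSketch {

  // Column-based filter: rlike builds a Column expression that Spark
  // evaluates per row. col(...) is used here so no implicits are needed.
  def keepEmailLike(df: DataFrame): DataFrame =
    df.filter(col("Email").rlike(".*@.*"))

  // String-based filter: the same predicate written as a SQL expression.
  def keepEmailLikeSql(df: DataFrame): DataFrame =
    df.filter("Email RLIKE '.*@.*'")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimenting; adjust master/appName as needed.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("email-filter-sketch")
      .getOrCreate()
    import spark.implicits._ // enables $"..." and toDF

    // Made-up sample data.
    val df = Seq("alice@example.com", "not-an-email", "bob@example.org").toDF("Email")

    keepEmailLike(df).show()
    keepEmailLikeSql(df).show()
    // The $"..." form from the answer, available once the implicits are imported.
    df.filter($"Email" rlike ".*@.*").show()

    spark.stop()
  }
}
```

Both helpers should return the same rows; which one to use is mostly a matter of taste, though the Column form lets the compiler catch more mistakes than the SQL string does.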

3 Comments

Matthew, this bit of code works in the Spark REPL but not in my main file. What import do I need to run this?
@BruceWayne, the REPL configures some things for you that you need to configure yourself in an App. So two Qs: 1) Does your function work with simpler operations (you can successfully create and count a dataframe, for example)? 2) are you running the same Spark version for your REPL and wherever the main file is run?
Matthew, you're right. I needed to add val sqlContext = new org.apache.spark.sql.SQLContext(sc) and import sqlContext.implicits._
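For anyone hitting the same problem outside the REPL, the setup described in the last comment might look roughly like the sketch below. It assumes the SQLContext-based API from that comment (newer Spark versions generally use SparkSession instead); the object name, helper name, app name, and sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical object name, used only for this illustration.
object StandaloneFilterApp {

  // Counts the values that contain an '@', using the $"..." syntax
  // that only compiles once sqlContext.implicits._ is imported.
  def countEmailLike(sc: SparkContext, emails: Seq[String]): Long = {
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // the import the REPL normally provides for you

    // Wrap each value in Tuple1 so toDF has a Product type to work with.
    val df = sc.parallelize(emails.map(Tuple1(_))).toDF("Email")
    df.filter($"Email" rlike ".*@.*").count()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode for a quick check; a real app would get its master from spark-submit.
    val conf = new SparkConf().setAppName("standalone-filter-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Made-up sample data.
    println(countEmailLike(sc, Seq("alice@example.com", "not-an-email")))

    sc.stop()
  }
}
```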
