From the course: Complete Guide to Databricks for Data Engineering

Use filter and where transformations in PySpark - Databricks Tutorial

- [Presenter] Reading the files into Databricks is only the first step. A data engineer spends most of the time cleaning the data and preparing it for analysis. So let's see how we can clean this data up and how we can analyze it. For that, let's go back to our customer analysis notebook.

Many times, you get records that you don't want. There might be millions of records in the file, and only a few of them are relevant to you. In that case, you can use the filter function to keep only the records you need. For example, let's say I want only the records where the customer type is one specific category, VIP. I can say df1 = df.filter, and inside it I can write the condition with square brackets, df["customer_type"] == "VIP" (a double equals sign, since this is a comparison), and that's it. After that, I can display this DataFrame and execute. Now you will find that it displays all the records where the customer type is VIP.

This is not the only way to filter records. Databricks allows you to write the same piece of code in multiple ways. For example, from pyspark.sql.functions we can import the col and column functions. Both are used to access a specific column in the filter. Let's see how. The same statement we wrote here can be written as df2 = df.filter, and instead of referring to the column with the DataFrame and square brackets, you can use the col function. Type col, give the column name inside it, and compare it with the specific customer type you are looking for, and let's see how it looks. Now you can see that you can filter this way too. Here, you are getting all the customers whose type is Regular. Similarly, there is a longer version of col.
To use it, replace col with the column function, and it works the same way. For example, I can say Premium and run it. We are still displaying df2 here, so let me change that and display df3. You'll find that there is no customer type like Premium.

So this is how you can use the filter function, and you can write it in different ways. You can write it normally, using square brackets. You can use the col function, or you can use column. In fact, there is one more way to write the same thing. You can say df.filter, and this time use the dot operator: df.customer_type == "VIP". So the idea is that you can access the column in multiple ways, using col, column, the dot operator, or square brackets.

Sometimes you also want the filter to combine multiple conditions. For example, you don't just want the customers of VIP type; you want the customers whose type is VIP and who are located in the U.S. only. How can I express that? I can use the & operator. Each condition goes inside its own parentheses: I say &, then put another condition, df.country == "USA", and then close the bracket. Right now, you can see we are getting VIP customers from all countries. If I execute this, I will get customers only from the U.S. It reports an error on df filter because I forgot to add the dot here. Let me execute it again. Now you can see that we are getting the customers whose type is VIP and whose country is U.S.A.

Similarly, there is one more function for this kind of filtering, called where, so you can do the same thing with the where function as well. You can assign the result to df6.
This time, let's pull all the records whose customer type is VIP and whose country is U.S.A., but instead of the filter function I'm going to use the where function, and let's see whether it works. You will find that the where function works exactly the same.

Just as the & operator gives you an and condition, sometimes you need an or condition, where either one of the conditions may match. For example, I am looking for customers who are either in the VIP category or living in the U.S.A. In that case, I can use the pipe operator (|). The pipe operator works as an or condition, and it picks all the customers whose type is VIP or whose country is U.S.A. Let me execute this, and you can see that we're getting customers from a variety of countries. If you scroll down a little, you'll see both the VIP and Regular customer types, but the Regular customers all come from the U.S.A. So if any one of the conditions is true, the record appears in our results.

So this is how you can use the filter and where functions. Remember, filter and where work exactly the same way, and there are multiple ways to reference the column: square brackets, the col or column function, or the dot operator. Pick whichever style you are comfortable with and keep using it. Let's move to the next video.
