From the course: PySpark Essential Training: Introduction to Building Data Pipelines
Challenge: PySpark SQL - Python Tutorial
(upbeat music) - [Trainer] This is the final hands-on challenge of this course, so let's combine several things that we've learned. Assume we want to determine the average total taxi ride cost for each drop-off borough. In other words, how expensive are the rides depending on which borough of New York City they end in? Write code for the following steps:
Step one: load the January taxi ride data into a dataframe called taxi_jan2025 and register a temporary view with the same name.
Step two: load the taxi zone lookup data into a dataframe called taxi_lookup and register a temporary view with the same name.
Step three: use PySpark SQL to left join those two tables on the DOLocationID and LocationID columns. Make sure to select only the DOLocationID, Borough, and total_amount columns, and assign the result to a dataframe named joined_df.
Step four: using the PySpark dataframe syntax, group the result by the Borough column and calculate the average total_amount using the avg method. Make…
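The four steps described so far could be sketched roughly as follows. This is only one possible approach, not the course's own solution: the function name `average_cost_by_borough`, the two path parameters, and the file formats (Parquet for the rides, CSV with a header row for the lookup) are assumptions made for illustration, since the transcript does not name the files. The view names, column names, and `joined_df` come from the challenge text.

```python
# Sketch only: the function name, path parameters, and file formats below are
# assumptions for illustration; they are not stated in the transcript.

# Step three's query: left join the rides to the zone lookup and keep
# only the three columns the challenge asks for.
JOIN_SQL = """
    SELECT t.DOLocationID, z.Borough, t.total_amount
    FROM taxi_jan2025 AS t
    LEFT JOIN taxi_lookup AS z
        ON t.DOLocationID = z.LocationID
"""


def average_cost_by_borough(spark, rides_path, lookup_path):
    """Return a dataframe with the average total_amount per drop-off borough.

    `spark` is an existing SparkSession; both paths are hypothetical inputs.
    """
    # Step one: load the January rides (assumed Parquet) and register a view.
    taxi_jan2025 = spark.read.parquet(rides_path)
    taxi_jan2025.createOrReplaceTempView("taxi_jan2025")

    # Step two: load the zone lookup (assumed CSV with a header row)
    # and register a view under the same name as the dataframe.
    taxi_lookup = spark.read.csv(lookup_path, header=True, inferSchema=True)
    taxi_lookup.createOrReplaceTempView("taxi_lookup")

    # Step three: run the left join through PySpark SQL.
    joined_df = spark.sql(JOIN_SQL)

    # Step four: dataframe syntax, group by Borough and average total_amount.
    return joined_df.groupBy("Borough").avg("total_amount")


# Example usage (placeholder paths, requires a local PySpark installation):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("taxi-challenge").getOrCreate()
#   average_cost_by_borough(spark, "rides.parquet", "lookup.csv").show()
```

Because the join is a left join, rides whose DOLocationID has no match in the lookup table are kept and end up grouped under a null Borough. The aggregated column is named `avg(total_amount)` by default.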