From the course: PySpark Essential Training: Introduction to Building Data Pipelines

Writing data

- [Instructor] Until now, we've only worked with DataFrames in memory in our notebook, but in a real-world application, we usually want to write some data back to a database or a file to keep a permanent record of it. Writing data to a file is straightforward in PySpark, as the DataFrame class already has a write method we can use.

Let's create a new output file with the results of a group-by calculation. First, we assign the result of the group by to a new DataFrame for convenience. Then we write the DataFrame to a CSV file. In this case, we choose CSV because it's a really small table, so we don't need a more efficient format like Parquet. This tells PySpark to write the output to a directory called average_fare, include a header row, and, if necessary, overwrite any existing files.

You can check the output of this statement directly from the Google Colab UI. On the left-hand side of your Colab window, navigate to the bottom icon that looks like a folder and click it to open a directory tree. Navigate to Drive, then My Drive, and then the name of your PySpark training directory. You will see a folder called average_fare, which contains a single CSV file whose name starts with "part-". By default, PySpark creates one output file per partition of your data. This is one of the main differences between how PySpark and pandas handle data: PySpark always assumes that your data will be distributed.

In case you're curious about how to write a DataFrame to a database table or a data warehouse instead of a file, I recommend checking out the PySpark documentation to learn more.
