From the course: PySpark Essential Training: Introduction to Building Data Pipelines

Example production environment setup

- [Instructor] Okay, now that we have the basic requirements out of the way, let's look at a more concrete example of what running PySpark in production might look like. For this example, I'm going to assume we're using Amazon Web Services (AWS) as our cloud provider. In this setup, we're going to run Spark on a cluster of EC2 instances, and we'll use YARN as the cluster manager. That means you'll need to launch your own virtual machines using the EC2 service, install Spark and Hadoop, configure the nodes to talk to each other, and manage things like memory allocation and resource scheduling yourself. For our distributed storage, we can simply use Amazon S3. You might already be familiar with S3: it's an object storage service in AWS that can store huge amounts of data reliably and cheaply. In this setup, your PySpark job reads data files directly from S3, processes them, and then writes the results back to S3. Spark supports S3 through its Hadoop integration, so it's really easy to integrate. You just…
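To make that concrete, here is a minimal sketch of what such a job could look like. The bucket name, paths, and column names (region, order_date, amount) are all hypothetical, and the sketch assumes the cluster was launched with the Hadoop S3A connector on the classpath and IAM credentials that grant S3 access; on YARN, the master is typically set via spark-submit rather than in the code itself.

from pyspark.sql import SparkSession

# Minimal sketch of the job described above. Bucket, paths, and columns
# are hypothetical; this assumes the hadoop-aws (S3A) connector is on the
# classpath and the EC2 instances have IAM credentials for S3.
spark = (
    SparkSession.builder
    .appName("s3-example-pipeline")
    .getOrCreate()  # on YARN, --master is usually passed to spark-submit
)

# Read input files directly from S3 (s3a:// is the Hadoop S3 connector scheme)
orders = spark.read.parquet("s3a://my-example-bucket/raw/orders/")

# Example transformation: total revenue per region and day
daily_revenue = (
    orders
    .groupBy("region", "order_date")
    .sum("amount")
    .withColumnRenamed("sum(amount)", "revenue")
)

# Write the results back to S3
(
    daily_revenue.write
    .mode("overwrite")
    .parquet("s3a://my-example-bucket/processed/daily_revenue/")
)

spark.stop()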
