From the course: PySpark Essential Training: Introduction to Building Data Pipelines
Spark vs. PySpark
- [Instructor] Now that we've introduced Apache Spark, let's take a closer look at PySpark and how it compares to Apache Spark. Apache Spark is written in the Scala programming language. Scala is a general-purpose programming language built on top of the Java Virtual Machine, the JVM. PySpark is the Python API for Spark, so programmers can use Spark's capabilities from within a Python environment. This means that with PySpark, you can efficiently process data using Python and SQL, which makes it a good fit for the many data engineering teams that already use Python and SQL in their data pipelines. You might have heard of Pandas, or maybe you're already using it. PySpark has similar capabilities to Pandas, but it's much better suited to large data sets that benefit from distributed and parallel computation. The main applications of PySpark are building data pipelines to transform your data, data analysis on large data sets, processing real-time streaming data, and machine learning. In addition to the execution engine, PySpark also provides an interactive shell that you can use for data analysis in your terminal. PySpark supports all of the Spark features that I mentioned earlier. So in summary, the biggest difference between Spark and PySpark is the usability for developers and teams who already use Python.
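To make that Python-and-SQL workflow concrete, here is a minimal sketch of a PySpark session that mixes the DataFrame API with a SQL query. The file name sales.csv and the region and amount columns are hypothetical placeholders for illustration.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session from Python.
spark = SparkSession.builder.appName("pyspark-intro").getOrCreate()

# Read a CSV into a distributed DataFrame. It looks like a Pandas
# DataFrame, but the data is partitioned across the cluster.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Transform with the Python DataFrame API...
high_value = df.filter(df["amount"] > 100)

# ...or register a temporary view and query the same data with SQL.
df.createOrReplaceTempView("sales")
totals = spark.sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)

totals.show()
spark.stop()
```

You can run a sketch like this as a script with spark-submit or paste it line by line into the interactive PySpark shell mentioned above.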