From the course: PySpark Essential Training: Introduction to Building Data Pipelines
The Apache Spark ecosystem
- [Instructor] Let's take a quick look at the different technologies you might come across when you start working with Spark and PySpark. There are lots of components with similar names, so I want to make sure you can navigate them.

The foundation of Spark is called Spark Core, which provides distributed task dispatching, scheduling, and basic input/output functionality. Spark uses a concept called the resilient distributed dataset, or RDD, which is a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. A dataframe in Spark is a higher-level abstraction on top of those RDDs, optimized for structured, tabular data processing. A dataframe is conceptually the same as a table in a relational database or a pandas dataframe in Python.

On top of Spark Core we have Spark SQL, a Spark module that lets you query data both in RDDs and in external sources, such as relational databases. The…
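To make these layers concrete, here is a minimal PySpark sketch (not taken from the course) that touches each one: Spark Core via the SparkContext, a low-level RDD, a dataframe built on top of that RDD, and a Spark SQL query over it. The app name, column names, and sample rows are illustrative, assuming a local Spark installation.

from pyspark.sql import SparkSession

# Spark Core underpins everything; SparkSession is the single entry point
spark = SparkSession.builder.appName("ecosystem-demo").getOrCreate()

# An RDD: a fault-tolerant, read-only collection distributed across the cluster
rdd = spark.sparkContext.parallelize([("Ada", 36), ("Grace", 45)])

# A dataframe: the higher-level, tabular abstraction built on top of the RDD
people_df = rdd.toDF(["name", "age"])

# Spark SQL: register the dataframe as a temporary view and query it with SQL
people_df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()

Run locally, this prints the rows whose age exceeds 40; the same code runs unchanged on a cluster because the RDD and the dataframe are partitioned across executors.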