From the course: Cloud Hadoop: Scaling Apache Spark
File systems used with Hadoop and Spark
- [Narrator] Let's talk a little bit more about modern file systems for Hadoop. Core HDFS has not changed much. At the time I recorded my Hadoop fundamentals course, we were on Apache Hadoop 2.5, and the most current release as of this recording is 2.7. There have been some improvements to the core distribution around enterprise needs such as encryption, but the core remains much the same, and it's still used for the kinds of batch and extract, transform, and load (ETL) jobs it has traditionally handled.

What's new for Hadoop file systems is the idea of cloud-based file systems. For Amazon that's S3, for Google that's Google Cloud Storage, and for Microsoft Azure that's Blob Storage. This is sometimes called a data lake, and it's really changing the landscape of Hadoop because it removes the need to pull data into a file location and then transport or move it over to HDFS. That enables more use of Hadoop, because there are fewer steps in the process and because cloud-based file systems are cheaper for storing massive amounts of information than HDFS, whether on premises or in the cloud. So this is a really important change in the world of Hadoop, and I see more and more customers interested in applying Hadoop-style processes, MapReduce jobs, or newer processes like Spark jobs to files that they would traditionally have kept in something like S3 and queried with some other mechanism. Examples of these files would be log files from all the nodes of their networks globally, event files from IoT devices, or medical information.

In addition to the cloud-based file systems, there are certain commercial vendors, and we're going to be focusing on Databricks, who I think is really leading the industry now, that are creating enhancements to the HDFS file system and offering alternatives. Their file system is the Databricks File System (DBFS), and we'll be working with that in this course as well.

Now, as we think about file systems we can use with Apache Spark, we have a number of choices. For testing, we could use a file system that is basically local, that is, part of the virtual machine, in our case a cloud-based virtual machine. We're going to be looking at virtual machines that run on Linux, either EC2 on Amazon or Google Compute Engine on Google Cloud. This is a configuration you'd only use for testing, just to make sure that your job or your script works properly; the point of using Hadoop or Spark is to leverage distributed, highly resilient compute. So, as a default for a production job, you would start by looking at the Hadoop file system. If you're using a managed implementation such as Amazon Elastic MapReduce, you'll get an HDFS file system with it, which, as we've seen, is an abstraction over the top of a regular file system. Similarly, if you're using Google Cloud Platform Dataproc, which is managed Hadoop and Spark, you'll get HDFS. Now, if you use Databricks, which is a higher-level abstraction running on, let's say, AWS in this case, you'll get Databricks' more optimized version of the distributed file system, DBFS. I point this out because there are some differences between open source HDFS and DBFS; you're paying for greater optimization.
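To make these choices concrete, here is a minimal PySpark sketch, not taken from the course files, showing how each file system surfaces as a different URI scheme when you read data. The bucket names and paths are hypothetical, and the cloud schemes assume the matching connector (EMRFS on EMR, the GCS connector on Dataproc, or DBFS on Databricks) is available on the cluster.

```python
# Sketch: the same read call, pointed at different file systems via URI scheme.
# All paths and bucket names below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fs-examples").getOrCreate()

# Local file system -- fine for testing on a single VM, not for production.
local_df = spark.read.csv("file:///tmp/sample/logs.csv", header=True)

# HDFS on a managed cluster (EMR or Dataproc); HDFS is usually the default FS,
# so a bare path such as "/data/logs.csv" resolves to the same location.
hdfs_df = spark.read.csv("hdfs:///data/logs.csv", header=True)

# Cloud object storage acting as the data lake.
s3_df = spark.read.csv("s3://my-bucket/logs/2023/*.csv", header=True)    # Amazon EMR
gcs_df = spark.read.csv("gs://my-bucket/logs/2023/*.csv", header=True)   # Dataproc

# Databricks File System (DBFS) when running on Databricks.
dbfs_df = spark.read.csv("dbfs:/mnt/logs/2023/*.csv", header=True)
```

The job logic stays the same in every case; only the storage location, and therefore the cost, durability, and coupling to the cluster, changes.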
So, if you're moving to production on Databricks, you'll want to consult the documentation first, so you understand the differences and can take advantage of them. In addition to using standard HDFS or DBFS, in this course we'll look at some newer architectures built on cloud data lakes. The example we'll look at in the most detail is one my team and I built out for genomics. It uses Amazon Elastic MapReduce, that is, managed Hadoop and Spark, along with Amazon S3 as the object store or data lake, and that replaces working with HDFS for most of the compute. When I say most, this becomes important in the monitoring phases, as we'll see: when we're working with Spark, for example, some of the compute can be done in the memory of the worker nodes, but if you don't configure things properly and you have spillage, that spill is going to go into the HDFS file system. So there's a lot of complexity, and that's why I wanted to take a minute to talk about the different file systems, so that as you begin to move your Spark jobs to production, you can think about what's going to be best for you and understand which file systems will be involved in the implementation of your Spark jobs.
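Here is a minimal PySpark sketch of that EMR-plus-S3 pattern, again with hypothetical bucket names, paths, and configuration values rather than settings from the course project. It shows S3 serving as the data lake for input and output, plus a couple of the memory-related settings that influence whether shuffle work stays in executor memory or spills to disk on the workers.

```python
# Sketch of an EMR + S3 data-lake job: read from and write to S3, keep the
# cluster stateless, and tune memory so less work spills to worker disk.
# Bucket names, paths, and config values are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("genomics-data-lake-sketch")
    # Fraction of executor heap available for execution and storage; shuffles
    # that exceed it spill to local disk on the worker nodes.
    .config("spark.memory.fraction", "0.6")
    # More shuffle partitions means less data per task, which reduces the
    # chance any single task has to spill.
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)

# Read raw data from the S3 data lake instead of copying it into HDFS first.
variants = spark.read.parquet("s3://my-genomics-bucket/raw/variants/")

# A simple aggregation; the shuffle it triggers runs in executor memory and
# spills to disk only if that memory is exhausted.
counts = variants.groupBy("chromosome").count()

# Write results back to S3 so nothing of value lives only on the cluster.
counts.write.mode("overwrite").parquet("s3://my-genomics-bucket/results/variant_counts/")
```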