(Not Just Another)
Overview of Apache Hadoop
Bob Wakefield
Principal
bob@MassStreet.net
Twitter:
@BobLovesData
Who is Mass Street?
• Sole proprietor data
consultancy
• Will start providing Big Data
solutions in the near future
• Looking for partners
• Especially Hadoop engineers
This evening’s objectives
1. Convince you that
Hadoop is awesome!
2. Convince you to convince
your boss that Hadoop is
awesome!
Yo Bob! I thought we were
going to learn how to do stuff
with MongoDB tonight!
There is a disconnect
between the hype (and
reality) of “Big Data” and the
number of organizations that
are ready to DO “Big Data”.
How do I know if I should be
taking a look at Hadoop?
If you have to make hard choices
about how much historical data
you are going to store...
you might need Hadoop.
If your analysts are spending more
time fixing clunky ETL processes
that look like they were designed by
Rube Goldberg instead of delivering
results to decision makers...
you might need Hadoop.
If you are doing crazy inappropriate
things with your warehouse load
to get closer to real time analytics...
you might need Hadoop.
If you’re trying to shove
unstructured data into an RDBMS...
you might need Hadoop.
If you’re even thinking about doing
a data warehouse project...
you might need Hadoop.
If you’re spending hundreds of
thousands of dollars on data storage
solutions...
you might need Hadoop.
EDW = $15,000 - $80,000 per TB
Hadoop = $2,000 - $6,000 per TB
Source: Santosh Chitakki, VP of Appfluent
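To put the per-TB figures in boss-friendly terms, here is a rough back-of-the-envelope sketch. The cost ranges are the ones quoted above; the 50 TB data volume and the function name are made up for illustration.

```python
# Per-terabyte cost ranges quoted on the slide (USD).
EDW_COST_PER_TB = (15_000, 80_000)
HADOOP_COST_PER_TB = (2_000, 6_000)


def storage_cost_range(terabytes, cost_per_tb):
    """Return the (low, high) total cost for storing `terabytes` of data."""
    low, high = cost_per_tb
    return terabytes * low, terabytes * high


if __name__ == "__main__":
    data_tb = 50  # hypothetical data volume, not from the slides
    for name, rates in (("EDW", EDW_COST_PER_TB), ("Hadoop", HADOOP_COST_PER_TB)):
        low, high = storage_cost_range(data_tb, rates)
        print(f"{name}: ${low:,} - ${high:,} for {data_tb} TB")
```

Swap in your own data volume to get a number you can take to a budget meeting.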
If you are having a hard time
crunching numbers with the
resources at your disposal...
you might need Hadoop.
What is Hadoop?
• The savior of us all!
• More mature than you think!
• Been around since 2006
• Hadoop is for everybody!
What is Hadoop?
• It’s a paradigm
• It’s a framework
• It’s a collection of software
• It’s a partridge in a pear tree
No Really! What is Hadoop?
• Provides distributed fault tolerant data
storage
• Provides linear scalability on commodity
hardware
• Translation: Take all of your data, throw it
across a bunch of cheap machines, and
analyze it. Get more data? Add more
machines
So you’re going to try to explain
Hadoop to your boss? Just say no to
technobabble.
Top Reasons to Implement Hadoop
Reason #1: Hadoop
makes money
Reason #2: Hadoop
saves money
Reason #1: Hadoop makes money
• Cottage industry growing up
around Big Data
• Turns data into a potential source
of revenue
• Enables the kind of whiz-bang
analysis you’re always hearing
about
Reason #2: Hadoop saves money
• Drastically reduces the cost of
storing data
• Eliminates a lot of the ETL
intensive work found in the old
world
OK Bob. You’ve piqued my interest.
Where do I go to get started with
this stuff?
It’s like a child’s Erector Set!
Node
• A single computer
Rack
• Collection of nodes
• All nodes connected by
a single switch
• Stored close together
• High bandwidth
Cluster
• Collection of racks
• Cluster can consist of a
single node
• Rack awareness
HDFS
• Hadoop Distributed File System
• The data operating system!
• Manages data storage across the nodes in the cluster
• Scalable and highly fault tolerant
HDFS
• Mechanics
• Cuts up data into blocks and spreads them across
nodes
• Replicates blocks across nodes
• Process optimization
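A toy sketch of the two mechanics above, splitting data into blocks and replicating each block across nodes. This is not the HDFS API. The function names, the tiny block size, and the round-robin placement are assumptions made for illustration; real HDFS uses much larger blocks and a rack-aware placement policy. The replication factor of 3 matches the HDFS default.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is an illustrative assumption, not the real HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    blocks = split_into_blocks(b"all of your data goes here", block_size=8)
    placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
    print(placement)
    print(readable_after_failure(placement, "node2"))
```

Because every block lives on more than one machine, losing a single node does not lose data. That is the fault tolerance part of the pitch.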
HDFS
• Components
• NameNode
• DataNode
• A bunch of other nodes
MapReduce
• Based on the model Google published for building its web index
• It’s a low level programming framework for
pulling data out of the cluster
• Communicates with HDFS
• Designed for batch processing
• Can use almost any language to write MapReduce
jobs (via Hadoop Streaming)
• How does MR work? Pffffft!
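For anyone who does want a peek under the hood, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, using the classic word count. It is plain Python, not the Hadoop API, and all function names are made up for illustration. On a real cluster the map and reduce steps run in parallel across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["Hadoop is awesome", "hadoop is cheap"]))
```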
YARN
• Decouples cluster resource management from MapReduce
• Allows you to run other apps besides
MapReduce
Tez
• Distributed execution framework
• Replaces MapReduce as the execution engine
• Written for other frameworks like Hive and
Pig
• Huge performance gains over MapReduce
Hive
• Hadoop warehouse solution
• SQLesque language called Hive Query
Language
• Adds structure to unstructured data
• Provides a window into HDFS
HBase / Cassandra
• Both are column-family NoSQL databases
• There is a difference in how they store data
• Help solve the append-only “problem”.
OODA Loop
Kafka / Storm / Trident
• Kafka – an open source distributed pub/sub
messaging system
• Storm – real time computation framework
• Both are distributed and designed for
horizontal scale
• Guarantees at-least-once processing
• Batch + Real Time = Lambda Architecture
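What "at least once" means in practice: a message is redelivered until it is acknowledged, so a handler may see the same message more than once. The sketch below is plain Python, not the Kafka or Storm API. The function name, the retry limit, and the flaky handler are assumptions made for illustration.

```python
from collections import deque


def deliver_at_least_once(messages, handler, max_attempts=3):
    """Redeliver each message until the handler succeeds or attempts run out.

    Returns the messages that never succeeded. A handler that did its work
    and then failed will be called again, so duplicates are possible.
    """
    pending = deque((message, 1) for message in messages)
    undelivered = []
    while pending:
        message, attempt = pending.popleft()
        try:
            handler(message)  # returning normally counts as the acknowledgement
        except Exception:
            if attempt < max_attempts:
                pending.append((message, attempt + 1))  # retry later
            else:
                undelivered.append(message)
    return undelivered


if __name__ == "__main__":
    processed = []
    failed_once = set()

    def flaky_handler(message):
        processed.append(message)  # the side effect happens first...
        if message == "b" and message not in failed_once:
            failed_once.add(message)
            raise RuntimeError("ack lost")  # ...then the acknowledgement is lost

    deliver_at_least_once(["a", "b", "c"], flaky_handler)
    print(processed)  # "b" shows up twice
```

The takeaway is that downstream processing should tolerate duplicates, for example by making updates idempotent.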
Slide Credit: MapR
Slide Credit: YouTube presentation “Predictive Analytics using Storm, Hadoop, R and AWS”
Honorable Mention
• Sqoop – ETL tool
• Pig – data wrangling tool
• Drill – legit SQL
• Mahout – Java machine learning library
• HCatalog – HDFS abstraction
• SAMOA – real time machine learning
Getting Started
• Hortonworks
• Sandbox
• YouTube
• Elastic MapReduce
• Misnomer!!
• Kansas City Data Engineering at Scale Meetup
Possible Future Topics
• Building a real time analytics solution step by
step
• Streaming machine learning with SAMOA
Story time/Case study?
