Big data and Hadoop
Learn how Hadoop deals with the problems associated with Big Data analysis
Srikanth M V
There are 30 billion pieces of content
shared on Facebook every day.
Wal-Mart handles more than 1
million customer transactions an
hour.
More than 5 billion people are
calling, texting, tweeting and
browsing websites using
smartphones.
The 3-Vs of Big Data
Volume
Gigabytes, terabytes,
petabytes or
zettabytes…
Velocity
The rate at which data
flows into an
organization
Variety
Structured and
Unstructured
So What is Big Data?
• Big data is large and complex data sets collected
from various sources like Sensors, Social Media,
Satellite images, Audio, Video, RFID etc.
• Big data is data that exceeds the processing
capacity of conventional database systems.
• How ‘big’ is big?
 GB, TB, PB, ZB?? No..
 Data is big when it exceeds the organization’s capacity
to handle, store and analyze it.
Problem:
Storing and Analyzing
“Big Data”
Solution*:
Move compute to data
*One among many, but Hadoop is flexible, simple and reliable.
Hey there, I’m
Hadoop and I
can do that
for you..
• Created: 2005
• Creators: Doug Cutting and Mike Cafarella
• Contributors: Apache, Yahoo, Google
• Language: Java
How Hadoop deals with “Big data”
• Primary Components
• HDFS – Hadoop Distributed File System
• MapReduce
• Hadoop YARN
• Job Scheduling and Resource Management
• Hadoop Common
• Common utilities and file-system access libraries used by the other modules
HDFS
• Distributed, scalable,
reliable and portable file
system.
• Hadoop Cluster is a set of
Data Nodes and a Name
Node
• The client divides the data
to be processed into blocks
• Each block of data is
replicated on 3 nodes by default
• More nodes, more
efficiency
• Robust: relies on software
fault tolerance instead of
specialized hardware
[Diagram: an HDFS cluster — one Name Node server (master) and four Data Node servers (slaves). Somefile.txt is split into blocks B1, B2 and B3, and each block is replicated across three of the Data Nodes.]
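The block splitting and replication described above can be sketched in a few lines. This is a conceptual illustration only, not the real HDFS API: the node names and block size are made up, and the round-robin placement stands in for HDFS's actual rack-aware placement policy.

```python
# Conceptual sketch of HDFS-style block splitting and 3x replication.
# Not the real HDFS API; names and block size are illustrative.

BLOCK_SIZE = 4          # bytes, tiny for illustration (HDFS defaults are in MB)
REPLICATION = 3
data_nodes = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Divide the raw bytes into fixed-size blocks (the client's job)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Round-robin placement: each block lands on `replication` distinct nodes."""
    placement = {}
    for b, _ in enumerate(blocks):
        placement[f"B{b + 1}"] = [nodes[(b + r) % len(nodes)]
                                  for r in range(replication)]
    return placement

blocks = split_into_blocks(b"Somefile.txt contents")
print(place_blocks(blocks, data_nodes))
```

Losing any single node leaves at least two live copies of every block, which is why the slide can claim robustness through software rather than hardware.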
MapReduce
• Divide and conquer
• Parallel computing
• Map(): performs sorting &
filtering
• Reduce(): performs the
summary operation
• Each node runs a TaskTracker,
which communicates with
the JobTracker.
• The output files are made
available as local files on the
client.
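The map/filter/sort/reduce flow above can be shown with a minimal in-memory word count — the canonical MapReduce example. This mimics the programming model only; real Hadoop jobs are written against the Java MapReduce API and run distributed across the cluster.

```python
# In-memory sketch of the MapReduce model: map() emits (key, value)
# pairs, the framework sorts/groups by key, reduce() summarizes.

from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: tokenize each line and emit a (word, 1) pair per word."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts per word after the framework's sort/shuffle."""
    shuffled = sorted(pairs, key=itemgetter(0))   # the sort/shuffle step
    counts = {}
    for word, group in groupby(shuffled, key=itemgetter(0)):
        counts[word] = sum(count for _, count in group)
    return counts

lines = ["big data and hadoop", "hadoop handles big data"]
print(reduce_phase(map_phase(lines)))
# → {'and': 1, 'big': 2, 'data': 2, 'hadoop': 2, 'handles': 1}
```

In a real cluster, map tasks run in parallel on the nodes holding the data blocks (compute moves to data), and only the grouped intermediate pairs cross the network to the reducers.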
Hadoop Architecture
Hadoop Secondary Components
• Ambari
– A web tool for provisioning, managing and monitoring clusters
• HBase
– A scalable, distributed database that supports structured data for large tables
• ZooKeeper
– A high-performance coordination service for distributed applications
• Pig
– A high-level data-flow language and execution framework for parallel computation
• Hive
– A data warehouse infrastructure that provides data summarization and ad hoc querying
• Cassandra
– A scalable multi-master database with no single points of failure
• Chukwa
– A data collection system for managing large distributed systems
• Lucene and Solr
– Search engines, currently not part of Hadoop
Real-World Example of Big Data
Analytics using Hadoop
[Diagram: data flows between Facebook users, a MySQL database and a Hadoop cluster, following the numbered steps below.]
1. Users interact with Facebook using data in text, image and video formats.
2. Facebook stores the core data in a MySQL database.
3. The MySQL data is replicated to Hadoop clusters.
4. The data is processed using Hadoop MapReduce functions.
5. The results are transferred back to MySQL.
6. Facebook uses the data to create recommendations for you based on
your interests.
Why should an Enterprise move to Big
Data Analytics?
• Enterprises will be able to
harness relevant data and
use it to make the best
decisions
– Increasing redemption
rates
– Determining optimum prices
– Calculating risks in minutes,
and understanding future
possibilities to mitigate risk
– Enabling new products
– Identifying patterns that
reveal trends in the business
The key lies in collecting quality data, not quantity.
What is in it for us?
Hadoop on Cloud
• Provision scalable storage for Big Data as blobs (PaaS)
• Provision Linux VMs on the cloud (IaaS)
• Language support for JavaScript and C#
• Business Intelligence: connect MS Excel to Hadoop Hive
• Remote access to Hadoop jobs via the WebHCat REST API
• Easy-to-access management portal for monitoring Hadoop jobs
• .NET SDK to execute Hive jobs on HDInsight
• …and more
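As one concrete illustration of the WebHCat REST API mentioned above: a Hive query can be submitted as an HTTP POST to the `/templeton/v1/hive` endpoint. The sketch below only builds the request without sending it; the cluster host name and user are placeholders, and a real call would also need the cluster's authentication.

```python
# Sketch: compose a WebHCat (Templeton) request that submits a Hive
# query. The request is built but deliberately not sent here.

from urllib.parse import urlencode

def build_hive_job_request(host, user, query, statusdir="/example/output"):
    """Return the WebHCat endpoint and form body for a Hive 'execute' job."""
    endpoint = f"https://{host}/templeton/v1/hive"
    body = urlencode({
        "user.name": user,        # user the job runs as
        "execute": query,         # HiveQL statement to execute
        "statusdir": statusdir,   # HDFS directory for job status/output
    })
    return endpoint, body

endpoint, body = build_hive_job_request(
    "mycluster.azurehdinsight.net",   # hypothetical cluster name
    "admin",
    "SELECT COUNT(*) FROM hivesampletable;",
)
print(endpoint)
print(body)
```

WebHCat responds with a job ID that can then be polled for status, which is how tools like the .NET SDK drive Hive jobs remotely.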
Thank you
• Questions?
Vishwanath.srikanth@gmail.com
http://Vishwanathsrikanth.wordpress.com
