Hassnain Ali 15081598-066
Nadeem Tahir 15081598-106
What is Big Data?
“Big data is the data characterized by 4 key
attributes: volume, variety, velocity and
value.”
-- Oracle
Let’s look at
Big Data
in a different way.
Byte : one grain of rice
Kilobyte : cup of rice
Megabyte : 8 bags of rice
Gigabyte : 3 semi trucks
Terabyte : 2 container ships
Petabyte : blankets Manhattan
Exabyte : blankets west coast states
Zettabyte : fills the Pacific Ocean
Yottabyte : an Earth-size rice ball!
Labels added alongside the scale, in order of appearance: Hobbyist, Desktop, Internet, BigData, The Future?
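For reference, each unit on this scale can be written as a power of 1,000. The sketch below assumes SI decimal prefixes (1 kilobyte = 1,000 bytes); the names `UNITS`, `unit_sizes` and `humanize` are illustrative and do not come from the slides.

```python
# Unit names in ascending order; each step is assumed to be a factor of 1,000 (SI prefixes).
UNITS = [
    "byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
    "petabyte", "exabyte", "zettabyte", "yottabyte",
]


def unit_sizes():
    """Map each unit name to its size in bytes."""
    return {name: 1000 ** power for power, name in enumerate(UNITS)}


def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1,000."""
    value = float(num_bytes)
    for name in UNITS[:-1]:
        if value < 1000:
            return f"{value:.1f} {name}s"
        value /= 1000
    return f"{value:.1f} {UNITS[-1]}s"


if __name__ == "__main__":
    for name, size in unit_sizes().items():
        print(f"1 {name} = {size:,} bytes")
    print(humanize(2_500_000_000_000))
```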
Big Data is not about the size of the data,
it’s about the value within the data.
We are generating huge
amounts of data.
Data with a
lot of information.
… and a lot of noise.
The ability to hear the signal
from the noise is the key…
to unlocking the human conversation
that is taking place around us.
Did it work?
Most people don’t know
what to do with all the data
that they already have…
Get Big
by starting
small
Big Data isn’t big, if you know
how to use it.
Storing Big Data
• Data plays an increasingly important role in
business and science.
• Storing, searching, sharing, analysing and visualising big
data has become a challenge.
• Storing data in particular is often overlooked as an
issue. Note that sometimes a MySQL database is not
enough.
• Hadoop offers an out-of-the-box distributed filesystem for
storing data files. However, the challenge appears when
someone needs DB capabilities, frequent updates or real
Problems Nowadays
• Traditional relational databases can reach their
performance limits.
• Data keeps arriving at high velocity, in high volumes, and in high
variety.
• Common practices to increase performance fail after a while:
buying a faster server, getting more RAM, using materialised
views, fine-tuning queries...
• Furthermore, “alter table” doesn’t really work with lots of
data. Backups and data availability become an issue.
NoSQL
• The term is too broad and new to really define it.
• No schema
• No joins between tables
• No common scripting language (like SQL)
• No ACID (atomicity, consistency, isolation, durability)
• On the other hand you gain horizontal scalability and high performance.
Also, most NoSQL systems are Map/Reduce ready and/or bind with
Hadoop.
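The Map/Reduce model mentioned above can be sketched in plain Python as a word count. This is a single-process illustration only, not Hadoop's or any NoSQL system's API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle and reduce over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["Big data is big", "Data has value"]))
```

In a real cluster the map and reduce steps would run in parallel on different machines, which is where the horizontal scalability comes from.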
MongoDB Example:
A document is represented in a JSON-like format (ISODate is MongoDB shell notation):
{
"_id" : 12345678,
"Link" : "http://news.scotsman.com/abc.html",
"Title" : "Blah blah blah",
"Content" : "More blah blah",
"OutletID" : 14,
"Date" : ISODate("2011-11-17T20:33:15.097Z"),
"Hash" : 550973592,
"Tags" : ["International", "News", "Scotland"]
}
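A document like this can be built and serialised with the Python standard library alone. The sketch below is an illustration and does not use a MongoDB driver. It assumes the identifier field is MongoDB's conventional `_id` and stores the date as an ISO 8601 string; `build_article` and `to_json` are hypothetical helper names.

```python
import json
from datetime import datetime, timezone


def build_article(article_id, link, title, content, outlet_id, published, tags):
    """Assemble one article document as a plain dictionary."""
    return {
        "_id": article_id,  # assumed field name, following MongoDB convention
        "Link": link,
        "Title": title,
        "Content": content,
        "OutletID": outlet_id,
        "Date": published,
        "Tags": list(tags),
    }


def to_json(document):
    """Serialise a document, writing datetime values as ISO 8601 strings."""
    def encode(value):
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"Cannot serialise {type(value).__name__}")

    return json.dumps(document, default=encode)


if __name__ == "__main__":
    article = build_article(
        12345678,
        "http://news.scotsman.com/abc.html",
        "Blah blah blah",
        "More blah blah",
        14,
        datetime(2011, 11, 17, 20, 33, 15, 97000, tzinfo=timezone.utc),
        ["International", "News", "Scotland"],
    )
    print(to_json(article))
```

Because there is no fixed schema, a second article could carry extra or fewer fields without any change to the helper that serialises it.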
MongoDB - Replication
Master/Slave
Single Server
MongoDB - Sharding
If a new shard is added, data is balanced automatically.
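The idea of spreading data across shards can be illustrated with a simple hash-based router. This is a deliberately simplified sketch and not MongoDB's actual balancing algorithm, which migrates chunks of a shard-key range between shards. The names `shard_for` and `distribute` are hypothetical.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard index for a key using a stable hash of its string form."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def distribute(keys, shard_count):
    """Assign every key to one of shard_count shards."""
    shards = [[] for _ in range(shard_count)]
    for key in keys:
        shards[shard_for(key, shard_count)].append(key)
    return shards


if __name__ == "__main__":
    keys = range(1000)
    for count in (3, 4):  # adding a fourth shard redistributes the keys
        sizes = [len(shard) for shard in distribute(keys, count)]
        print(f"{count} shards -> sizes {sizes}")
```

A plain modulo scheme like this moves most keys when the shard count changes, which is one reason production systems prefer range chunks or consistent hashing.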
Data Processing
• Without data processing, organizations cannot make use of the
massive amounts of data that could help them gain a competitive
edge and give them insight into sales, marketing strategies and
consumer needs. It is imperative that companies large and small
understand the necessity of data processing.
• Data processing occurs when data is collected and translated
into usable information.
The Six Stages of Data Processing
• Data Collection
• Data Preparation
• Data Input
• Processing
• Data Output/Interpretation
• Data Storage
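The six stages can be sketched as a chain of small functions. The sample records and every function name below are made up for illustration; a real pipeline would read from and write to actual systems.

```python
def collect():
    """Stage 1, collection: gather raw records (hard-coded sample values here)."""
    return ["12", " 7", "n/a", "30 "]


def prepare(raw_records):
    """Stage 2, preparation: trim whitespace and drop records that are not numeric."""
    cleaned = [record.strip() for record in raw_records]
    return [record for record in cleaned if record.isdigit()]


def to_input(prepared_records):
    """Stage 3, input: convert cleaned text into a machine-usable form."""
    return [int(record) for record in prepared_records]


def process(values):
    """Stage 4, processing: compute the figures of interest."""
    return {"count": len(values), "total": sum(values)}


def interpret(result):
    """Stage 5, output/interpretation: present the result in readable form."""
    return f"{result['count']} valid records, total {result['total']}"


def store(report, archive):
    """Stage 6, storage: keep the report for later use."""
    archive.append(report)
    return archive


def run_pipeline(archive):
    """Run all six stages in order and return the final report."""
    report = interpret(process(to_input(prepare(collect()))))
    store(report, archive)
    return report


if __name__ == "__main__":
    print(run_pipeline([]))
```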
The Future of Data Processing
The future of data processing lies in the cloud. Cloud technology
builds on the convenience of current electronic data processing
methods and accelerates its speed and effectiveness. Faster,
higher-quality data means more data for each organization to
utilize and more valuable insights to extract.
Big data tools:
1. Apache Hadoop
2. Microsoft HDInsight
3. NoSQL
4. Hive
5. Sqoop
6. PolyBase
7. Big data in Excel
8. Presto
Big Data Techniques
Quantitative Analysis
Quantitative analysis is a data analysis technique that focuses on quantifying
the patterns and correlations found in the data. Based on statistical practices,
this technique involves analyzing a large number of observations from a dataset.
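One common way to quantify a correlation is the Pearson coefficient. The sketch below is one possible illustration, not a method prescribed by the slides; the function name `pearson` is an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / sqrt(spread_x * spread_y)


if __name__ == "__main__":
    ad_spend = [1, 2, 3, 4, 5]       # made-up observations
    sales = [12, 19, 33, 41, 48]     # made-up observations
    print(round(pearson(ad_spend, sales), 3))
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.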
Qualitative Analysis
Qualitative analysis is a data analysis technique that focuses
on describing various data qualities using words. It involves
analyzing a smaller sample in greater depth compared to
quantitative data analysis. These analysis results cannot be
generalized to an entire dataset due to the small sample size.
DATA MINING
Data mining, also known as data discovery, is a specialized form of
data analysis that targets large datasets. In relation to Big Data
analysis, data mining generally refers to automated, software-based
techniques that sift through massive datasets to identify patterns and
trends.
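As a small illustration of automated pattern finding, the sketch below counts which pairs of items occur together most often. It is a toy version of frequent-itemset mining with made-up data, and the name `frequent_pairs` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that appear together in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        # Sort and de-duplicate so each pair is counted once per transaction.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    baskets = [
        ["bread", "milk"],
        ["bread", "milk", "eggs"],
        ["milk", "eggs"],
    ]
    print(frequent_pairs(baskets, min_support=2))
```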
STATISTICAL ANALYSIS
Statistical analysis uses statistical methods based on mathematical formulas as a
means for analyzing data. Statistical analysis is most often quantitative, but can also be
qualitative. This type of analysis is commonly used to describe datasets via
summarization, such as providing the mean, median, or mode of statistics associated
with the dataset.
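The summary statistics named above are available in Python's standard `statistics` module. The wrapper `summarize` and the sample values are illustrative only.

```python
import statistics


def summarize(values):
    """Describe a dataset by its mean, median and mode."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }


if __name__ == "__main__":
    page_views = [2, 3, 3, 5, 7]  # made-up sample
    print(summarize(page_views))
```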
MACHINE LEARNING
Humans are good at spotting patterns and relationships within data.
Unfortunately, we cannot process large amounts of data very quickly.
Machines, on the other hand, are very adept at processing large amounts of
data quickly, but only if they know how.
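One simple way a machine can "learn how" from examples is to fit a straight line to known data points and use it to predict new values. The least-squares sketch below is only one illustration of the idea; `fit_line` and `predict` are hypothetical names.

```python
def fit_line(xs, ys):
    """Learn slope and intercept from example points by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    if spread_x == 0:
        raise ValueError("all x values are identical; no line can be fitted")
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = covariance / spread_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a learned (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # made-up training examples
    print(model, predict(model, 10))
```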
SEMANTIC ANALYSIS
A fragment of text or speech data can carry different meanings in different
contexts, whereas a complete sentence may retain its meaning, even if
structured in different ways. In order for the machines to extract valuable
information, text and speech data need to be understood by the machines
in the same way as humans do. Semantic analysis represents practices for
extracting meaningful information from textual and speech data.
VISUAL ANALYSIS
Visual analysis is a form of data analysis that involves the
graphic representation of data to enable or enhance its visual
perception. Based on the premise that humans can
understand and draw conclusions from graphics more quickly
than from text, visual analysis acts as a discovery tool in the
field of Big Data.
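Even a plain-text bar chart shows how a graphic form makes differences easier to spot than a list of numbers. The sketch below uses made-up tag counts, and `bar_chart` is a hypothetical helper rather than any particular plotting library.

```python
def bar_chart(counts, width=20):
    """Render a dict of label -> value as horizontal text bars scaled to width."""
    peak = max(counts.values())
    if peak <= 0:
        raise ValueError("at least one value must be positive")
    lines = []
    for label, value in counts.items():
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:<15} {bar} {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    tag_counts = {"International": 120, "News": 300, "Scotland": 75}  # made-up data
    print(bar_chart(tag_counts))
```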
Intro to big data and how it works
