Big Data Analytics
What Is Big Data Analytics?
● Big Data
– Buzzword
– Two definitions:
● Data sets too large for modern relational databases
● Semi-structured/Unstructured data sets
● Analytics
– The science of measuring and discovering patterns and trends with data
Source: http://www.socialtalent.co/blog/big-data-whats-the-big-deal
Data, Data, Everywhere...
● In 2004:
– Internet traffic: 1 Exabyte (that's 134,217,728 8 GB flash drives)
– A lot of other media:
● Newspapers/books/magazines
● DVDs
Data, Data, Everywhere...
● Today:
– Internet traffic: 1.3 Zettabytes (that's roughly 178.7 billion 8 GB flash drives)
● 110.3 exabytes per month
– Even more media:
● Mobile devices (phones/tablets/mp3 players/etc)
● The Internet of Things
● Streaming Media
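The flash-drive comparisons on these two slides can be checked with a few lines of arithmetic. The sketch below assumes binary units (1 GB = 2^30 bytes, 1 EB = 2^60 bytes, 1 ZB = 2^70 bytes), which is the convention that reproduces the 2004 figure exactly; with the same convention, 1.3 ZB works out to about 178.7 billion drives. The function and constant names are illustrative only.

```python
# Binary (power-of-two) units; an assumption that matches the 2004 figure.
GB = 2**30
EB = 2**60
ZB = 2**70

FLASH_DRIVE_BYTES = 8 * GB  # capacity of one 8 GB flash drive


def flash_drives_needed(total_bytes: float) -> float:
    """Return how many 8 GB flash drives it takes to hold total_bytes."""
    return total_bytes / FLASH_DRIVE_BYTES


drives_2004 = flash_drives_needed(1 * EB)      # 2004 Internet traffic
drives_today = flash_drives_needed(1.3 * ZB)   # "today" Internet traffic

print(f"1 EB   = {drives_2004:,.0f} drives")
print(f"1.3 ZB = {drives_today:,.0f} drives")
print(f"Growth factor: {drives_today / drives_2004:,.1f}x")
```

Switching to decimal units (1 GB = 10^9 bytes) changes the counts but not the growth factor, since the drive size cancels out of the ratio.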
The Internet of Things
● How many of you have...
– Fitness trackers?
– E-readers?
– iPods?
● Tie them to social sites (e.g., Facebook)?
The Internet of Things
● You're being tracked!
● So what?
– Marketing
– Medical
– Government
● Building a fuller picture from what's tracked.
Social Network Integration
Six Degrees of Separation
Source: http://www.83toinfinity.com
Source: http://www.math.cornell.edu/~numb3rs/blanco/social_net.jpg
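"Six degrees of separation" can be measured on a social graph as the length of the shortest friendship chain between two people. A minimal sketch using breadth-first search follows; the graph, the names in it, and the function name are all made up for illustration and do not come from any real network.

```python
from collections import deque

# Hypothetical friendship graph (adjacency lists); purely illustrative data.
FRIENDS = {
    "Ann": ["Bob", "Cara"],
    "Bob": ["Ann", "Dev"],
    "Cara": ["Ann", "Dev"],
    "Dev": ["Bob", "Cara", "Eve"],
    "Eve": ["Dev", "Finn"],
    "Finn": ["Eve"],
    "Gus": [],  # not connected to anyone
}


def degrees_of_separation(graph, start, goal):
    """Return the fewest friendship hops from start to goal, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])  # (person, hops taken to reach them)
    while queue:
        person, hops = queue.popleft()
        for friend in graph.get(person, []):
            if friend == goal:
                return hops + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None  # no chain of friends connects the two people


print(degrees_of_separation(FRIENDS, "Ann", "Finn"))
print(degrees_of_separation(FRIENDS, "Ann", "Gus"))
```

Breadth-first search visits people in order of distance, so the first time it reaches the goal it has found the shortest chain.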
Data Storage
● Relational Databases
– Structured data
– Can scale to huge volumes of data
● Hadoop
– Semi-structured/unstructured data
– Massively parallel storage and processing
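Hadoop's classic processing model is MapReduce: a map step turns each record into key/value pairs, the pairs are grouped by key, and a reduce step combines each group. The sketch below imitates those three steps in plain Python on a toy word count. It does not use any Hadoop API, runs on a single machine, and its function names and input lines are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Toy input; on a real cluster each line could live on a different node.
lines = ["big data analytics", "big data tools", "data everywhere"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the calls can in principle run on many machines at once, which is what makes the model suit massively parallel processing.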
Relational Database
Source: http://www.ntu.edu.sg/home/ehchua/programming/sql/images/ManyToOne.png
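A many-to-one relationship like the one pictured can be sketched with Python's built-in `sqlite3` module. The table and column names here (`departments`, `employees`, `department_id`) are hypothetical and are not taken from the diagram; the point is only that many rows in one table reference a single row in another through a foreign key.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

# Hypothetical schema: many employees belong to one department.
conn.executescript("""
    CREATE TABLE departments (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE employees (
        id            INTEGER PRIMARY KEY,
        first_name    TEXT NOT NULL,
        last_name     TEXT NOT NULL,
        department_id INTEGER NOT NULL REFERENCES departments(id)
    );
""")

conn.execute("INSERT INTO departments (id, name) VALUES (1, 'Analytics')")
conn.executemany(
    "INSERT INTO employees (first_name, last_name, department_id) VALUES (?, ?, ?)",
    [("John", "Doe", 1), ("Anna", "Smith", 1), ("Peter", "Jones", 1)],
)

# Join the "many" side back to the "one" side and count rows per department.
rows = conn.execute("""
    SELECT d.name, COUNT(*)
    FROM employees AS e
    JOIN departments AS d ON d.id = e.department_id
    GROUP BY d.name
""").fetchall()

print(rows)
```

The fixed columns and the foreign key are what make this data "structured": every row has the same shape, and the database rejects an employee that points at a department that does not exist.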
Unstructured Data
Source: http://storagegaga.com/2011/12/
Semi-structured
Source: http://www.stylusstudio.com/images/figures/sql_xml_xml_fragment.gif
What Solution to Pick?
● Data Volume and Speed
– Relational Databases Will Cap Out
– “Big Data” Stores Scale (For Now)
● Hadoop
● Spark
● Lucene
– Alternative Modeling Techniques
● Hyper Normalized (6-8NF)
– Inmon's Textual Disambiguation
– Anchor Modeling
– Data Vault
Hadoop
● Version 1
– Giant data store
– File distribution
– File parsing tools
– Generic security
● Version 2
– Giant data store
– Replaced foundation work
– Unified security (LDAP/Kerberos support)
Tools
● Oozie
● Hive
● NoSQL Databases
– HBase
– MongoDB
JSON
{
"employees": [
{ "firstName":"John" , "lastName":"Doe" },
{ "firstName":"Anna" , "lastName":"Smith" },
{ "firstName":"Peter" , "lastName":"Jones" }
]
}
Source: http://www.w3schools.com/json/json_syntax.asp
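Semi-structured data such as the JSON document above can be read with Python's standard `json` module. This is a minimal sketch; the variable names are illustrative, and the document is embedded as a string only to keep the example self-contained.

```python
import json

# The same document shown on the slide, embedded as a string.
document = """
{
  "employees": [
    { "firstName": "John",  "lastName": "Doe" },
    { "firstName": "Anna",  "lastName": "Smith" },
    { "firstName": "Peter", "lastName": "Jones" }
  ]
}
"""

data = json.loads(document)  # parse the text into dicts and lists

# Each employee is a dict; fields are looked up by name, not by column position.
full_names = [f"{e['firstName']} {e['lastName']}" for e in data["employees"]]

print(full_names)
```

Unlike a relational table, nothing stops one employee object from carrying extra or missing fields, so code that reads real-world JSON often uses `dict.get` with a default instead of direct indexing.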
How to Analyze?
● Performance
● Timeliness
● Accuracy
● Feedback
“Big Data” Solutions
● Search the entire data set
● Great performance
● Highly accurate
● Integrates into Analytics tools
– Only some of the tools are able to support Hadoop, etc.
Statistics
● Designed for all sizes of data sets
● Decreases time to results
● As accurate as needed
● Fully supported by analytics tools
● Supported by most “Big Data” tools
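One way statistics "decreases time to results" while staying "as accurate as needed" is sampling: estimate a quantity from a random subset and report a margin of error, instead of scanning every record. The sketch below assumes a synthetic data set of made-up values and a normal-approximation 95% interval; the function name, sample size, and data are all illustrative.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Synthetic "full data set": 200,000 made-up measurements.
population = [random.gauss(100, 15) for _ in range(200_000)]


def estimate_mean(data, sample_size):
    """Estimate the mean from a random sample, with a ~95% margin of error."""
    sample = random.sample(data, sample_size)
    mean = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / sample_size ** 0.5
    return mean, 1.96 * std_error  # normal-approximation interval


estimate, margin = estimate_mean(population, 10_000)
true_mean = statistics.fmean(population)  # full scan, for comparison only

print(f"Sample estimate: {estimate:.2f} +/- {margin:.2f}")
print(f"Full-scan mean:  {true_mean:.2f}")
```

Quadrupling the sample size roughly halves the margin of error, so the sample can be sized to whatever accuracy the question needs.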
Analytics Tools
● Can access data of most sizes
– Most can handle Hadoop and some NoSQL databases
● Built for Predictive Modeling
● Starting to handle social/network modeling
How to Get Started
● Grab some tools!
– RapidMiner (http://rapidminer.com/)
– R (http://www.r-project.org/)
– Weka (http://www.cs.waikato.ac.nz/ml/weka/)
● Grab some data!
– http://www.kdnuggets.com/datasets/index.html
– http://aws.amazon.com/publicdatasets/
– http://www.reddit.com/r/datasets
Prizes/Challenges
● Kaggle - https://www.kaggle.com/
● MIT - http://bigdata.csail.mit.edu/challenge
● Heritage Health Prize - http://www.heritagehealthprize.com/c/hhp
● Twitter - @OpenDataAlex
● LinkedIn - alexmeadows
● Github - dbaAlex
Questions? Comments?
