NoSQL and SQL Work Side-by-Side
to Tackle Real-time Big Data Needs
Allen Day
MapR Technologies
Me
• Allen Day
– Principal Data Scientist @ MapR
– Human Genomics / Bioinformatics
(PhD, UCLA School of Medicine)
• @allenday
• allenday@allenday.com
• aday@maprtech.com
You
• I’m assuming that the typical attendee:
– is a software developer
– is interested in and familiar with open source
– is familiar with Hadoop, relational DBs
– has heard of or has used some NoSQL technology
Big Data Workloads
• Offline
– ETL
– Model creation & clustering & indexing
– Web Crawling
– Batch reporting
• Online
– Lightweight OLTP
– Classification & anomaly detection
– Stream processing
– Interactive reporting
What is NoSQL? Why use it?
• Traditional storage (relational DBs) is unable to
accommodate the increasing number and variety of
observations
– Culprits: sensors, event logs, electronic payments
• Solution: stay responsive by relaxing ACID storage
requirements
– Denormalize (handles the number)
– Loosen schema (handles the variety), loosen consistency
• This is the essence of NoSQL
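A minimal sketch of what "denormalize" means here, using hypothetical users/orders tables (not from the talk):

    -- Normalized (relational): user fields live in one place; every read pays for a join
    SELECT u.name, o.total
    FROM orders o
    JOIN users u ON u.id = o.user_id;

    -- Denormalized (NoSQL style): user fields are copied into each order record,
    -- so a read touches one row/document; consistency of the copies is what gets relaxed
    SELECT user_name, total
    FROM orders_denormalized;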
NoSQL Impact on Business Processes
• Traditional business intelligence (BI) tech stack
assumes relational DB storage
– Company decisions depend on this (reports, charts)
• NoSQL-collected data aren’t in the relational DB
– Data volume/variety is still increasing
– Tech and methods are still in flux
• Decoupled data storage and decision support
systems
– BI can’t access freshest, largest data sets
– Very high opportunity cost to business
Ideal Solution Features
• Scalable & Reliable
– Distributed replicated storage
– Distributed parallel processing
• BI application support
– Ad-hoc, interactive queries
– Real-time responsiveness
• Flexible
– Handles rapid storage and schema evolution
– Handles new analytics methods and functions
[Stack diagram: SQL Interface on top of Map/Reduce and YARN, on top of Hadoop FS; extensible for NoSQL and advanced analytics]
From Ideals to Possibilities
• Migrate NoSQL data/processing to SQL
– High cost to marshal NoSQL data to SQL storage
– SQL systems lack advanced analytics capabilities
• Migrate SQL data to NoSQL
– Breaks compatibility for BI-dependent functions, e.g.
financial reporting
– Limited support for relational operations (joins)
• high latency
– NoSQL tech is still in flux (a continuity risk)
• Other Approaches?
– Yes. First let’s consider a SQL/NoSQL use case
Impala: Low-Latency Interactive Queries & Hadoop
Example Problem: Marketing Campaign
• Jane is an analyst at an
e-commerce company
• How does she figure
out good targeting
segments for the next
marketing campaign?
• She has some ideas…
…and lots of data
[Diagram: her data sources: user profiles (MongoDB), transaction information (Oracle), access logs (Hadoop)]
Traditional System Solution 1: RDBMS
• ETL the data from
MongoDB and Hadoop
into the RDBMS
– MongoDB data must be
flattened, schematized,
filtered and aggregated
– Hadoop data must be
filtered and aggregated
• Query the data using
any SQL-based tool
[Diagram: user profiles, access logs, and transaction information ETL'd into the RDBMS]
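A rough sketch of the flattening step this implies, with hypothetical field names (a nested MongoDB profile becomes a fixed-column relational row):

    -- A nested document such as
    --   {_id, name, address: {city, zip}, interests: ["ski", "golf"]}
    -- must be flattened into fixed columns before the RDBMS can hold it:
    CREATE TABLE user_profiles (
      id    VARCHAR(24),
      name  VARCHAR(255),
      city  VARCHAR(255),
      zip   VARCHAR(16)
      -- repeated fields like interests need a separate table or get dropped
    );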
Traditional System Solution 2: Hadoop
• ETL the data from
Oracle and MongoDB
into Hadoop
– MongoDB data must be
flattened and
schematized
• Work with the
MapReduce team to
write custom code to
generate the desired
analyses
[Diagram: user profiles, access logs, and transaction information ETL'd into Hadoop]
Traditional System Solution 3: Hive
• ETL the data from
Oracle and MongoDB
into Hadoop
– MongoDB data must be
flattened and
schematized
• But HiveQL queries are
slow and BI tool
support is limited
– Marshaling/Coding
[Diagram: user profiles, access logs, and transaction information ETL'd into Hadoop for Hive]
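For concreteness, the kind of HiveQL Jane might write (hypothetical table and column names); it looks like ordinary SQL, but each query compiles to batch MapReduce jobs, hence the latency:

    SELECT u.segment, COUNT(*) AS purchases
    FROM users u
    JOIN transactions t ON (u.id = t.user_id)
    GROUP BY u.segment;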
What Would Google Do?
             Distributed File System   NoSQL      Interactive analysis   Batch processing
Google:      GFS                       BigTable   Dremel                 MapReduce
Hadoop:      HDFS                      HBase      ???                    Hadoop MapReduce
Build Apache Drill to provide a true open source
solution to interactive analysis of Big Data
Apache Drill Overview
• Interactive analysis of Big Data using standard
SQL
• Fast
– Low latency queries
– Complements native interfaces and
MapReduce/Hive/Pig
• Open
– Community driven open source project
– Under Apache Software Foundation
• Modern
– Standard ANSI SQL:2003 (select/into)
– Nested data support
– Schema is optional
– Supports RDBMS, Hadoop and NoSQL
[Diagram: Apache Drill serves interactive queries and reporting by data analysts (100 ms to 20 min); MapReduce, Hive, and Pig serve data mining, modeling, and large ETL (20 min to 20 hr)]
How Does It Work?
[Diagram: Tableau, MicroStrategy, and Crystal Reports connect through the Drill ODBC driver (or a native Drill client) to a coordinating Drillbit, which parses and plans the SQL query, then distributes it to executor Drillbits]

SELECT * FROM
  oracle.transactions,
  mongo.users,
  hdfs.events
LIMIT 1
How Does It Work?
• Drillbits run on each node, designed to
maximize data locality
• Processing is done outside MapReduce
paradigm (but possibly within YARN)
• Queries can be fed to any Drillbit
• Coordination, query planning, optimization,
scheduling, and execution are distributed
SELECT * FROM
oracle.transactions,
mongo.users,
hdfs.events
LIMIT 1
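Building on the query above, a hedged sketch of what this federation allows: one query joining across storage plugins. The oracle/mongo/hdfs names follow the slide's example; the column names are hypothetical:

    SELECT u.name, t.amount, e.url
    FROM mongo.users u
    JOIN oracle.transactions t ON t.user_id = u.id
    JOIN hdfs.events e ON e.user_id = u.id
    LIMIT 10;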
Apache Drill: Key Features
• Full ANSI SQL:2003 support
– Use any SQL-based tool
• Nested data support
– Flattening is error-prone and often impossible
• Schema-less data source support
– Schema can change rapidly and may be record-specific
• Extensible
– DSLs, UDFs
– Custom operators (e.g. k-means clustering)
– Well-documented data source & file format APIs
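An illustrative sketch of the schema-less point (file path, field names, and exact path syntax are assumptions, not from the talk): records with different fields can be queried together, with missing fields surfacing as NULL:

    -- record 1: {"name": "alice", "clicks": 3}
    -- record 2: {"name": "bob", "referrer": "example.com"}
    SELECT name, clicks, referrer
    FROM hdfs.`/logs/events.json`;
    -- no DDL beforehand; clicks is NULL for record 2, referrer is NULL for record 1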
How Does Impala Fit In?
Impala Strengths
• Beta currently available
• Easy install and setup on top of
Cloudera
• Faster than Hive on some queries
• SQL-like query language
Questions
• Open Source ‘Lite’
• Lacks RDBMS support
• Lacks NoSQL support beyond
HBase
• Early row materialization
increases footprint and reduces
performance
• Limited file format support
• Query results must fit in memory!
• Rigid schema is required
• No support for nested data
• SQL-like (not SQL)
Many important features are “coming soon”.
Architectural foundation is constrained. No community development.
Drill Status: Alpha Available in July
• Heavy active development by multiple organizations
– Contributors from Oracle, IBM Netezza, Informatica, Clustrix, Pentaho
• Available
– Logical plan syntax and interpreter
– Reference interpreter
• In progress
– SQL interpreter
– Storage engine implementations for Accumulo, Cassandra, HBase and
various file formats
• Significant community momentum
– Over 200 people on the Drill mailing list
– Over 200 members of the Bay Area Drill User Group
– Drill meetups across the US and Europe
• Beta: Q3
Why Apache Drill Will Be Successful
Resources
• Contributors have
strong backgrounds
from companies like
Oracle, IBM Netezza,
Informatica, Clustrix
and Pentaho
Community
• Development done in
the open
• Active contributors
from multiple
companies
• Rapidly growing
Architecture
• Full SQL
• New data support
• Extensible APIs
• Full Columnar
Execution
• Beyond Hadoop
Bottom Line: Apache Drill enables NoSQL and SQL to
work side-by-side to tackle real-time Big Data needs
Me
• Allen Day
– Principal Data Scientist @ MapR
• @allenday
• allenday@allenday.com
• aday@maprtech.com
ADDITIONAL SLIDES
Full SQL (ANSI SQL:2003)
• Drill supports SQL (ANSI SQL:2003 standard)
– Correlated subqueries, analytic functions, …
– SQL-like is not enough
• Use any SQL-based tool with Apache Drill
– Tableau, Microstrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, …
– Standard ODBC and JDBC drivers
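For example, the kind of SQL:2003 analytic function that "SQL-like" dialects typically lack (hypothetical sales table): rank each rep within their region by amount:

    SELECT region, rep, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales;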
[Diagram: Tableau, MicroStrategy, Excel, and SAP Crystal Reports connect via the Drill ODBC driver to a Drillbit (SQL query parser, query planner), which distributes execution across Drillbits / Drill workers]
Nested Data
• Nested data is becoming prevalent
– JSON, BSON, XML, Protocol Buffers, Avro, etc.
– The data source may or may not be aware of the nesting
• MongoDB supports nested data natively
• A single HBase value could be a JSON document
(compound nested type)
– Google Dremel’s innovation was efficient columnar
storage and querying of nested data
• Flattening nested data is error-prone and often
impossible
– Think about repeated and optional fields at every
level…
• Apache Drill supports nested data
– Extensions to ANSI SQL:2003
Avro:
  enum Gender {
    MALE, FEMALE
  }
  record User {
    string name;
    Gender gender;
    long followers;
  }

JSON:
  {
    "name": "Homer",
    "gender": "Male",
    "followers": 100,
    "children": [
      {"name": "Bart"},
      {"name": "Lisa"}
    ]
  }
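A sketch of how such a record might be queried (illustrative only: the FLATTEN-style operator, path syntax, and file name are assumptions; Drill's nested extensions were still in progress at the time):

    -- one row per (parent, child) pair from the repeated children field
    SELECT t.name, FLATTEN(t.children) AS child
    FROM dfs.`/users.json` t;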
Schema is Optional
• Many data sources do not have rigid schemas
– Schemas change rapidly
– Each record may have a different schema, may be sparse/wide
• Apache Drill supports querying against unknown schemas
– Query any HBase, Cassandra or MongoDB table
• User can define the schema or let the system discover it
automatically
– System of record may already have schema information
– No need to manage schema evolution
Row Key             CF contents                 CF anchor
"com.cnn.www"       contents:html = "<html>…"   anchor:my.look.ca = "CNN.com"
                                                anchor:cnnsi.com = "CNN"
"com.foxnews.www"   contents:html = "<html>…"   anchor:en.wikipedia.org = "Fox News"
…                   …                           …
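Against the webtable above, a hedged sketch (the plugin name and column-access syntax are assumptions): no schema is declared up front; column families and qualifiers are discovered at read time:

    SELECT row_key, t.anchor
    FROM hbase.webtable t
    WHERE t.contents.html IS NOT NULL;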
Flexible and Extensible Architecture
• Apache Drill is designed for extensibility
• Well-documented APIs and interfaces
• Data sources and file formats
– Implement a custom scanner to support a new source/format
• Query languages
– SQL:2003 is the primary language
– Implement a custom Parser to support a Domain Specific Language
– UDFs
• Optimizers
– Drill will have a cost-based optimizer
– Clean surrounding APIs make it easy to experiment with alternative optimizers
• Operators
– Custom operators can be implemented (e.g. k-Means clustering)
– Operator push-down to data source (RDBMS)
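A hedged sketch of how a custom operator like k-means might surface in a query (the KMEANS_CLUSTER function name, signature, and columns are entirely hypothetical; the slide only says such operators can be implemented):

    -- assign each user to one of 5 clusters, then profile the clusters
    SELECT cluster_id, COUNT(*) AS users, AVG(spend) AS avg_spend
    FROM (
      SELECT KMEANS_CLUSTER(age, spend, visits, 5) AS cluster_id, spend
      FROM mongo.users
    ) t
    GROUP BY cluster_id;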

Editor's Notes

  • #3 Emphasize previous experience in my applied domain, bioinformatics (BFX), and the difficulty of processing queries effectively (stratified experiments on high-dimensional genomic data).
  • #4 I’m assuming that the typical attendee of this talk is a software developer who is interested in open source technologies, is already familiar with Hadoop and relational databases, and has heard of or has some hands-on experience with NoSQL technologies.
  • #5 Note correspondences between offline operation and its online counterpart
  • #6 Call detail records, as we’ve been hearing about in the news around PRISM recently
  • #10 Hive: compile to MR; Aster: external tables in MPP; Oracle/MySQL: export MR results to RDBMS. Drill, Impala, CitusDB: real-time.