Presto SQL Engine: what’s new?
Strata Hadoop 2016 San Jose, CA
2
What is Presto?
100% open source distributed SQL query engine
Originally developed by Facebook
Key Differentiators:
Performance & Scale
Cross platform query capability, not only SQL on Hadoop
Apache licensed, hosted on GitHub
Certified distro & support from Teradata
3
Brief history of Presto
• FALL 2008: Facebook open sources Hive
• FALL 2012: 6 developers start Presto development
• SPRING 2013: Presto rolled out within Facebook
• FALL 2013: Facebook open sources Presto
• FALL 2014: 88 Releases, 41 Contributors, 3943 Commits
• SPRING 2016: 141 Releases, 116 Contributors, 6879 Commits
4
Presto in Production
• Facebook
– Multiple production clusters (100s of nodes total)
- Massive 300 PB Hadoop data warehouse
- Very large sharded MySQL installation
- Growing usage of Raptor SSD-based storage
– 1000s of internal daily active users
– 10-100s of concurrent queries
• Netflix
– Over 200-node production cluster on EC2
– Over 25 PB in S3 (Parquet format)
– Over 350 active users and 3K queries daily
5
Presto Architecture
[Diagram: Client; Coordinator (Parser/analyzer, Planner, Scheduler); Workers]
Pluggable interfaces:
• Metadata API
• Data location API
• Data stream API
6
Presto Extensibility – connectors
[Diagram: Coordinator (Parser/analyzer, Planner, Scheduler) and Worker, with three connector APIs]
• Metadata API: HDFS/S3, NoSQL, DBMS, Custom, …
• Data location API: HDFS/S3, NoSQL, DBMS, Custom, …
• Data stream API: HDFS/S3, NoSQL, DBMS, Custom, …
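The three connector APIs on this slide can be pictured with a small sketch. This is not Presto's actual connector interface (Presto's is written in Java); the class and method names below (`Connector`, `columns`, `splits`, `rows`, `InMemoryConnector`, `scan`) are hypothetical and chosen only for illustration. The mapping of each API to the parser/analyzer, scheduler, and worker is an assumption based on the diagram layout.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple

Row = Tuple[object, ...]


class Connector(ABC):
    """Hypothetical connector exposing the three pluggable APIs from the slide."""

    @abstractmethod
    def columns(self, table: str) -> List[str]:
        """Metadata API: describe a table (assumed to serve the parser/analyzer)."""

    @abstractmethod
    def splits(self, table: str) -> List[int]:
        """Data location API: list units of work (assumed to serve the scheduler)."""

    @abstractmethod
    def rows(self, table: str, split: int) -> Iterator[Row]:
        """Data stream API: stream the rows of one split (assumed to serve a worker)."""


class InMemoryConnector(Connector):
    """Toy 'Custom' connector that serves tables from a dict in fixed-size splits."""

    def __init__(self, tables: Dict[str, Tuple[List[str], List[Row]]], split_size: int = 2):
        self._tables = tables
        self._split_size = split_size

    def columns(self, table: str) -> List[str]:
        return list(self._tables[table][0])

    def splits(self, table: str) -> List[int]:
        n = len(self._tables[table][1])
        # One split id per chunk of split_size rows (ceiling division).
        return list(range((n + self._split_size - 1) // self._split_size))

    def rows(self, table: str, split: int) -> Iterator[Row]:
        data = self._tables[table][1]
        start = split * self._split_size
        yield from data[start:start + self._split_size]


def scan(connector: Connector, table: str) -> List[Row]:
    """Read every split in turn; a distributed engine would spread splits across workers."""
    out: List[Row] = []
    for split in connector.splits(table):
        out.extend(connector.rows(table, split))
    return out


connector = InMemoryConnector({"users": (["id", "name"], [(1, "a"), (2, "b"), (3, "c")])})
```

Calling `scan(connector, "users")` reads both splits and returns the three rows in order. Adding a new data source in this sketch means writing another `Connector` subclass, without changing `scan`.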
7
Supported data sources & file formats
• Hadoop/Hive connector & file formats:
– HDFS & S3 + HCatalog
– ORC, RCFile, Parquet, SequenceFile, Text
• Open source data stores:
– MySQL & PostgreSQL (non-parallel)
– Cassandra
– Kafka
– Redis
• In development by community:
– MongoDB
– ElasticSearch
– HBase
8
Presto – Query Execution Performance
• In-memory processing
• Pipelined execution across nodes MPP-style
• Vectorized columnar processing
• Multithreaded execution keeps all CPU cores busy
• Presto is written in highly tuned Java
– Efficient flat-memory data structures (minimizes GC)
– Very careful coding of inner loops
– Runtime bytecode generation
• Optimized ORC & Parquet readers
• Excellent performance with interactive SQL analytics
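As a rough illustration of the "columnar processing" and "flat-memory data structures" bullets, the sketch below contrasts a row-oriented layout with a columnar one. It is written in Python for brevity and shows only the data layout and access pattern, not Presto's Java implementation or its performance; all names (`rows`, `price`, `qty`, `sum_price_rows`, `sum_price_columns`) are hypothetical.

```python
from array import array

# Row-oriented layout: one Python object per record.
rows = [
    {"price": 10.0, "qty": 1},
    {"price": 20.0, "qty": 5},
    {"price": 30.0, "qty": 7},
]

# Columnar layout: one flat, typed array per column.
price = array("d", (r["price"] for r in rows))
qty = array("l", (r["qty"] for r in rows))


def sum_price_rows(rows, min_qty):
    """Row-at-a-time: visit each record and read both fields from it."""
    return sum(r["price"] for r in rows if r["qty"] >= min_qty)


def sum_price_columns(price, qty, min_qty):
    """Column-at-a-time: filter on one column, then aggregate another."""
    # Build a selection vector of matching positions from the qty column only.
    selected = [i for i, q in enumerate(qty) if q >= min_qty]
    # Aggregate the price column at the selected positions.
    return sum(price[i] for i in selected)
```

Both functions return the same total. The columnar version reads only the columns it needs from contiguous typed arrays, which is the property that lets a tuned engine keep inner loops tight and allocate fewer objects.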
9
ANSI SQL Support
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM table1 [ [ INNER | OUTER ] JOIN table2 ON (…) ] ]
[ WHERE condition ]
[ GROUP BY expression [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count | ALL ] ]
In addition:
• Windowing functions
• Statistical and approximate aggregate functions
• UNNEST, TABLESAMPLE
In development:
• Complex subqueries
• EXISTS, INTERSECT, EXCEPT
• ROLLUP, CUBE
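A query exercising most of the clauses in the grammar above might look like the following sketch. To keep it runnable without a Presto cluster, it uses Python's built-in `sqlite3` module as a stand-in engine; the query itself sticks to standard SQL. The tables `orders` and `customers` and their columns are hypothetical, and the query has not been run against Presto here.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER, region TEXT);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES (1, 10.0), (1, 15.0), (2, 40.0), (3, 5.0);
""")

# WITH, JOIN ... ON, WHERE, GROUP BY, HAVING, ORDER BY and LIMIT in one statement.
QUERY = """
WITH totals AS (
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
)
SELECT c.region, SUM(t.total) AS region_total
FROM customers c
INNER JOIN totals t ON (c.id = t.customer_id)
WHERE t.total > 1
GROUP BY c.region
HAVING SUM(t.total) >= 30
ORDER BY region_total DESC
LIMIT 10
"""

result = conn.execute(QUERY).fetchall()
print(result)  # [('west', 40.0), ('east', 30.0)]
```

The common table expression computes a total per customer, and the outer query rolls those totals up by region, keeps regions with at least 30 in total, and sorts them in descending order.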
10
Deployment models
• Cluster deployment models for Presto:
– on premise (appliance or commodity clusters)
– VM (OpenStack, etc.)
– cloud (Amazon, etc.)
• Types of Hadoop deployments:
– on Hadoop/YARN cluster (all or subset of nodes)
– on a dedicated cluster
– mixed
11
Open source initiative
• Announced in June 2015 at Hadoop Summit
– Growing interest and adoption
• Collaboration with Facebook and Presto community
– Joint development, conference talks, meetups and webinars
• Major commitment from Teradata Labs:
– 20 full-time engineers
– Free and open source contributions
– Enterprise-ready distribution
"A special shout out goes to Teradata — which joined the Presto community this year
with a focus on enhancing enterprise features and providing support — for having
seven of our top 10 external contributors."
— Facebook
12
Teradata Contributions to Presto
Phase 1 (June 8, 2015): Implement
Phase 2 (Q4 2015): Integrate
Phase 3 (2016): Proliferate
• Installer
• Documentation
• Monitoring & Support Tools
• Management Tool Integration
• YARN Integration
• ODBC Driver
• JDBC Driver
• BI Certification
• Security
• Cloud features
Commercial Support
Expanding ANSI SQL Coverage
13
Recent developments and roadmap
• Q1 release:
– Fully-featured ODBC & JDBC drivers
– Kerberos support
– DECIMAL support
• Later 2016:
– BI tools certification
– Run TPC-H and TPC-DS queries unmodified
– Spill to disk
14
BI Tools certifications
15
Teradata QueryGrid™ - Multi-System Analytics
[Diagram: Entry Points and Targets]
• Systems shown: Teradata Database, Aster Analytics, Presto, Hadoop (Hive / HDFS), other databases, NoSQL databases
• Categories: Integrated Data Warehouses, Multi-Genre Advanced Analytics™, Multiple Hadoop SQL Query Engines and Distributions, 3rd Party Relational DBs, Non-Relational DBs
• Presto Connectors (Teradata Certified / Community Supported): Apache Kafka, Apache Cassandra, MySQL, PostgreSQL, Redis, Presto API
16
More information
Certified Distro: www.teradata.com/presto
Website: www.prestodb.io
Presto Users Group: www.groups.google.com/group/presto-users
GitHub:
www.github.com/prestodb/presto
www.github.com/Teradata/presto
www.github.com/prestodb
17
www.teradata.com/presto
