1
Presto - Analytical Database
Wojciech Biela
Łukasz Osipiuk
https://prestodb.io
2
Who are we?
Center for Hadoop
3
History of Presto
FALL 2008: Facebook open sources Hive
FALL 2012: 6 developers start Presto development
SPRING 2013: Presto rolled out within Facebook
FALL 2013: Facebook open sources Presto
FALL 2014: 88 releases, 41 contributors, 3943 commits
FALL 2015: 132 releases, 105 contributors, 6300 commits; Teradata part of Presto community & offers support
4
What is Presto?
➔ 100% open source distributed ANSI SQL engine for Big Data
➔ Optimized for low latency, interactive querying
◆ Cross-platform query capability, not only SQL on Hadoop
◆ Distributed under the Apache license, now supported by Teradata
◆ Used by a community of well known, well respected technology companies
◆ Modern code base
◆ Proven scalability
5
High level architecture
[Diagram: a Client, a Coordinator (Parser/analyzer, Planner, Scheduler) and several Workers. Three APIs are marked as pluggable: Metadata API, Data location API, and Data stream API (shown next to the Workers).]
6
Plan execution
[Diagram: plan execution compared for Hive and Presto. Hive side: map and reduce stages with I/O steps. Presto side: a graph of tasks with I/O.]
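As a rough illustration of the contrast this slide draws (separate stages that hand data over through I/O, versus tasks that stream rows to each other), here is a minimal Python sketch. It is not Hive or Presto code: the operator names are invented for the example, and a plain in-memory list stands in for the intermediate I/O of the staged model.

```python
from typing import Iterable, Iterator, List


def scan(rows: Iterable[int]) -> Iterator[int]:
    """Source operator: yields rows one at a time."""
    for row in rows:
        yield row


def keep_even(rows: Iterable[int]) -> Iterator[int]:
    """Filter operator: passes through even values only."""
    for row in rows:
        if row % 2 == 0:
            yield row


def double(rows: Iterable[int]) -> Iterator[int]:
    """Projection operator: doubles each value."""
    for row in rows:
        yield row * 2


def run_staged(rows: List[int]) -> List[int]:
    """Staged style: every stage fully materializes its output
    (the lists stand in for intermediate writes) before the next starts."""
    stage1 = list(scan(rows))
    stage2 = list(keep_even(stage1))
    return list(double(stage2))


def run_pipelined(rows: List[int]) -> List[int]:
    """Pipelined style: operators are chained, so each row streams
    through all of them without an intermediate materialization."""
    return list(double(keep_even(scan(rows))))
```

Both functions return the same rows; the difference is only in when intermediate results are materialized.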
7
Presto Extensibility – connector interfaces
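[Diagram: the Coordinator (Parser/analyzer, Planner, Scheduler) and a Worker, with three connector interfaces. Each interface has implementations for Hive, Cassandra, Kafka, MySQL, …]
Metadata API: Hive, Cassandra, Kafka, MySQL, …
Data location API: Hive, Cassandra, Kafka, MySQL, …
Data stream API: Hive, Cassandra, Kafka, MySQL, …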
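One way to picture the three connector interfaces named on this slide is as a small contract that every data source implements. The sketch below is hypothetical Python, not Presto's actual Java SPI: the class name `Connector`, its method names, and the in-memory implementation are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple

Row = Tuple[object, ...]


class Connector(ABC):
    """Hypothetical contract with one method per API named on the slide."""

    @abstractmethod
    def list_columns(self, table: str) -> List[str]:
        """Metadata API: describe a table."""

    @abstractmethod
    def get_splits(self, table: str) -> List[int]:
        """Data location API: enumerate independently readable chunks."""

    @abstractmethod
    def read_split(self, table: str, split: int) -> Iterator[Row]:
        """Data stream API: stream the rows of one chunk."""


class InMemoryConnector(Connector):
    """Toy connector backed by dictionaries (illustration only)."""

    def __init__(self, columns: Dict[str, List[str]],
                 chunks: Dict[str, List[List[Row]]]) -> None:
        self._columns = columns
        self._chunks = chunks

    def list_columns(self, table: str) -> List[str]:
        return list(self._columns[table])

    def get_splits(self, table: str) -> List[int]:
        return list(range(len(self._chunks[table])))

    def read_split(self, table: str, split: int) -> Iterator[Row]:
        return iter(self._chunks[table][split])


def scan_table(connector: Connector, table: str) -> Tuple[List[str], List[Row]]:
    """Resolve metadata, enumerate splits, then read every split."""
    columns = connector.list_columns(table)
    rows: List[Row] = []
    for split in connector.get_splits(table):
        rows.extend(connector.read_split(table, split))
    return columns, rows
```

In this sketch, adding a new data source means writing one more `Connector` subclass; `scan_table` does not change.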
8
Presto Extensibility – plugins
➔ Connectors
➔ Data types
➔ Extra functions
➔ Security providers
9
Some usage facts
➔ Facebook
◆ Multiple production clusters (100s of nodes total)
● Including 300PB Hadoop data warehouse
● Single cluster size order of 10s of nodes
◆ 1000s of internal daily active users
◆ Millions of queries each month
◆ Multiple PBs scanned every day
◆ Trillions of rows a day
◆ ORC format
➔ Netflix
◆ Over 250-node production cluster on EC2
◆ Over 15 PB in S3 (Parquet format)
◆ Over 300 users and 2.5K queries daily
◆ presto-cli, R, Python, BI tools
◆ 50% queries under 4s
10
Netflix Data Pipeline
[Diagram. Components: Suro / Kafka, Cassandra, Aegisthus, Ursula, Amazon S3, TD. Data labels: events, dimensions. Client devices: TVs, mobile, laptop.]
11
Presto use-cases at Facebook
➔ three use cases
◆ Data warehouse - big data
◆ User facing - small data
◆ User facing - medium data
12
Presto use-cases at Facebook (data warehouse)
HDFS data warehouse
13
Presto use-cases at Facebook (data warehouse)
➔ Multiple clusters
➔ O(10^3) users
➔ O(10^6) queries per month
➔ petabytes of data scanned every day
➔ 100s of concurrent queries
14
Presto use-cases at Facebook (data warehouse)
[Diagram. Components: Loader, Client, Presto, Hive, and Data Nodes labeled with Presto or M/R.]
15
Presto use-cases at Facebook (data warehouse)
[Diagram. Components: Client, Dispatcher, and multiple Presto instances.]
16
Presto use-cases at Facebook (realtime)
Real time user facing
17
Presto use-cases at Facebook (realtime)
Requirements
➔ User facing
➔ 0.1-5 seconds latency
➔ Support for data updates
➔ highly available
➔ 10-15 way joins
18
Presto use-cases at Facebook (realtime)
[Diagram. Components: Loader, Client, several Presto nodes, and several mysql instances.]
19
Presto use-cases at Facebook (semi realtime)
Requirements
➔ Large data sets (smaller than warehouse)
➔ seconds to minutes latency
➔ predictable performance
➔ 5-15 minutes load latency
➔ 100s concurrent queries
20
Presto use-cases at Facebook (semi realtime)
Raptor
21
Presto use-cases at Facebook (semi realtime)
Raptor
[Diagram. Components: Client, Loaders, Presto nodes each paired with Flash, mysql, Kafka sources, and Gluster as the backup tier.]
22
Presto use-cases at Facebook (semi realtime)
Raptor
[Diagram. Components: Client, Loaders, Presto nodes each paired with Flash, mysql, Kafka sources, and Gluster as the backup tier. Two annotations are shown:]
INSERT INTO raptor_table
SELECT * FROM kafka_table
WHERE token BETWEEN ${last_token} AND ${next_token}
Mark load in progress in MySQL
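The load step annotated on this slide can be sketched roughly as follows. This is an assumption-laden illustration, not Facebook's loader: the function names, the `state` dictionary (standing in for the bookkeeping row kept in MySQL), and the status strings are invented here. Only the SQL text follows the slide.

```python
from string import Template
from typing import Dict

# Statement from the slide, with its ${...} placeholders kept as-is.
LOAD_SQL = Template(
    "INSERT INTO raptor_table SELECT * FROM kafka_table "
    "WHERE token BETWEEN ${last_token} AND ${next_token}"
)


def plan_load(state: Dict[str, object], next_token: int) -> str:
    """Mark the load as in progress, then build the INSERT for the next range.

    `state` is a hypothetical stand-in for the record kept in MySQL.
    """
    state["status"] = "IN_PROGRESS"
    state["next_token"] = next_token
    return LOAD_SQL.substitute(
        last_token=state["last_token"], next_token=next_token
    )


def finish_load(state: Dict[str, object]) -> None:
    """Record that the range was loaded so the next batch starts after it.

    Note: BETWEEN is inclusive on both ends. How the boundary token is
    handled is not shown on the slide, so this sketch just stores next_token.
    """
    state["last_token"] = state.pop("next_token")
    state["status"] = "DONE"
```

Recording the "in progress" marker before issuing the INSERT lets a restarted loader tell which token range may have been left partially loaded.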
23
Presto use-cases at Facebook (semi realtime)
Extra features
➔ Physical data reorganization
➔ Fully fledged and atomic DDL
➔ Atomic data loading
➔ Tiered architecture
24
Presto = Performance
➔ Data stays in memory during execution and is pipelined across nodes MPP-style
➔ Vectorized columnar processing
➔ Presto is written in highly tuned Java
◆ Efficient in-memory data structures
◆ Very careful coding of inner loops
◆ Bytecode generation
➔ Optimized ORC reader
➔ Predicates push-down
➔ Query optimizer
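To make the predicate push-down point concrete, here is a small Python sketch of the general idea: when a storage format keeps per-chunk statistics, a reader can skip whole chunks that cannot match a filter. This is an illustration under assumptions, not Presto's ORC reader; the `Stripe` class and `scan_greater_than` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Stripe:
    """A chunk of one integer column plus a precomputed max statistic."""
    values: List[int]
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        self.max_value = max(self.values)


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Return values > threshold and the number of stripes actually read.

    The predicate is checked against each stripe's statistic first, so
    stripes that cannot contain a match are skipped without being scanned.
    """
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        if stripe.max_value <= threshold:
            continue  # no value in this stripe can satisfy the predicate
        stripes_read += 1
        matches.extend(v for v in stripe.values if v > threshold)
    return matches, stripes_read
```

With three stripes holding 1-3, 4-6 and 7-9 and a threshold of 5, the first stripe is skipped entirely and only two stripes are read.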
25
How can I contribute?
www.github.com/facebook/presto
www.github.com/prestodb
Certified Distro: www.teradata.com/presto
Website: www.prestodb.io
Presto User’s Group: www.groups.google.com/group/presto-users
Interested in joining Teradata?
● Presto development
● other Hadoop related development and consulting
Contact our Recruitment Partner: Renata Rosłoniec (VBC)
tel. 514 035 237, renata.rosloniec@vbconsulting.pl
26
Wojciech.Biela@teradata.com
Lukasz.Osipiuk@teradata.com

Presto - Analytical Database. Overview and use cases.
