Analytics at Scale with Apache Spark on AWS with Jonathan Fritz

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jonathan Fritz, Amazon EMR
June 6, 2017

Agenda
• Integration with Amazon S3 and other AWS services
• Lower costs with Amazon EC2 Spot Instances and Auto Scaling
• Spark Security Tips
• Customer Stories

What is Amazon EMR?
• Low Cost: Pay an hourly rate
• Open-Source Variety: Latest versions of software
• Managed: Spend less time monitoring
• Secure: Easy to manage options
• Flexible: Customize the cluster
• Easy to Use: Launch a cluster in minutes

Many storage layers to choose from
• Amazon DynamoDB
• Amazon RDS
• Amazon Kinesis
• Amazon Redshift
• Amazon S3
• Amazon EMR

Use Spot and Reserved Instances to lower costs
• On-Demand for core nodes: standard Amazon EC2 pricing for On-Demand capacity. Meet SLA at predictable cost.
• Spot for task nodes: up to 80% off EC2 On-Demand pricing. Exceed SLA at lower cost.
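
The pricing split on this slide can be illustrated with simple arithmetic. The sketch below is only an illustration: the function name, node counts, and the $0.50 hourly rate are hypothetical, the 80% figure is the "up to" discount quoted on the slide rather than a guaranteed price, and other charges are ignored.

```python
def blended_hourly_cost(core_nodes, task_nodes, on_demand_rate, spot_discount):
    """Hourly EC2 cost with On-Demand core nodes and discounted Spot task nodes.

    spot_discount is a fraction of the On-Demand rate (0.80 means 80% off).
    """
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    core_cost = core_nodes * on_demand_rate
    task_cost = task_nodes * on_demand_rate * (1.0 - spot_discount)
    return core_cost + task_cost


# Hypothetical cluster: 4 core nodes, 16 task nodes, $0.50 per instance-hour.
all_on_demand = blended_hourly_cost(4, 16, 0.50, 0.0)
with_spot = blended_hourly_cost(4, 16, 0.50, 0.80)

print(f"All On-Demand:  ${all_on_demand:.2f}/hour")
print(f"Spot for tasks: ${with_spot:.2f}/hour")
```

Under these assumed numbers the core nodes cost the same in both cases, so the saving comes entirely from the task nodes.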

Instance fleets for advanced Spot provisioning
Master Node, Core Instance Fleet, Task Instance Fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the optimal Availability Zone based on capacity and price
• Spot Block support
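
As a rough sketch of what a task instance fleet request might look like, the snippet below builds a plain Python dictionary shaped like one `InstanceFleets` entry of the EMR `RunJobFlow` API (as used by boto3's `run_job_flow`). The helper name `build_task_fleet`, the fleet name, the instance types, the capacities, and the timeout values are all hypothetical, and the field names should be checked against the current API reference before use. No AWS call is made here.

```python
def build_task_fleet(instance_types, spot_capacity, block_minutes=None):
    """Return a dict describing a Spot-only task fleet (illustrative values)."""
    spot_spec = {
        # Assumed policy: wait 20 minutes for Spot, then fall back to On-Demand.
        "TimeoutDurationMinutes": 20,
        "TimeoutAction": "SWITCH_TO_ON_DEMAND",
    }
    if block_minutes is not None:
        # Spot Block: request Spot capacity for a fixed duration.
        spot_spec["BlockDurationMinutes"] = block_minutes

    return {
        "Name": "task-fleet",  # hypothetical name
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": 0,
        "TargetSpotCapacity": spot_capacity,
        # One entry per candidate instance type the fleet may provision from.
        "InstanceTypeConfigs": [
            {
                "InstanceType": instance_type,
                "WeightedCapacity": 1,
                "BidPriceAsPercentageOfOnDemandPrice": 100.0,
            }
            for instance_type in instance_types
        ],
        "LaunchSpecifications": {"SpotSpecification": spot_spec},
    }


fleet = build_task_fleet(
    ["m4.xlarge", "r4.xlarge", "c4.2xlarge"], spot_capacity=8, block_minutes=60
)
print(fleet["InstanceFleetType"], len(fleet["InstanceTypeConfigs"]))
```

Listing several instance types is what lets the fleet pick whichever capacity is available, which is the point of the first bullet above.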

Security – Authentication and Authorization
Tag: user = MyUser
IAM user: MyUser
EMR role
EC2 role
SSH key
Application authN

[Diagram labels: Learn, Calibrate, Evaluate; Models; Impressions, Clicks, Activities; Real Time Bidding; S3; ETL; Attribution; Machine Learning; Amazon Kinesis]
• 2 Petabytes Processed Daily
• 2 Million Bid Decisions Per Second
• Runs 24x7 on 5 Continents
• Thousands of ML Models Trained per Day

DataXu Workflow
[Diagram labels: CDN; Real Time Bidding; Retargeting Platform; Amazon Kinesis; Attribution & ML; S3; Reporting; Data Visualization; Data Pipeline; ETL (Spark SQL); Zeppelin notebooks]
Event Data (Facts)
• Impressions
• Activities
• Attributions
Reference Data (Dimensions)
Application Logs
Exceptions Data
Reporting Data

Architecture
• Recommendation API (Python, R, Flask)
• Zillow Group Data Lake (S3 / Kinesis)
• Property Featurization (Spark EMR)
• User Profiles (Spark EMR)
• Ranking (Spark EMR)
• Wedge Counting Collaborative Filtering (Spark EMR)
• Property Aggregate Features (Spark EMR)
• Data Collection Systems (Java/Python/SQL)

Training & scoring
Collect user behavior and real-estate data, train the various models, generate the candidate set, and make predictions.
• User Behavior (Kinesis / S3)
• Public Record (Kinesis / S3)
• Event API (Java)
• Producer (Python)
• Filter (Spark)
• User Store (Hive / S3): Spark job creates Hive table with user events (uid, pid) partitioned by date
• Active Listings (Kinesis / S3)
• Producer (Python)
• Training Data (Spark)
• Training Set (Hive / S3): pid -> uid reverse index; past and current user events
• Models (Python)
• Train Models (Spark)
• Score (Spark)
• Recommendations
• Property Data
• Collaborative Filtering / User Profile Models
• Hashmap (Redis): wedge features or property features (user profile)
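
To make the "pid -> uid reverse index" and "partitioned by date" ideas on this slide concrete, here is a minimal sketch in plain Python. It is not the Spark job from the talk: the function names and the sample events are hypothetical, and small in-memory lists stand in for the Hive tables.

```python
from collections import defaultdict


def partition_by_date(events):
    """Group (uid, pid, date) events by date, like a date-partitioned table."""
    partitions = defaultdict(list)
    for uid, pid, date in events:
        partitions[date].append((uid, pid))
    return dict(partitions)


def build_reverse_index(events):
    """Map each property id (pid) to the sorted user ids (uid) that touched it."""
    index = defaultdict(set)
    for uid, pid, _date in events:
        index[pid].add(uid)
    return {pid: sorted(uids) for pid, uids in index.items()}


# Hypothetical user events: (uid, pid, date).
events = [
    ("u1", "p10", "2017-06-01"),
    ("u2", "p10", "2017-06-01"),
    ("u1", "p20", "2017-06-02"),
]

partitions = partition_by_date(events)
reverse_index = build_reverse_index(events)
print(reverse_index["p10"])
```

A reverse index of this shape lets a job start from a property and find the users who interacted with it, which is the lookup direction collaborative filtering typically needs.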

Ad hoc environment
Scale cluster to accommodate more users

Some of our customers running Spark on EMR
Internet of Things (IoT)

Thank you!
jonfritz@amazon.com
aws.amazon.com/emr
aws.amazon.com/blogs/big-data/
