© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jonathan Fritz, Amazon EMR
June 6, 2017
Analytics at Scale with Apache Spark on AWS
Agenda
• Integration with Amazon S3 and other AWS services
• Lower costs with Amazon EC2 Spot instances and Auto Scaling
• Spark Security Tips
• Customer Stories
What is Amazon EMR?
Low Cost
Pay an hourly rate
Open-Source Variety
Latest versions of software
Managed
Spend less time monitoring
Secure
Easy to manage options
Flexible
Customize the cluster
Easy to Use
Launch a cluster in minutes
Many storage layers to choose from
Amazon DynamoDB
Amazon RDS
Amazon Kinesis
Amazon Redshift
Amazon S3
Amazon EMR
Use Spot and Reserved Instances to lower costs
Spot for task nodes: up to 80% off EC2 On-Demand pricing
On-Demand for core nodes: standard Amazon EC2 pricing for On-Demand capacity
Meet SLA at predictable cost
Exceed SLA at lower cost
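As a rough illustration of the pricing split above, the sketch below estimates the hourly EC2 cost of a cluster that runs core nodes On-Demand and task nodes on Spot. The hourly price, the node counts, and the flat 80% discount are placeholder assumptions (the slide says "up to 80%", so real savings vary), and the separate Amazon EMR charge is left out.

```python
def hourly_cluster_cost(on_demand_price, core_nodes, task_nodes, spot_discount):
    """Estimate hourly EC2 cost with On-Demand core nodes and Spot task nodes.

    on_demand_price: hypothetical On-Demand price per instance-hour (USD).
    spot_discount:   assumed fraction off the On-Demand price for Spot (0.0 to 1.0).
    """
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    core_cost = core_nodes * on_demand_price
    task_cost = task_nodes * on_demand_price * (1.0 - spot_discount)
    return core_cost + task_cost


# Hypothetical numbers for illustration only, not real AWS prices.
PRICE = 0.50
all_on_demand = hourly_cluster_cost(PRICE, core_nodes=4, task_nodes=16, spot_discount=0.0)
with_spot = hourly_cluster_cost(PRICE, core_nodes=4, task_nodes=16, spot_discount=0.8)

print(f"All On-Demand:   ${all_on_demand:.2f}/hour")
print(f"Spot task nodes: ${with_spot:.2f}/hour")
```

Only the task nodes are discounted here, which mirrors the slide's split: core capacity stays at a predictable price while extra task capacity is added more cheaply.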
Instance fleets for advanced Spot provisioning
Master Node
Core Instance Fleet
Task Instance Fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the most optimal Availability Zone based on capacity/price
• Spot Block support
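The fleet layout above could be expressed as a request fragment along the following lines. This is a minimal sketch that only builds a Python structure shaped like the `InstanceFleets` field of an EMR cluster request (for example, as passed under `Instances` to boto3's `run_job_flow`); it makes no AWS call. The instance types, capacities, bid percentage, and timeout are assumptions chosen for illustration, not values from the talk.

```python
def build_instance_fleets(core_types, task_types, core_units, task_spot_units,
                          spot_block_minutes=None):
    """Build a list shaped like the InstanceFleets field of an EMR cluster request.

    All instance types and capacities are caller-supplied placeholders.
    """

    def type_configs(instance_types, use_spot):
        configs = []
        for instance_type in instance_types:
            config = {"InstanceType": instance_type, "WeightedCapacity": 1}
            if use_spot:
                # Bid up to the On-Demand price; an assumed choice for this sketch.
                config["BidPriceAsPercentageOfOnDemandPrice"] = 100.0
            configs.append(config)
        return configs

    spot_spec = {
        "TimeoutDurationMinutes": 20,            # assumed wait for Spot capacity
        "TimeoutAction": "SWITCH_TO_ON_DEMAND",  # fall back rather than fail
    }
    if spot_block_minutes is not None:
        # Optional Spot Block duration, as mentioned on the slide.
        spot_spec["BlockDurationMinutes"] = spot_block_minutes

    return [
        {
            "Name": "master",
            "InstanceFleetType": "MASTER",
            "TargetOnDemandCapacity": 1,
            "InstanceTypeConfigs": type_configs(core_types[:1], use_spot=False),
        },
        {
            "Name": "core",
            "InstanceFleetType": "CORE",
            "TargetOnDemandCapacity": core_units,
            "InstanceTypeConfigs": type_configs(core_types, use_spot=False),
        },
        {
            "Name": "task",
            "InstanceFleetType": "TASK",
            "TargetSpotCapacity": task_spot_units,
            "InstanceTypeConfigs": type_configs(task_types, use_spot=True),
            "LaunchSpecifications": {"SpotSpecification": spot_spec},
        },
    ]


# Example instance types are illustrative only.
fleets = build_instance_fleets(
    core_types=["m4.xlarge"],
    task_types=["m4.xlarge", "r4.xlarge", "c4.xlarge"],
    core_units=2,
    task_spot_units=8,
)
for fleet in fleets:
    print(fleet["InstanceFleetType"],
          [c["InstanceType"] for c in fleet["InstanceTypeConfigs"]])
```

Listing several instance types for the task fleet is what gives the service room to choose among them based on capacity and price.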
Lower costs with Auto Scaling
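The slide gives no policy details, so the following is only a sketch of what a scale-out rule might look like. It builds a dictionary shaped like an EMR automatic scaling policy for an instance group and makes no AWS call. The rule name, the 15% threshold, the cooldown, and the step size are all assumptions for illustration.

```python
def build_scale_out_policy(min_nodes, max_nodes, threshold_percent=15.0, add_nodes=2):
    """Build a dict shaped like an EMR AutoScalingPolicy with one scale-out rule.

    The rule adds nodes when available YARN memory drops below a threshold.
    All numeric defaults are illustrative assumptions.
    """
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    return {
        "Constraints": {"MinCapacity": min_nodes, "MaxCapacity": max_nodes},
        "Rules": [
            {
                "Name": "scale-out-on-low-yarn-memory",  # hypothetical rule name
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": add_nodes,
                        "CoolDown": 300,  # seconds between scaling activities
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": threshold_percent,
                        "Unit": "PERCENT",
                        "Dimensions": [
                            {"Key": "JobFlowId", "Value": "${emr.clusterId}"}
                        ],
                    }
                },
            }
        ],
    }


policy = build_scale_out_policy(min_nodes=2, max_nodes=20)
print(policy["Constraints"])
```

A matching scale-in rule (a negative `ScalingAdjustment` triggered when memory is plentiful) would normally accompany this one so that the cluster shrinks again when load drops, which is where the cost saving comes from.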
Security – Encryption
Security – Authentication and Authorization
Tag: user = MyUser
IAM user: MyUser
EMR role
EC2 role
SSH key
Application authN
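The pairing of a cluster tag (`user = MyUser`) with an IAM user of the same name suggests tag-based access control. The slide does not show a policy, so the sketch below is one possible reading: an IAM policy document, built as a Python dictionary, that allows a few EMR actions only on clusters whose `user` tag equals the caller's IAM user name. The choice of actions is an assumption for illustration.

```python
import json


def tag_scoped_emr_policy(tag_key="user"):
    """Build an IAM policy document limiting EMR actions to clusters tagged for the caller.

    The action list is an illustrative assumption, not taken from the talk.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:DescribeCluster",
                    "elasticmapreduce:ListSteps",
                    "elasticmapreduce:AddJobFlowSteps",
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        # The cluster's tag value must match the IAM user name.
                        f"elasticmapreduce:ResourceTag/{tag_key}": "${aws:username}"
                    }
                },
            }
        ],
    }


policy_document = tag_scoped_emr_policy()
print(json.dumps(policy_document, indent=2))
```

With a policy like this attached, IAM user `MyUser` could act on a cluster tagged `user = MyUser` but not on clusters tagged for other users.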
Learn
Models
Models
Impressions
Clicks
Activities
Calibrate
Evaluate
Real Time Bidding
S3
ETL Attribution
Machine Learning
S3
Amazon Kinesis
• 2 Petabytes Processed Daily
• 2 Million Bid Decisions Per Second
• Runs 24 X 7 on 5 Continents
• Thousands of ML Models Trained per Day
DataXu Workflow
CDN
Real Time Bidding
Retargeting Platform
Amazon Kinesis
Attribution & ML
S3
Reporting
Data Visualization
Data Pipeline
ETL (Spark SQL)
Event Data (Facts)
• Impressions
• Activities
• Attributions
Reference Data (Dimensions)
Application Logs
Exceptions Data
Reporting Data
Zeppelin notebooks
Architecture
RECOMMENDATION API (Python, R, Flask)
Zillow Group
Data Lake (S3 / Kinesis)
Property Featurization (Spark EMR)
User Profiles (Spark EMR)
Ranking (Spark EMR)
Wedge Counting
Collaborative Filtering (Spark EMR)
Property Aggregate Features (Spark EMR)
Data Collection Systems (Java/Python/SQL)
Training & scoring
Collect user behavior and real-estate data, train the various models, generate the candidate set, and make predictions.
User Behavior (Kinesis / S3)
Public Record (Kinesis / S3)
Event API (Java)
Producer (Python)
Filter (Spark)
User Store (Hive / S3)
Spark job creates Hive table with user events (uid, pid) partitioned by date
Active Listings (Kinesis / S3)
Producer (Python)
Training Data (Spark)
Training Set (Hive / S3)
pid -> uid reverse index
Past and current user events
Models (Python)
Train Models (Spark)
Score (Spark)
Recommendations
Property Data
Collaborative Filtering / User Profile Models
Hashmap (Redis)
Wedge features or property features (user profile)
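The pipeline above mentions user events stored as (uid, pid) pairs partitioned by date, and a pid -> uid reverse index in the training set. The slides do not show how either is built (the labels indicate Spark and Hive), so the following is only a small in-memory Python sketch of the two ideas. The sample events and function names are made up for illustration.

```python
from collections import defaultdict


def partition_by_date(events):
    """Group (date, uid, pid) events into per-date lists of (uid, pid) pairs."""
    partitions = defaultdict(list)
    for date, uid, pid in events:
        partitions[date].append((uid, pid))
    return dict(partitions)


def build_reverse_index(events):
    """Map each property ID (pid) to the set of user IDs (uid) seen with it."""
    index = defaultdict(set)
    for _date, uid, pid in events:
        index[pid].add(uid)
    return dict(index)


# Hypothetical sample events: (date, uid, pid).
events = [
    ("2017-06-05", "u1", "p10"),
    ("2017-06-05", "u2", "p10"),
    ("2017-06-06", "u1", "p20"),
    ("2017-06-06", "u1", "p10"),  # repeat interaction on a later date
]

partitions = partition_by_date(events)
reverse_index = build_reverse_index(events)

for pid in sorted(reverse_index):
    print(pid, sorted(reverse_index[pid]))
```

Using a set per property means a user who views the same property on several dates is counted once, which is usually what a "who interacted with this property" lookup needs.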
Ad hoc environment
Scale cluster to accommodate more users
dataset
Some of our customers running Spark on EMR
Internet of Things (IoT)
Thank you!
jonfritz@amazon.com
aws.amazon.com/emr
aws.amazon.com/blogs/big-data/
