Unified Data API for Distributed Cloud Analytics and AI

Unified Data Analytics and AI
Any Stack Any Cloud
Bin Fan (binfan@alluxio.com), Founding Engineer, VP of Open Source @ Alluxio

ALLUXIO 2
About Me
Bin Fan (https://www.linkedin.com/in/bin-fan/)
● Founding Engineer, VP Open Source @ Alluxio
● Alluxio PMC Co-Chair, Presto TSC/committer
● Email: binfan@alluxio.com
● PhD in CS @ Carnegie Mellon University

● Originally a research project (Tachyon) at the UC Berkeley AMPLab, led by then-PhD student
Haoyuan Li (Alluxio founder and CEO)
● Backed by top VCs (e.g., Andreessen Horowitz) with $70M raised in total, Series C ($50M)
announced in 2021
● Deployed in production at large scale at Facebook, Uber, Microsoft, Tencent, TikTok, and others
● More than 1,200 contributors on GitHub. In 2021, more than 40% of commits on GitHub were
contributed by community users
● The 9th most critical Java-based open-source project on GitHub, as ranked by Google/OpenSSF[1]
Alluxio Overview
[1] Google Comes Up With A Metric For Gauging Critical Open-Source Projects

Alluxio (Tachyon) back in 2015
Screenshot of Tachyon talk at AMPLab back in 2015
What is Tachyon Stack Release Growth

Alluxio (Tachyon) in 2015
[Diagram: Spark Task 1 and Spark Task 2 reading RDD blocks 1 to 4; labels: Tachyon (in-memory), HDFS (disk), HDFS / Amazon S3]

Topology
● On-prem Hadoop → Cloud-native, Multi- or Hybrid-cloud,
Multi-datacenter
Computation
● MR/Spark → Spark, Presto, Hive, TensorFlow, PyTorch, ...
● More mature frameworks (e.g., less frequent OOM)
Data access pattern
● Sequential reads (e.g., scanning) of unstructured files → Ad-hoc
reads of structured/columnar data
● Hundreds to thousands of big files → millions of small files
What's Different Today

Data Storage
● On-prem & colocated HDFS → S3 !!! and other object stores
(possibly across regions like us-east & us-west),
and legacy on-prem HDFS still in service
Resource/Job Orchestration
● YARN → K8s
○ Lost focus on data locality
The Evolution from Hadoop to Cloud-native Era

Unprecedented Complexity of Data Platforms
Data Trend → Complex Platform
● New compute and storage tech created every 3-8 years → On-premise, cloud, hybrid, multi-cloud environments all have different environment properties
● More data generated every day, and stored in data silos → Data copies, synchronization costs
● More people and teams need to access and leverage these data → Multiple APIs necessitate integration and application rewrites

Inefficient Manual Copy Across Data Centers, Regions, Clouds
[Diagram: manual data copies between private data centers (Datacenter 1, Datacenter 2) and cloud Region A and Region B; labels: Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, Hive]
Error-prone and network-intensive data copies

Strong Market Demand For Simplification
● UNIFICATION OF DATA LAKES: Serve analytics & AI from multiple data locations
● ENVIRONMENT AGNOSTICITY: Agility across regions for private, hybrid or multi-cloud
● EFFICIENT ACCESS & DATA MANAGEMENT: Acceleration & auto-tiering of remote data sources

Analytics & AI
in the Hybrid & Multi-Cloud Era
Available:

Solution: Multi-Cloud Ready Analytics & AI Platform
● No-copy data access across silos, agnostic to compute engine
● Foundation of a heterogeneous data platform across geos
[Diagram: Region A, Region B, Datacenter 1, and Datacenter 2, with GKE and HMS]

INTERNET
PUBLIC CLOUD PROVIDERS
GENERAL
E-COMMERCE
OTHERS
TECHNOLOGY FINANCIAL SERVICES
TELCO & MEDIA
Companies Using Alluxio

https://aws.amazon.com/big-data/datalakes-and-analytics/modern-data-architecture/
Modern Data Architecture on AWS

Examples to eliminate data copies
Case Studies

Expedia: Unify Data Lakes Across Multiple Geographic Regions in the Cloud
Problems Encountered
● Data silos for different brands ingesting data across multiple regions in AWS
● Central analytics query across data silos suffered from poor UX and long time to insight
● Manual replication resulted in operational inefficiency and expensive network egress
Alluxio's Solution
● Federate data lakes w/o replication & serve various compute engines
Results Achieved
● Unify data silos without the need to copy or move data
● Enhanced UX with consistent & high performance analytics, reducing time to insights
● 50% reduced cost per query

Data Replication for Cross-region Data Access
[Diagram: Brand A, Brand B, Brand C, and the Main Data Lake across US-WEST-1, US-WEST-2, US-EAST-1, and US-EAST-2, linked by data replication; labels: Hive]

[Diagram: Data Lakes A, B, C, and D and the Main Data Lake across US-WEST-2 and US-EAST-1; CircusTrain jobs feed Replicated Data Lakes; labels: Hive]

Alluxio for Cross-region Data Access
[Diagram: Brand A, Brand B, Brand C, and the Main Data Lake across US-WEST-1, US-WEST-2, US-EAST-1, and US-EAST-2, linked by MOUNT instead of replication; labels: Hive]

[Diagram: Data Lakes A, B, C, and D and the Main Data Lake across US-WEST-2 and US-EAST-1, with no replicated data lakes; labels: Hive]

Object Redirection with Waggle Dance
[Diagram: SQL query from Hive; regions us-west-1, us-west-2, us-east-1; Main Data Lake]
Conversion:
● If local S3, s3://
● If cross-region S3, alluxio://
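
The conversion rule on this slide can be sketched as a small path-rewriting function. This is only an illustration of the idea, not Waggle Dance's or Alluxio's actual implementation: the function name `rewrite_location`, the bucket names, the bucket-to-region map, the Alluxio authority `alluxio-master:19998`, and the assumption that each bucket is mounted under `/<bucket>` are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical lookup of the AWS region that hosts each bucket.
BUCKET_REGIONS = {
    "main-data-lake": "us-east-1",
    "brand-a-lake": "us-west-2",
}

def rewrite_location(location: str, local_region: str,
                     alluxio_authority: str = "alluxio-master:19998") -> str:
    """Keep s3:// for same-region buckets; send cross-region buckets via alluxio://."""
    parsed = urlparse(location)
    if parsed.scheme != "s3":
        return location  # only S3 locations are rewritten
    bucket_region = BUCKET_REGIONS.get(parsed.netloc)
    if bucket_region is None or bucket_region == local_region:
        return location  # local (or unknown) bucket: read S3 directly
    # Cross-region: assume the bucket is mounted in Alluxio under /<bucket>.
    return f"alluxio://{alluxio_authority}/{parsed.netloc}{parsed.path}"
```

With this sketch, a query running in us-east-1 keeps reading `s3://main-data-lake/...` directly, while a table stored in `brand-a-lake` (us-west-2 in this made-up map) is read through the `alluxio://` path and served from cache after the first access.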

Enable a Hybrid Data Lake
Architecture Overview

ARCHITECTURE
[Diagram: Alluxio Clients reach the Alluxio Master and Standby Master (kept consistent through consensus) over the control path, and Alluxio Workers (RAM / SSD / HDD) over the data path; workers connect across the WAN to S3 region us-east-1 and S3 region us-west-1]

DATA LOCALITY WITH SCALE-OUT WORKERS
Local performance for remote data with intelligent multi-tiering
● Tiers: RAM, SSD, HDD
● Read & write buffering
● Transparent to app
● Policies for pinning, promotion/demotion, TTL
[Diagram: Big Data ETL and Big Data Query on AWS EC2, with data in AWS S3]
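
The pinning, promotion/demotion, and TTL policies named on this slide can be illustrated with a toy multi-tier cache. This is a minimal sketch under simplifying assumptions (capacities counted in blocks, least-recently-used demotion, one TTL for all blocks); the class `TieredCache` and its methods are hypothetical and do not reflect Alluxio's worker implementation.

```python
import time
from collections import OrderedDict

class TieredCache:
    """Toy block cache with ordered tiers, fastest first (e.g. RAM, SSD, HDD)."""

    def __init__(self, capacities, ttl_seconds=None, clock=time.monotonic):
        # capacities: dict such as {"RAM": 2, "SSD": 4, "HDD": 8}, in blocks.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities.items()]
        self.ttl = ttl_seconds
        self.clock = clock
        self.pinned = set()

    def pin(self, key):
        """Pinned blocks are skipped when choosing a block to demote."""
        self.pinned.add(key)

    def put(self, key, value):
        for _, _, blocks in self.tiers:
            blocks.pop(key, None)
        self._insert(0, key, (value, self.clock()))

    def get(self, key):
        for _, _, blocks in self.tiers:
            if key in blocks:
                value, created = blocks.pop(key)
                if self.ttl is not None and self.clock() - created > self.ttl:
                    return None  # TTL expired: drop the block, report a miss
                self._insert(0, key, (value, created))  # promote to top tier
                return value
        return None

    def tier_of(self, key):
        for name, _, blocks in self.tiers:
            if key in blocks:
                return name
        return None

    def _insert(self, level, key, entry):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: evicted from the cache
        _, cap, blocks = self.tiers[level]
        blocks[key] = entry
        if len(blocks) > cap:
            # Demote the least recently used unpinned block to the next tier.
            victim = next((k for k in blocks if k not in self.pinned), None)
            if victim is None:
                victim = next(iter(blocks))  # everything pinned: demote oldest
            self._insert(level + 1, victim, blocks.pop(victim))
```

A read promotes the block to the fastest tier, which in turn pushes colder blocks toward SSD and HDD; the application only ever calls `get` and `put`, which is the sense in which tiering stays transparent to it.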

Synchronization of changes across clusters
Alluxio Master
Policies for pinning,
promotion/demotion, TTL
Metadata Synchronization
AWS S3
AWS EC2
Big Data ETL
Big Data Query
RAM SSD
METADATA LOCALITY WITH SCALABLE MASTERS
RocksDB

Spark
Alluxio
S3
Co-locate Alluxio Workers with compute for
optimal I/O performance
Remote cluster
Same cluster
Spark
Alluxio
S3
Deploy Alluxio as a standalone cluster
between compute and storage
Remote cluster
Same data center / region
Presto
Long-running Instances Ephemeral Elastic
DEPLOYMENT APPROACHES

UNIFIED NAMESPACE
With Replication & Live Data Migration Capabilities
• Single Alluxio path backed by multiple S3 regions
• Example policy: migrate data older than 7 days from S3 region us-west-1 to S3 region us-east-1
[Diagram: alluxio://host:port/ exposes Data (Reports, Sales) and Users (Alice, Bob); Users is backed by s3://bucket/ in S3 region us-east-1, and Reports and Sales by s3://bucket in S3 region us-west-1]
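
The example policy above can be sketched as a filter over file metadata. This is a toy model, not Alluxio's policy engine: `FileInfo`, `files_to_migrate`, and `migrate` are hypothetical names, and "migration" here only flips a region field to show that the path in the unified namespace does not change.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class FileInfo:
    path: str                # path in the unified namespace, e.g. /Data/Reports/q1.csv
    last_modified: datetime
    region: str              # S3 region currently backing the file

def files_to_migrate(files, now, source="us-west-1", max_age=timedelta(days=7)):
    """Select files in the source region that are older than max_age."""
    return [f for f in files
            if f.region == source and now - f.last_modified > max_age]

def migrate(files, now, source="us-west-1", target="us-east-1"):
    """Apply the policy: only the backing region changes, never the path."""
    for f in files_to_migrate(files, now, source):
        f.region = target
    return files
```

Because clients address files by their Alluxio path, a job reading `/Data/Reports/q1.csv` before and after the migration would use the same path in this model.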

Training & Data Pre-processing
ML/DL

I/O Challenges in ML/DL
What's Different
● Training data often consists of a massive amount of small files (billions of 100KB photos)
● Size of training data keeps growing & can exceed individual server capacity
● Training jobs are highly concurrent, require high I/O to keep GPU utilized

Using Alluxio for DL
Alluxio
Server
Alluxio
Server ...
Training Instances
POSIX POSIX POSIX
- Only fetch data on cache miss
- No need to copy data before use
Distributed Caching
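
The "only fetch data on cache miss" behavior is a read-through cache. A minimal single-process sketch, assuming a caller-supplied `fetch_remote` function that stands in for the remote store; `ReadThroughCache` and `remote_reads` are hypothetical names, and a real deployment would spread the cache across Alluxio servers and expose it through POSIX.

```python
class ReadThroughCache:
    """Serve reads from the cache; go to remote storage only on a miss."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # callable: path -> bytes
        self._cache = {}
        self.remote_reads = 0              # counts trips to remote storage

    def read(self, path):
        if path not in self._cache:
            # Cache miss: fetch once from remote storage and keep a copy.
            self._cache[path] = self._fetch_remote(path)
            self.remote_reads += 1
        return self._cache[path]
```

In a training loop that reads the same files every epoch, only the first epoch would reach remote storage in this model, and no upfront copy of the dataset is needed.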

Using Alluxio for DL: Distributed Caching
● Consistent performance
● Direct access to data
● Low latency and high throughput
● High GPU utilization rate

MOMO (NASDAQ: MOMO)
runs thousands of Alluxio nodes across multiple Alluxio clusters,
managing more than 100 TB of data for search and training:
● Support multiple storage & compute frameworks.
● Accelerate compute & training tasks
● Reduce the metadata and data overhead
Model Training using PyTorch + Alluxio + Ceph
● 2 billion small files
● Reduce metadata & data interactions with Ceph to improve performance
https://www.alluxio.io/resources/videos/ml-and-query-acceleration-at-momo-with-alluxio-chinese/
Large Scale Deep Learning
TOPOLOGY: ON-PREMISES
Alluxio’s Solution

Q&A
Website: www.alluxio.io
Slack: https://alluxio.io/slack
Social Media: Twitter.com/alluxio, Linkedin.com/alluxio
