Unified Data Analytics and AI
Any Stack Any Cloud
Bin Fan (binfan@alluxio.com), Founding Engineer, VP of Open Source @ Alluxio
ALLUXIO 2
About Me
Bin Fan (https://www.linkedin.com/in/bin-fan/)
● Founding Engineer, VP Open Source @ Alluxio
● Alluxio PMC Co-Chair, Presto TSC/committer
● Email: binfan@alluxio.com
● PhD in CS @ Carnegie Mellon University
Alluxio Overview
● Originally a research project (Tachyon) at the UC Berkeley AMPLab, led by then-PhD student Haoyuan Li (Alluxio founder and CEO)
● Backed by top VCs (e.g., Andreessen Horowitz) with $70M raised in total; Series C ($50M) announced in 2021
● Deployed in production at large scale at Facebook, Uber, Microsoft, Tencent, TikTok, and others
● More than 1200 contributors on GitHub. In 2021, more than 40% of commits on GitHub were contributed by community users
● Ranked the 9th most critical Java-based open-source project on GitHub by Google/OpenSSF [1]
[1] Google Comes Up With A Metric For Gauging Critical Open-Source Projects
Alluxio (Tachyon) back in 2015
Screenshot of Tachyon talk at AMPLab back in 2015
What is Tachyon Stack Release Growth
Screenshot of the Tachyon talk at an AMPLab event
Alluxio (Tachyon) in 2015
[Diagram: Spark Task 1 and Spark Task 2 share an RDD through Tachyon (in-memory), which sits on top of HDFS / Amazon S3; HDFS holds blocks 1-4 on disk]
What's Different Today
Topology
● On-prem Hadoop → Cloud-native, multi- or hybrid-cloud, multi-datacenter
Computation
● MR/Spark → Spark, Presto, Hive, TensorFlow, PyTorch, ...
● More mature frameworks (less frequent OOM, etc.)
Data access pattern
● Sequential reads (e.g., scanning) on unstructured files → ad-hoc reads into structured/columnar data
● Hundreds to thousands of big files → millions of small files
The Evolution from Hadoop to Cloud-native Era
Data Storage
● On-prem, colocated HDFS → S3 and other object stores (possibly across regions such as us-east and us-west), with legacy on-prem HDFS still in service
Resource/Job Orchestration
● YARN → K8s
○ Lost focus on data locality
Unprecedented Complexity of Data Platforms
Data Trend | Complex Platform
● New compute and storage tech created every 3-8 years | On-premise, cloud, hybrid, and multi-cloud environments all have different environment properties
● More data generated every day, and stored in data silos | Data copies, synchronization costs
● More people and teams need to access and leverage this data | Multiple APIs necessitate integration and application rewrites
Inefficient Manual Copy Across Data Centers, Regions, Clouds
[Diagram: Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, and Hive spread across Region A, Region B, and private data centers (Datacenter 1, Datacenter 2), linked by error-prone and network-intensive data copies]
Strong Market Demand For Simplification
● EFFICIENT ACCESS & DATA MANAGEMENT: Acceleration & auto-tiering of remote data sources
● ENVIRONMENT AGNOSTICITY: Agility across regions for private, hybrid or multi-cloud
● UNIFICATION OF DATA LAKES: Serve analytics & AI from multiple data locations
Analytics & AI
in the Hybrid & Multi-Cloud Era
Available:
Solution: Multi-Cloud Ready Analytics & AI Platform
● No-copy data access across silos, agnostic to compute engine
● Foundation of a heterogeneous data platform across geos
[Diagram: Region A, Region B, GKE, HMS, Datacenter 1, Datacenter 2]
Companies Using Alluxio
[Logo wall grouped by sector: Internet, Public Cloud Providers, General, E-commerce, Technology, Financial Services, Telco & Media, Others]
https://aws.amazon.com/big-data/datalakes-and-analytics/modern-data-architecture/
Modern Data Architecture on AWS
Case Studies
Examples of eliminating data copies
Expedia: Unify Data Lakes Across Multiple Geographic Regions in the Cloud
Problems Encountered
● Data silos for different brands ingesting data across multiple regions in AWS
● Central analytics queries across data silos suffered from poor UX and long time to insight
● Manual replication resulted in operational inefficiency and expensive network egress
Alluxio’s Solution
● Unify data silos without the need to copy or move data
● Federate data lakes without replication and serve various compute engines
Results Achieved
● Enhanced UX with consistent, high-performance analytics, reducing time to insight
● 50% reduced cost per query
Data Replication for Cross-region Data Access
[Diagram 1: Brand A, Brand B, and Brand C data lakes and the MAIN DATA LAKE across US-WEST-1, US-EAST-1, US-EAST-2, and US-WEST-2; Hive access depends on DATA REPLICATION between regions]
[Diagram 2: Data Lakes A, B, C, and D and the Main Data Lake in US-WEST-2 and US-EAST-1, with CircusTrain jobs producing Replicated Data Lakes that are queried through Hive]
Alluxio for Cross-region Data Access
[Diagram 1: Brand A, Brand B, and Brand C data lakes and the MAIN DATA LAKE across US-WEST-1, US-EAST-1, US-EAST-2, and US-WEST-2; Hive reaches them through an Alluxio MOUNT]
[Diagram 2: Data Lakes A, B, C, and D and the Main Data Lake in US-WEST-2 and US-EAST-1, queried through Hive]
Object Redirection with Waggle Dance
● SQL query → conversion of the table location
○ If local S3, s3://
○ If cross-region S3, alluxio://
[Diagram: Hive in us-east-1; MAIN DATA LAKE; us-west-1 and us-west-2]
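The redirection rule on this slide (keep s3:// for local S3, switch to alluxio:// for cross-region S3) can be sketched in a few lines. This is only an illustration, not Waggle Dance or Expedia code: the function name, the region arguments, the `alluxio-master:19998` authority, and the assumption that each bucket is mounted under `/<bucket>` in the Alluxio namespace are all hypothetical.

```python
from urllib.parse import urlparse


def rewrite_location(location, bucket_region, local_region, alluxio_authority):
    """Pick the URI a query engine should read for a table location.

    Hypothetical helper: local-region S3 is read directly, while
    cross-region S3 is redirected to Alluxio. Assumes the bucket is
    mounted at /<bucket> in the Alluxio namespace.
    """
    parsed = urlparse(location)
    if parsed.scheme != "s3":
        return location  # leave non-S3 locations untouched
    if bucket_region == local_region:
        return location  # local S3: keep s3://
    # Cross-region S3: same bucket and key, served through Alluxio
    return f"alluxio://{alluxio_authority}/{parsed.netloc}{parsed.path}"


local = rewrite_location(
    "s3://main-lake/sales/2021", "us-east-1", "us-east-1", "alluxio-master:19998"
)
remote = rewrite_location(
    "s3://main-lake/sales/2021", "us-west-2", "us-east-1", "alluxio-master:19998"
)
```

Here `local` stays an s3:// URI and `remote` becomes an alluxio:// URI with the same bucket and key, so the query itself does not need to change.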
Architecture Overview
Enable a Hybrid Data Lake
ARCHITECTURE
[Diagram: Alluxio Clients reach the Alluxio Master (kept in consensus with Standby Masters) over the control path and Alluxio Workers (RAM / SSD / HDD) over the data path; the workers access S3 region us-east-1 and S3 region us-west-1 across the WAN]
DATA LOCALITY WITH SCALE-OUT WORKERS
Local performance for remote data with intelligent multi-tiering
● Tiers: RAM, SSD, HDD
● Read & write buffering
● Transparent to the application
● Policies for pinning, promotion/demotion, TTL
[Diagram: Big Data ETL and Big Data Query on AWS EC2, with data in AWS S3]
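To make the pinning, promotion/demotion, and TTL policies concrete, here is a toy multi-tier cache. It is a sketch under simplifying assumptions, not how Alluxio workers are implemented: the `TieredCache` class, its per-tier item-count capacities, and the oldest-first eviction order are all hypothetical choices made for illustration.

```python
import time
from collections import OrderedDict


class TieredCache:
    """Toy tiered cache; tier 0 is the fastest (e.g. RAM), the last the slowest."""

    def __init__(self, tier_names, capacities, ttl_seconds=None, clock=time.monotonic):
        self.tier_names = list(tier_names)
        self.capacities = list(capacities)  # max number of items per tier
        self.tiers = [OrderedDict() for _ in self.tier_names]  # key -> (value, stored_at)
        self.pinned = set()
        self.ttl = ttl_seconds
        self.clock = clock

    def pin(self, key):
        """Pinned keys are never demoted and never expire."""
        self.pinned.add(key)

    def put(self, key, value, tier=0, stored_at=None):
        for entries in self.tiers:
            entries.pop(key, None)  # keep at most one copy across tiers
        stored_at = self.clock() if stored_at is None else stored_at
        self.tiers[tier][key] = (value, stored_at)
        self._evict(tier)

    def _evict(self, tier):
        entries = self.tiers[tier]
        while len(entries) > self.capacities[tier]:
            victim = next((k for k in entries if k not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; this toy simply allows overflow
            value, stored_at = entries.pop(victim)
            if tier + 1 < len(self.tiers):
                self.tiers[tier + 1][victim] = (value, stored_at)  # demotion
                self._evict(tier + 1)
            # otherwise the item falls out of the slowest tier and is dropped

    def get(self, key):
        for entries in self.tiers:
            if key in entries:
                value, stored_at = entries.pop(key)
                expired = self.ttl is not None and self.clock() - stored_at > self.ttl
                if expired and key not in self.pinned:
                    return None  # TTL elapsed: treat as a miss
                self.put(key, value, tier=0, stored_at=stored_at)  # promotion
                return value
        return None

    def tier_of(self, key):
        for name, entries in zip(self.tier_names, self.tiers):
            if key in entries:
                return name
        return None


cache = TieredCache(["RAM", "SSD", "HDD"], [1, 1, 1])
for name in ("a", "b", "c"):
    cache.put(name, name.upper())  # each insert demotes the previous item
cache.get("a")  # "a" had sunk to HDD; reading it promotes it back to RAM
```

After the final read, "a" sits in RAM while "c" and "b" have been pushed down to SSD and HDD, which is the promotion/demotion behavior the slide refers to.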
METADATA LOCALITY WITH SCALABLE MASTERS
Synchronization of changes across clusters
● Policies for pinning, promotion/demotion, TTL
● Metadata synchronization
[Diagram: Alluxio Master (RAM, SSD, RocksDB) synchronizing metadata with AWS S3 for Big Data ETL and Big Data Query on AWS EC2]
DEPLOYMENT APPROACHES
● Co-locate Alluxio Workers with compute for optimal I/O performance (Spark and Alluxio in the same cluster; S3 in a remote cluster)
● Deploy Alluxio as a standalone cluster between compute and storage (Alluxio in the same data center / region as Spark and Presto; S3 in a remote cluster)
Diagram labels: Long-running Instances, Ephemeral, Elastic
UNIFIED NAMESPACE
With Replication & Live Data Migration Capabilities
• Single Alluxio path backed by multiple S3 regions
• Example policy: Migrate data older than 7 days from S3 region us-west-1 to S3 region us-east-1
[Diagram: alluxio://host:port/ exposes Data (Reports, Sales) and Users (Alice, Bob), backed by s3://bucket/ in S3 region us-east-1 and s3://bucket in S3 region us-west-1]
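The example policy (migrate data older than 7 days from us-west-1 to us-east-1) can be expressed as a small planning function. This is a sketch of the idea only, not Alluxio's policy engine or its configuration syntax: the bucket URIs, `plan_migration`, and `apply_moves` are hypothetical names, and file age is assumed to be measured from a last-modified timestamp.

```python
from datetime import datetime, timedelta

# Hypothetical bucket URIs; the slide only shows "s3://bucket" in each region.
WEST_UFS = "s3://bucket-us-west-1"
EAST_UFS = "s3://bucket-us-east-1"
MAX_AGE = timedelta(days=7)


def plan_migration(files, now, max_age=MAX_AGE):
    """Return (alluxio_path, source, target) moves for files older than max_age.

    files maps an Alluxio path to (current_backing_store, last_modified).
    """
    moves = []
    for path, (ufs, last_modified) in sorted(files.items()):
        if ufs == WEST_UFS and now - last_modified > max_age:
            moves.append((path, WEST_UFS, EAST_UFS))
    return moves


def apply_moves(files, moves):
    """Update backing stores; the Alluxio paths that users see do not change."""
    updated = dict(files)
    for path, _source, target in moves:
        _ufs, last_modified = updated[path]
        updated[path] = (target, last_modified)
    return updated


now = datetime(2021, 6, 15)
files = {
    "/Data/Reports/q1.csv": (WEST_UFS, datetime(2021, 6, 1)),    # 14 days old
    "/Data/Sales/today.csv": (WEST_UFS, datetime(2021, 6, 14)),  # 1 day old
}
moves = plan_migration(files, now)
after = apply_moves(files, moves)
```

Only the 14-day-old report is moved to the us-east-1 bucket, and both files keep the same Alluxio path, which is the point of a single namespace over multiple regions.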
ML/DL
Training & Data Pre-processing
I/O Challenges in ML/DL: What's Different
● Training data often consists of a massive number of small files (billions of 100 KB photos)
● The size of training data keeps growing and can exceed individual server capacity
● Training jobs are highly concurrent and require high I/O throughput to keep GPUs utilized
Using Alluxio for DL: Distributed Caching
- Only fetch data on cache miss
- No need to copy data before use
[Diagram: Training instances access Alluxio servers through POSIX]
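The "only fetch on cache miss" behavior is a read-through cache: the first read of a file goes to remote storage, and later reads are served from the cache. The sketch below is a single-process illustration, not Alluxio's client or POSIX interface; `ReadThroughCache`, `fetch_from_storage`, and the file paths are hypothetical.

```python
class ReadThroughCache:
    """Minimal read-through cache: storage is contacted only on a miss."""

    def __init__(self, fetch_from_storage):
        self._fetch = fetch_from_storage
        self._cache = {}
        self.misses = 0

    def read(self, path):
        if path not in self._cache:
            self.misses += 1
            self._cache[path] = self._fetch(path)  # fetched once, on first use
        return self._cache[path]


storage_reads = []


def fetch_from_storage(path):
    """Stand-in for a slow remote read (e.g. from an object store)."""
    storage_reads.append(path)
    return b"bytes of " + path.encode()


cache = ReadThroughCache(fetch_from_storage)
dataset = [f"/train/img_{i}.jpg" for i in range(3)]
for _epoch in range(2):  # two passes over the same training files
    for path in dataset:
        cache.read(path)
```

Across two epochs, each file is fetched from storage once and served from the cache the second time, with no up-front copy of the dataset.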
Using Alluxio for DL: Distributed Caching
● Consistent performance
● Direct access to data
● Low latency and high throughput
● High GPU utilization rate
Large Scale Deep Learning
TOPOLOGY: ON-PREMISES
Alluxio’s Solution
Momo (NASDAQ: MOMO) runs thousands of Alluxio nodes across multiple Alluxio clusters, managing more than 100 TB of data for search and training:
● Support multiple storage & compute frameworks
● Accelerate compute & training tasks
● Reduce the metadata and data overhead
Model Training using PyTorch + Alluxio + Ceph
● 2 billion small files
● Reduce metadata & data interactions with Ceph to improve performance
https://www.alluxio.io/resources/videos/ml-and-query-acceleration-at-momo-with-alluxio-chinese/
Q&A
Website: www.alluxio.io
Slack: https://alluxio.io/slack
Social Media: Twitter.com/alluxio, Linkedin.com/alluxio

Unified Data API for Distributed Cloud Analytics and AI
