Apache Pulsar Overview

Apache Pulsar: Next Generation
Cloud-Native Messaging & Streaming
January 2019

Increasingly connected world
• Internet of Things: 30 B connected devices by 2020
• Health Care: 153 Exabytes (2013) -> 2314 Exabytes (2020)
• Machine Data: 40% of digital universe by 2020
• Connected Vehicles: data transferred per vehicle per month, 4 TB -> 10 TB
• Digital Assistants (Predictive Analytics): $2B (2012) -> $6.5B (2019) [1], Siri/Cortana/Google Now
• Augmented/Virtual Reality: $150B by 2020 [2], Oculus/HoloLens/Magic Leap

Fast data processing
• Events are analyzed and processed as they arrive
• Decisions are timely, contextual and based on fresh data
• Decision latency is eliminated
• Data in motion
Ingest/Buffer -> Analyze -> Act

Elements of stream processing
Messaging, Compute, Storage
Data Ingestion, Data Processing, Data Storage, Results Storage, Data Serving

Apache Pulsar
Flexible Messaging + Streaming System
backed by durable log storage

Apache Pulsar: Messaging + Storage

Apache Pulsar: Tenants, namespaces, topics
Apache Pulsar Cluster
Tenants: Product Safety, Marketing, Data Serving
Namespaces: ETL, Fraud Detection, Campaigns, ETL, Microservice
Topics: Topic-1 Account History, Topic-2 User Clustering, Topic-1 Risk Classification, Topic-1 Budgeted Spend, Topic-2 Demographic Classification, Topic-1 Location Resolution, Topic-1 Customer Authentication
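
In Pulsar this hierarchy is visible in the topic name itself, which takes the form persistent://tenant/namespace/topic. A minimal sketch of building and splitting such a name in plain Java follows. TopicNameSketch is a hypothetical helper written for this illustration (it is not the Pulsar client's own class), and the lowercase tenant and namespace strings are made up from the slide labels.

```java
// Hypothetical helper for illustration; not part of the Pulsar client API.
final class TopicNameSketch {
    final String tenant;
    final String namespace;
    final String topic;

    TopicNameSketch(String tenant, String namespace, String topic) {
        this.tenant = tenant;
        this.namespace = namespace;
        this.topic = topic;
    }

    // Render the name as persistent://tenant/namespace/topic.
    String toUrl() {
        return "persistent://" + tenant + "/" + namespace + "/" + topic;
    }

    // Split a fully qualified topic name back into its three parts.
    static TopicNameSketch parse(String url) {
        String prefix = "persistent://";
        if (!url.startsWith(prefix)) {
            throw new IllegalArgumentException("expected " + prefix + " prefix: " + url);
        }
        String[] parts = url.substring(prefix.length()).split("/", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected tenant/namespace/topic: " + url);
        }
        return new TopicNameSketch(parts[0], parts[1], parts[2]);
    }
}
```

Because the tenant and namespace are part of every topic name, two namespaces can each hold a "Topic-1" without clashing, as in the diagram.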

Apache Pulsar: Topics
Topic
Producers
Consumers
Time

Apache Pulsar: Topic partitions
Producers
Topic - P0
Topic - P1
Topic - P2
Consumers
Time
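
One common way to spread keyed messages over partitions P0 to P2 is to hash the message key and take it modulo the partition count, so that a given key always lands on the same partition. The sketch below assumes Java's String.hashCode as the hash function; the class and method names are hypothetical, and a real client may use a different hashing scheme.

```java
// Hypothetical illustration of key-based partition routing.
final class PartitionRouterSketch {
    // Map a message key to one of numPartitions partitions.
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

For example, partitionFor("a", 3) returns 1, because "a".hashCode() is 97 and 97 mod 3 is 1.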

Apache Pulsar: Segments
Producers
P0: Segment 1, Segment 2, Segment 3
P1: Segment 1, Segment 2, Segment 3, Segment 4
P2: Segment 1, Segment 2, Segment 3
Consumers
Time

Apache Pulsar
Bookie Bookie Bookie
Broker Broker Broker
Producer Consumer
• Layered Architecture
• Independent Scalability 
• Fault Tolerance 
• Instant Scalability

Apache Pulsar: Segment-centric storage
• Logical Partition
• Partition divided into Segments
• Size-based & Time-based
• Uniformly distributed across the cluster
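
The size-based and time-based rollover can be pictured with a toy model. The sketch below is plain Java and does not use BookKeeper; the class name, the entry-count limit and the age limit are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a partition that rolls over to a new segment by size or age.
final class SegmentedLogSketch {
    private final long maxEntries;    // size-based limit (entries per segment)
    private final long maxAgeMillis;  // time-based limit (age of the open segment)
    private final List<List<String>> segments = new ArrayList<>();
    private long currentOpenedAt;

    SegmentedLogSketch(long maxEntries, long maxAgeMillis, long nowMillis) {
        this.maxEntries = maxEntries;
        this.maxAgeMillis = maxAgeMillis;
        openSegment(nowMillis);
    }

    private void openSegment(long nowMillis) {
        segments.add(new ArrayList<>());
        currentOpenedAt = nowMillis;
    }

    // Append an entry, closing the current segment first if it is full or too old.
    void append(String entry, long nowMillis) {
        List<String> current = segments.get(segments.size() - 1);
        boolean full = current.size() >= maxEntries;
        boolean expired = !current.isEmpty() && nowMillis - currentOpenedAt >= maxAgeMillis;
        if (full || expired) {
            openSegment(nowMillis);
            current = segments.get(segments.size() - 1);
        }
        current.add(entry);
    }

    int segmentCount() {
        return segments.size();
    }
}
```

Each closed segment is immutable, which is what allows segments of one partition to be placed on different storage nodes independently.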

Apache Pulsar: Broker
• Broker is the only point of interaction for clients (producers and consumers)
• Brokers acquire ownership of a group of topics and “serve” them
• Broker has no durable state
• Provides a service discovery mechanism for clients to connect to the right broker

Apache Pulsar: Broker failure recovery
• Topic is reassigned to an available broker based on load
• Can reconstruct the previous state consistently
• No data needs to be copied
• Failover handled transparently by client library
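
Because brokers hold no durable state, failover amounts to changing which broker owns a topic. The sketch below is a toy model of that idea in plain Java; the class, its methods and the "number of owned topics" load metric are assumptions for illustration, not Pulsar's actual load manager.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy model: failover is only a change of topic ownership, no data is copied.
final class OwnershipSketch {
    private final Map<String, String> ownerByTopic = new HashMap<>();
    private final Map<String, Integer> loadByBroker = new TreeMap<>();

    void addBroker(String broker) {
        loadByBroker.putIfAbsent(broker, 0);
    }

    // Assign a topic to the broker that currently owns the fewest topics.
    String assign(String topic) {
        String best = null;
        for (Map.Entry<String, Integer> e : loadByBroker.entrySet()) {
            if (best == null || e.getValue() < loadByBroker.get(best)) {
                best = e.getKey();
            }
        }
        if (best == null) {
            throw new IllegalStateException("no brokers available");
        }
        ownerByTopic.put(topic, best);
        loadByBroker.merge(best, 1, Integer::sum);
        return best;
    }

    // Remove a failed broker and reassign its topics to the remaining brokers.
    void fail(String broker) {
        loadByBroker.remove(broker);
        for (Map.Entry<String, String> e : new HashMap<>(ownerByTopic).entrySet()) {
            if (e.getValue().equals(broker)) {
                assign(e.getKey());
            }
        }
    }

    String ownerOf(String topic) {
        return ownerByTopic.get(topic);
    }
}
```

Note that fail() touches only the ownership map: the messages themselves live in the storage layer and stay where they are.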

Apache Pulsar: Bookie failure recovery
• After a write failure, BookKeeper immediately switches writes to a new bookie, within the same segment.
• As long as we have any 3 bookies in the cluster, we can continue to write

Apache Pulsar: Bookie failure recovery
In the background, BookKeeper starts a many-to-many recovery process to regain the configured replication factor

Apache Pulsar: Seamless cluster expansion
[Diagram: log entries 1, 2, 3, 4 … 20, 21, 22, 23 … 40, 41, 42, 43 … 60, 61, 62, 63 … grouped into Segments 1 to 4, with each segment stored on three nodes; new Segments X, Y and Z are placed on the added nodes]

Apache Pulsar: Tiered storage
[Diagram: the same log, entries 1, 2, 3, 4 … 20, 21, 22, 23 … 40, 41, 42, 43 … 60, 61, 62, 63 …; Segments 1 and 2 are held once in Low Cost Storage, while Segments 3 and 4 remain replicated on three nodes]

Partitions vs segments - why should you care?
Legacy Architectures
● Storage co-resident with processing
● Partition-centric
● Cumbersome to scale: data redistribution, performance impact
[Diagram: three brokers, each holding a whole partition (one primary, two copies); processing & storage share the same nodes]
Apache Pulsar
● Storage decoupled from processing
● Partitions stored as segments
● Flexible, easy scalability
[Diagram: logical view of a partition as Segment 1, Segment 2, Segment 3 . . . Segment n; a processing layer of three brokers sits above a separate storage layer in which the segments are spread across nodes]

Partitions vs Segments: Why should you care?
• In Kafka, partitions are assigned to brokers “permanently”
• A single partition is stored entirely in a single node
• Retention is limited by a single node's storage capacity
• Failure recovery and capacity expansion require expensive “rebalancing”
• Rebalancing has a big impact on the system, affecting regular traffic

Apache Pulsar: Durability
[Diagram: Producer -> Broker -> three Bookies; each Bookie writes the entry to its Journal and calls fsync]
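
The journal-plus-fsync step in the diagram can be illustrated with the Java standard library alone. This is a minimal sketch of the general technique (append, then force to disk before acknowledging), not BookKeeper's journal implementation; the class name and the one-entry-per-line format are assumptions.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Toy journal: an entry counts as durable only after force() returns.
final class JournalSketch implements AutoCloseable {
    private final FileChannel channel;

    JournalSketch(Path file) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    // Append one entry and fsync before reporting success to the caller.
    void appendAndSync(String entry) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap((entry + "\n").getBytes(StandardCharsets.UTF_8));
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        channel.force(true); // flush file data and metadata to the storage device
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

In the replicated setting shown on the slide, the broker would acknowledge the producer only after the bookies have completed this step.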

Unified messaging model: Streaming
[Diagram: Producer 1 and Producer 2 publish M0 to M4 to a Pulsar topic/partition; Subscription A (Exclusive) delivers M0 to M4 to Consumer 1; Consumer 2 is rejected (X)]

Unified messaging model: Streaming
[Diagram: Producer 1 and Producer 2 publish M0 to M4 to a Pulsar topic/partition; Subscription B (Failover) delivers M0 to M4 to Consumer 1; Consumer 2 takes over in case of failure in Consumer 1]

Unified messaging model: Queuing
[Diagram: Producer 1 and Producer 2 publish M0 to M4 to a Pulsar topic/partition; Subscription C (Shared) spreads the messages over Consumer 1, Consumer 2 and Consumer 3]
Traffic is equally distributed across consumers
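
The three subscription modes differ only in which consumer a message is dispatched to. The toy dispatcher below sketches that difference in plain Java. It is not the Pulsar client API: the class, the failAfter parameter and the round-robin choice for Shared are assumptions, and redelivery of unacknowledged messages after a failover is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy dispatcher for the three subscription modes; not the Pulsar client API.
final class SubscriptionSketch {
    enum Mode { EXCLUSIVE, FAILOVER, SHARED }

    // Returns, per consumer, the messages it would receive.
    // failAfter is used only by FAILOVER: the number of messages delivered
    // before the first consumer fails (pass messages.size() for no failure).
    static Map<String, List<String>> dispatch(Mode mode, List<String> consumers,
                                              List<String> messages, int failAfter) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String c : consumers) {
            out.put(c, new ArrayList<>());
        }
        for (int i = 0; i < messages.size(); i++) {
            String target;
            switch (mode) {
                case EXCLUSIVE:
                    // Only the first consumer is attached; others get nothing.
                    target = consumers.get(0);
                    break;
                case FAILOVER:
                    // The standby consumer takes over once the active one fails.
                    target = (i < failAfter || consumers.size() < 2)
                            ? consumers.get(0) : consumers.get(1);
                    break;
                default:
                    // SHARED: round robin across all consumers.
                    target = consumers.get(i % consumers.size());
            }
            out.get(target).add(messages.get(i));
        }
        return out;
    }
}
```

With messages M0 to M4 and three consumers, SHARED gives Consumer 1 M0 and M3, Consumer 2 M1 and M4, and Consumer 3 M2. Ordering is preserved per consumer only in the Exclusive and Failover modes, which is why the slides label those "Streaming" and the Shared mode "Queuing".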

Replication for disaster recovery
[Diagram: Topic (T1) replicated across Data Center A, Data Center B and Data Center C; producers P1, P2 and P3 and consumers C1 and C2, each with Subscription (S1), attach in different data centers]
• Integrated in the broker message flow
• Simple configuration to add/remove regions
• Asynchronous (default) and synchronous replication

Apache Pulsar: Multi-tenancy
[Diagram: the example cluster with tenants Product Safety, Marketing and Data Serving and their namespaces and topics, annotated with 10 TB, 7 TB and 5 TB]
• Authentication
• Authorization
• Software isolation
  • Storage quotas, flow control, back pressure, rate limiting
• Hardware isolation
  • Constrain some tenants on a subset of brokers/bookies

Pulsar clients
Java
Python
Go
C++ C

Schema registry
• Provides type safety to applications built on top of Pulsar
• Two approaches
  • Client side - type safety enforcement up to the application
  • Server side - system enforces type safety and ensures that producers and consumers remain synced
• Schema registry enables clients to upload data schemas on a topic basis.
• Schemas dictate which data types are recognized as valid for that topic

How to process data modeled as streams
• Consume data as it is produced (pub/sub)
• Heavy weight compute - continuous data processing (DAG Processing)
• Light weight compute - transform and react to data as it arrives
• Interactive query of stored streams

Lessons learned: Use cases
A significant set of processing tasks is exceedingly simple
• Data transformations
• Data classification
• Data enrichment
• Data routing
• Data extraction and loading
• Real time aggregation
• Microservices

Light weight compute
Incoming Messages -> f(x) -> Output Messages
Abstract view of compute representation

Stream native compute using functions
Applying insight gained from serverless
• Simplest possible API: function or procedure
• Support for multiple languages
• Use native API for each language
• Scale developers
• Use of message bus native concepts - input and output as topics
• Flexible runtime - simple standalone applications vs managed system applications

Pulsar Functions
SDK-LESS API
import java.util.function.Function;

public class ExclamationFunction implements Function<String, String> {
    @Override
    public String apply(String input) {
        return input + "!";
    }
}
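
Since the function depends only on java.util.function.Function, it can be exercised without a Pulsar cluster. The sketch below repeats the slide's class (made package-private so everything fits in one file) and adds a hypothetical ExclamationDemo helper that calls it directly and composes it with another standard-library function.

```java
import java.util.function.Function;

// Same logic as the slide's function.
class ExclamationFunction implements Function<String, String> {
    @Override
    public String apply(String input) {
        return input + "!";
    }
}

// Hypothetical helper showing the function used as an ordinary Java object.
class ExclamationDemo {
    static String run(String input) {
        Function<String, String> f = new ExclamationFunction();
        // compose() runs String::trim first, then the exclamation function.
        Function<String, String> trimmed = f.compose(String::trim);
        return trimmed.apply(input);
    }
}
```

ExclamationDemo.run(" hello ") returns "hello!". The same property makes such functions easy to unit test before they are deployed.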

Processing guarantees
• ATMOST_ONCE
  • Message acked to Pulsar as soon as we receive it
• ATLEAST_ONCE
  • Message acked to Pulsar after the function completes
  • Default behavior - don’t want people to lose data
• EFFECTIVELY_ONCE
  • Uses Pulsar’s inbuilt effectively once semantics
• Controlled at runtime by user
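
The difference between the first two guarantees is purely a matter of when the ack happens relative to running the function. The toy model below records that order in plain Java. It is an illustration only: the class and step names are assumptions, and EFFECTIVELY_ONCE is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy model of when the ack happens relative to running the function.
final class AckOrderSketch {
    enum Guarantee { ATMOST_ONCE, ATLEAST_ONCE }

    // Processes one message and returns the ordered list of steps taken.
    static List<String> process(Guarantee g, String msg, Function<String, String> fn) {
        List<String> steps = new ArrayList<>();
        if (g == Guarantee.ATMOST_ONCE) {
            // Acked on receipt: if the function fails below, the message is lost.
            steps.add("ack");
        }
        try {
            fn.apply(msg);
            steps.add("processed");
            if (g == Guarantee.ATLEAST_ONCE) {
                // Acked only after the function completes.
                steps.add("ack");
            }
        } catch (RuntimeException e) {
            // Under ATLEAST_ONCE no ack was sent, so the message can be redelivered.
            steps.add("failed");
        }
        return steps;
    }
}
```

A failing function yields [ack, failed] under ATMOST_ONCE (the message is gone) and [failed] under ATLEAST_ONCE (the message is still unacknowledged), which is why the latter is the safer default.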

Deploying functions: Broker
[Diagram: Node 1, Node 2 and Node 3 each run a broker with an embedded worker; the workers host functions wordcount-1 and transform-2, transform-1 and dataroute-1, wordcount-2 and transform-3]

Deploying functions: Worker nodes
[Diagram: three dedicated workers host functions wordcount-1 and transform-2, transform-1 and dataroute-1, wordcount-2 and transform-3; Broker 1, Broker 2 and Broker 3 run separately]

Deploying functions: Kubernetes
[Diagram: each function runs in its own pod: wordcount-1, transform-1 and transform-3 in Pod 1 to Pod 3; dataroute-1, wordcount-2 and transform-2 in Pod 4 to Pod 6; Broker 1, Broker 2 and Broker 3 in Pod 7 to Pod 9]

Interactive querying of streams: Pulsar SQL
[Diagram: a Coordinator and four Segment Readers over the log (entries 1, 2, 3, 4 … 20, 21, 22, 23 … 40, 41, 42, 43 … 60, 61, 62, 63 …), whose Segments 1 to 4 are each stored on three nodes]

Pulsar performance: Publish rate

Pulsar performance: Latency

Apache Pulsar VS. Apache Kafka
• Multi-tenancy: a single cluster can support many tenants and use cases
• Seamless cluster expansion: expand the cluster without any downtime
• High throughput & low latency: can reach 1.8 M messages/s in a single partition, with a publish latency of 5 ms at the 99th percentile
• Durability: data replicated and synced to disk
• Geo-replication: out-of-the-box support for geographically distributed applications
• Unified messaging model: supports both topic and queue semantics in a single model
• Tiered storage: hot/warm data for real-time access and cold event data in cheaper storage
• Pulsar Functions: flexible lightweight compute
• Highly scalable: can support millions of topics, which makes data modeling easier

Apache Pulsar VS. Apache Kafka
https://jack-vanlightly.com/sketches/2018/10/2/kafka-vs-pulsar-rebalancing-sketch
Thanks to JACK VANLIGHTLY

Apache Pulsar: Tying solutions together
• Messaging: Pulsar Brokers
• Stream Storage: BookKeeper
• Tiered Storage: AWS S3, Google Cloud Storage, Azure Blob Storage, HDFS
• Analytics: Presto SQL
• Event Processing: Pulsar Functions, Complex Stream
• Pulsar IO: Cassandra, Kinesis, MySQL, MongoDB
• Other Frameworks
Apache Pulsar in production at scale
• 4+ years
• Serves 2.3 million topics
• 500 billion messages/day
• 400+ bookie nodes
• 150+ broker nodes
• Average latency < 5 ms
• 99.9th percentile latency: 15 ms (strong durability guarantees)
• Zero data loss
• 150+ applications
• Self-service provisioning
• Full-mesh cross-datacenter replication - 8+ data centers

Apache Pulsar community
• Twitter: @apache_pulsar
• WeChat Subscription: ApachePulsar
• Mailing Lists: dev@pulsar.apache.org, users@pulsar.apache.org
• Slack: https://apache-pulsar.slack.com
• Localization: https://crowdin.com/project/apache-pulsar
• GitHub: https://github.com/apache/pulsar, https://github.com/apache/bookkeeper

More readings
• Understanding How Pulsar Works
  https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar-works
• How To (Not) Lose Messages on Apache Pulsar Cluster
  https://jack-vanlightly.com/blog/2018/10/21/how-to-not-lose-messages-on-an-apache-pulsar-cluster

More readings
• Unified queuing and streaming
  https://streaml.io/blog/pulsar-streaming-queuing
• Segment centric storage
  https://streaml.io/blog/pulsar-segment-based-architecture
• Messaging, Storage or Both
  https://streaml.io/blog/messaging-storage-or-both
• Access patterns and tiered storage
  https://streaml.io/blog/access-patterns-and-tiered-storage-in-apache-pulsar
• Tiered Storage in Apache Pulsar
  https://streaml.io/blog/tiered-storage-in-apache-pulsar

Conclusion
Apache Pulsar - Cloud Native
Messaging, Compute, Storage
