Unifying Messaging, Compute, and Storage in Real-Time Data Processing
Apache Pulsar
Speaker: 翟佳 (Zhai Jia), Streamlio
What’s the state of the art
Apache Pulsar - Unify
[Diagram labels: Messaging (Pulsar Broker), Computing (Pulsar Functions), BookKeeper (Segment Store, Stream, Table)]
Why Apache Pulsar?
• Ordering: guaranteed ordering
• Multi-tenancy: a single cluster can support many tenants and use cases
• High throughput: can reach 1.8 M messages/s in a single partition
• Durability: data replicated and synced to disk
• Geo-replication: out-of-the-box support for geographically distributed applications
• Unified messaging model: supports both streaming and queuing in a single model
• Delivery guarantees: at least once, at most once, and effectively once
• Low latency: low publish latency of 5 ms at the 99th percentile
• Highly scalable: can support millions of topics
Pulsar Architecture
[Diagram: Producer and Consumer connected to Apache Pulsar (three Pulsar brokers), backed by Apache BookKeeper (Bookie 1 to Bookie 5)]
Separate layers between brokers and bookies
• Brokers and bookies can be added independently
• Traffic can be shifted very quickly across brokers
• New bookies will ramp up on traffic quickly
Messaging
Messaging - Concepts
Messaging - Namespace
Messaging - Queuing & Streaming
• Streaming (Kafka, Kinesis, …)
• Queuing (SQS, ActiveMQ, RabbitMQ, …)
Messaging - ACK
• Individual
• Cumulative
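The two acknowledgment modes can be illustrated with a toy model. This is a minimal sketch and not Pulsar's implementation: the class `AckTracker` and its methods are hypothetical, and message IDs are simplified to plain integers. A cumulative ack marks every message up to and including the given ID as acknowledged, while an individual ack marks only that one message.

```java
import java.util.TreeSet;

// Hypothetical toy model of one subscription's acknowledgment state.
final class AckTracker {
    // Every message id <= cumulativeUpTo counts as acknowledged.
    private long cumulativeUpTo = -1;
    // Ids above the cumulative mark that were acknowledged one by one.
    private final TreeSet<Long> individuallyAcked = new TreeSet<>();

    // Cumulative ack: acknowledges id and everything before it.
    void ackCumulative(long id) {
        cumulativeUpTo = Math.max(cumulativeUpTo, id);
        // Individual acks now covered by the cumulative mark are redundant.
        individuallyAcked.headSet(cumulativeUpTo, true).clear();
    }

    // Individual ack: acknowledges exactly one message.
    void ackIndividual(long id) {
        if (id > cumulativeUpTo) {
            individuallyAcked.add(id);
        }
    }

    boolean isAcked(long id) {
        return id <= cumulativeUpTo || individuallyAcked.contains(id);
    }

    public static void main(String[] args) {
        AckTracker tracker = new AckTracker();
        tracker.ackCumulative(2);  // acknowledges 0, 1, 2
        tracker.ackIndividual(5);  // acknowledges 5 only; 3 and 4 stay pending
        for (long id = 0; id <= 5; id++) {
            System.out.println(id + " acked=" + tracker.isAcked(id));
        }
    }
}
```

In this model, messages 3 and 4 remain unacknowledged after the two calls in `main`, which is the gap an individual ack is allowed to leave behind.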
Messaging - Retention
Storage
Storage - Apache BookKeeper
• A replicated log storage
• Low-latency durable writes
• Simple repeatable read consistency
• Highly available
• Store many logs per node
• I/O Isolation
Storage - Segment Centric
Storage - Segment/Stream/Table
Compute
Compute Representation
Abstract View: Incoming Messages → f(x) → Output Messages
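The abstract view can be written down directly with the JDK. This is a minimal sketch that assumes messages are in-memory values rather than topic entries; the class `ComputeView` and its `run` method are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration: compute as a function applied to each message.
final class ComputeView {
    // Applies f to every incoming message, in order, and collects the outputs.
    static <I, O> List<O> run(List<I> incoming, Function<I, O> f) {
        return incoming.stream().map(f).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> incoming = Arrays.asList("a", "b", "c");
        // f(x) = x + "!"
        System.out.println(run(incoming, x -> x + "!"));
    }
}
```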
Lessons learnt
• A significant percentage of transformations are simple
  • ETL / Reactive Services / Classification / Real-time Aggregation
  • Event Routing / Microservices
• The emergence of Serverless
  • Simple Function API
  • Run per event
  • Composition APIs to do complex things
  • Wildly popular
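The "simple function plus composition" idea can be sketched with the JDK's own `Function.andThen`. This is an illustration under assumptions, not any serverless platform's composition API: the class `CompositionSketch` and the three small steps are hypothetical.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical illustration: building a larger transformation from small functions.
final class CompositionSketch {
    static final Function<String, String> TRIM = s -> s.trim();
    static final Function<String, String> UPPER = s -> s.toUpperCase(Locale.ROOT);
    static final Function<String, String> EXCLAIM = s -> s + "!";

    // andThen chains the steps left to right: trim, then upper-case, then exclaim.
    static final Function<String, String> PIPELINE = TRIM.andThen(UPPER).andThen(EXCLAIM);

    public static void main(String[] args) {
        System.out.println(PIPELINE.apply("  hello "));
    }
}
```

Each step stays trivially testable on its own, and the pipeline is still a single per-event function.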
What's needed: Stream-Native Compute
• Insight gained from serverless
• Simplest possible API
  • Method/Procedure/Function
• Multi-language API
  • Scale developers
• Stream-native concepts
  • Input/Output/Log as topics
• Flexible runtime
  • Simple standalone applications vs. system-managed applications
Pulsar Functions - API
SDK-less API
    import java.util.function.Function;

    // Plain java.util.function.Function: no Pulsar dependency required.
    public class ExclamationFunction implements Function<String, String> {
        @Override
        public String apply(String input) {
            return input + "!";
        }
    }
SDK API
    import org.apache.pulsar.functions.api.Context;
    import org.apache.pulsar.functions.api.PulsarFunction;

    // Same logic, with the function's Context passed in alongside the input.
    public class ExclamationFunction implements PulsarFunction<String, String> {
        @Override
        public String process(String input, Context context) {
            return input + "!";
        }
    }
Pulsar Functions
Running as a standalone application
    bin/pulsar-admin functions localrun \
      --input persistent://sample/standalone/ns1/test_input \
      --output persistent://sample/standalone/ns1/test_result \
      --className org.mycompany.ExclamationFunction \
      --jar myjar.jar
• Runs as a standalone process
• Run as many instances as you want; the framework automatically balances data
• Run and manage via Mesos/K8s/Nomad/your favorite tool
Pulsar Functions: Use Cases
Edge Computing
• Sensor devices generate tons of data
• Lots of local actions
  • Simple filtering, threshold detection, regex matching, etc.
• Resource constrained
  • Limited scope for full-blown schedulers/job managers
Model Serving
• Models computed via offline analysis
• Incoming requests should be classified using the model
• Function is a natural representation for the classification action
• Model itself can be stored in BookKeeper
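The edge use case (regex matching plus threshold detection) fits in a single SDK-less function. The following is a minimal sketch under assumptions: the message format `temp=<number>`, the class name `ThresholdFilter`, and the `ALERT` output text are all hypothetical, and an empty `Optional` is used here simply to mean "emit nothing".

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical edge function: parse a sensor reading and alert above a threshold.
final class ThresholdFilter implements Function<String, Optional<String>> {
    // Assumed message format, e.g. "temp=81.5".
    private static final Pattern READING = Pattern.compile("^temp=(-?\\d+(\\.\\d+)?)$");
    private final double threshold;

    ThresholdFilter(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public Optional<String> apply(String message) {
        Matcher m = READING.matcher(message);
        if (!m.matches()) {
            return Optional.empty();  // regex filter: drop malformed readings
        }
        double value = Double.parseDouble(m.group(1));
        // Threshold detection: only readings above the limit produce output.
        return value > threshold ? Optional.of("ALERT " + value) : Optional.empty();
    }

    public static void main(String[] args) {
        ThresholdFilter filter = new ThresholdFilter(80.0);
        System.out.println(filter.apply("temp=81.5"));
        System.out.println(filter.apply("temp=20"));
        System.out.println(filter.apply("garbage"));
    }
}
```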
Pulsar Functions
• Unify messaging and compute clusters into one
• Function executed for every message of the input topic
• Supports multiple topics as inputs
• Runtime user-controlled guarantees: ATMOST_ONCE / ATLEAST_ONCE / EFFECTIVE_ONCE
• Built-in state management: unified stream and state store with BookKeeper
• Simplified application development
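What per-message execution combined with built-in state looks like can be sketched without any Pulsar dependency. This is an illustration only: the class `WordCountSketch` is hypothetical, and the in-memory `HashMap` is a stand-in for the BookKeeper-backed state store, whose actual API is not shown in these slides.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stateful function: counts how often each word has been seen.
final class WordCountSketch {
    // Stand-in for the state store; a real deployment would not keep this in process memory.
    private final Map<String, Long> state = new HashMap<>();

    // Called once per input message; returns the updated count for that word.
    long process(String word) {
        return state.merge(word, 1L, Long::sum);
    }

    public static void main(String[] args) {
        WordCountSketch counter = new WordCountSketch();
        for (String word : new String[] {"pulsar", "bookkeeper", "pulsar"}) {
            System.out.println(word + " -> " + counter.process(word));
        }
    }
}
```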
Unified Streaming Solution
[Diagram: DATA exchanged between the Pulsar stack (Pulsar Broker for messaging, Pulsar Functions for computing, BookKeeper for Segment Store, Stream, and Table) and Spark, Flink, HDFS, …]
Messaging Benchmark
https://github.com/openmessaging/openmessaging-benchmark
Benchmark
• Testing goals
• Throughput & latency under different conditions
• Min 2 guaranteed copies
• Running on 3 EC2 VMs with local SSDs
Kafka settings
• Topic settings
replicationFactor=3
min.insync.replicas=2
log.flush.interval.ms= # Using default: means no fsyncs
• Kafka producer config
acks=all
linger.ms=1
batch.size=131072
Pulsar/BookKeeper settings
• Use ensemble=3 write=3 ack=2
• Data synced on disk before ack
• Pulsar publisher settings:
batchingEnabled : true
batchingMaxPublishDelayMs : 1
blockIfQueueFull : true
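The ensemble=3 write=3 ack=2 setting above can be read as: each entry is sent to all three bookies, and the write is acknowledged once two of them have confirmed it, which matches the benchmark's goal of at least two guaranteed copies. The following is a minimal sketch of that counting rule only; the class `AckQuorumSketch` and its method are hypothetical and do not reflect BookKeeper's client code.

```java
// Hypothetical illustration of the ack-quorum rule (write=3, ack=2).
final class AckQuorumSketch {
    // responses[i] is true if the i-th bookie to respond confirmed the write.
    // Returns the 1-based number of responses after which the entry is acknowledged,
    // or -1 if fewer than ackQuorum bookies confirm.
    static int acknowledgedAfter(boolean[] responses, int ackQuorum) {
        int confirmed = 0;
        for (int i = 0; i < responses.length; i++) {
            if (responses[i]) {
                confirmed++;
            }
            if (confirmed == ackQuorum) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // All three bookies confirm: acknowledged after the second response.
        System.out.println(acknowledgedAfter(new boolean[] {true, true, true}, 2));
        // One bookie fails: the write still succeeds once the third responds.
        System.out.println(acknowledgedAfter(new boolean[] {true, false, true}, 2));
        // Two bookies fail: the ack quorum is never reached.
        System.out.println(acknowledgedAfter(new boolean[] {true, false, false}, 2));
    }
}
```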
Max throughput
Setup: 1 topic, 1 partition, 1 producer, 1 consumer, 1 KB messages
Latency at fixed rate - 50K msg/s
Setup: 1 topic, 1 partition, 1 producer, 1 consumer, 1 KB messages
Curious to Learn More?
• Apache Pulsar : http://pulsar.incubator.apache.org
• Apache BookKeeper : http://bookkeeper.apache.org
• Technical Blog : https://streaml.io/blog/
• Twitter: @apache_pulsar @asfbookkeeper
• Slack:
  • https://apache-pulsar.herokuapp.com/
  • https://apachebookkeeper.herokuapp.com/