1
A Serverless Approach to Data
Processing using Apache Pulsar
Karthik Ramasamy

Cofounder and Chief Product Officer

karthik@streaml.io
2
Event Driven Architectures
The rise of Real Time
Big Data began with batch
HDFS/MapReduce/Hive
Reaction times became important
Reduce time between data arrival and data analysis/action
Emergence of real-time streaming systems
3
What do we really mean by real time?
Aims
Aim is to react to events as they happen, in real time
Where do events happen/arrive?
Message bus
What’s a reaction?
An action/transformation/function
4
Compute Representation
Abstract View
Incoming Messages -> f(x) -> Output Messages
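The abstract view above can be sketched in plain Java: a function f is applied to each incoming message and the results form the output messages. The class and method names here (AbstractView, process) are hypothetical and exist only for illustration; no messaging system is involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration: stream processing reduced to a plain function f(x).
class AbstractView {
    // Applies f to every incoming message, preserving arrival order.
    static <I, O> List<O> process(List<I> incoming, Function<I, O> f) {
        return incoming.stream().map(f).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> out = process(Arrays.asList("a", "b"), s -> s + "!");
        System.out.println(out); // [a!, b!]
    }
}
```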
5
Traditional Compute representation
DAG
[Figure: a DAG with Source 1 and Source 2, three Action nodes, and Sink 1 and Sink 2]
6
Traditional Compute API
All of this is stitched together by programmers
public static class SplitSentence extends BaseBasicBolt {
    // Declares the single output field emitted by this bolt.
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }

    // Splits each incoming sentence on spaces and emits one tuple per word.
    @Override
    public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
        String sentence = tuple.getStringByField("sentence");
        String[] words = sentence.split(" ");
        for (String w : words) {
            basicOutputCollector.emit(new Values(w));
        }
    }
}
7
Traditional Compute API
All of this is stitched together by programmers
public static class WordCount extends BaseBasicBolt {
    // Running count per word, kept in this bolt instance's memory.
    Map<String, Integer> counts = new HashMap<>();

    // Increments the count for the incoming word and emits the updated (word, count) pair.
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) {
            count = 0;
        }
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
8
Compute API 2.0
Functional
Builder.newBuilder()
    .newSource(() -> StreamletUtils.randomFromList(SENTENCES))
    .flatMap(sentence -> Arrays.asList(sentence.toLowerCase().split("\\s+")))
    .reduceByKeyAndWindow(word -> word, word -> 1,
        WindowConfig.TumblingCountWindow(50),
        (x, y) -> x + y);
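To make the windowed reduce in the pipeline above more concrete, here is a minimal plain-Java sketch of counting words over tumbling count windows, that is, consecutive non-overlapping groups of a fixed number of elements. It does not use the streamlet API from the slide. The class name TumblingCountSketch and the choice to drop a trailing incomplete window are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of "reduce by key over a tumbling count window".
class TumblingCountSketch {
    // Returns one word -> count map per full window of windowSize words.
    // Assumption: a trailing window with fewer than windowSize words is dropped.
    static List<Map<String, Integer>> countPerWindow(List<String> words, int windowSize) {
        List<Map<String, Integer>> windows = new ArrayList<>();
        for (int start = 0; start + windowSize <= words.size(); start += windowSize) {
            Map<String, Integer> counts = new TreeMap<>();
            for (String w : words.subList(start, start + windowSize)) {
                counts.merge(w, 1, Integer::sum); // same role as (x, y) -> x + y
            }
            windows.add(counts);
        }
        return windows;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "a");
        System.out.println(countPerWindow(words, 2)); // [{a=1, b=1}, {a=1, c=1}]
    }
}
```

Each window starts from an empty map, so counts never carry over between windows; that is the difference from the running totals kept by the WordCount bolt.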
9
Compute API 2.0
Characteristics
Compact
Complicated
Map vs FlatMap
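The "Map vs FlatMap" distinction can be illustrated with the JDK's own java.util.stream API, used here as a stand-in for the streaming DSL on the previous slide. The class and method names (MapVsFlatMap, lengths, words) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MapVsFlatMap {
    // map: exactly one output element per input element.
    static List<Integer> lengths(List<String> sentences) {
        return sentences.stream().map(String::length).collect(Collectors.toList());
    }

    // flatMap: zero or more output elements per input element, flattened into one stream.
    static List<String> words(List<String> sentences) {
        return sentences.stream()
                .flatMap(s -> Arrays.stream(s.toLowerCase().split("\\s+")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("Hello world", "Pulsar");
        System.out.println(lengths(sentences)); // [11, 6]
        System.out.println(words(sentences));   // [hello, world, pulsar]
    }
}
```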
10
Traditional Real-Time Systems
Separate
[Figure: Messaging and Compute shown as separate systems]
11
Traditional Real-Time Systems
Developer Experience
Powerful API but complicated
Does everyone really need to learn functional programming?
Configurable/Scalable but with management overhead
Edge systems have resource/manageability constraints
12
Traditional Real-Time Systems
Operational Experience
Another system to operate is one too many
IoT deployments routinely have thousands of edge systems
Semantic differences
Mismatch/Duplication between systems
Creates developer and operator friction
13
Lessons learnt
Use Cases
A significant percentage of transformations are simple
ETL
Reactive Services
Classification
Real-time Aggregation
Event Routing
Microservices
14
Meanwhile
The world of cloud
The emergence of Serverless
Simple function API
Functions are submitted to the system
Run per event
Composition APIs to do complex things
Wildly popular
15
Serverless vs Streaming
What is the difference?
Both are event-driven architectures
Both can be used for analytics/serving
Both have composition APIs
Configuration based for Serverless vs DSL based for Streaming
Serverless typically doesn’t care about ordering
Really a function of the underlying source
Pay per action
Really a matter of the product’s billing interface
16
What’s needed? Stream-Native Compute
Insight gained from Serverless
Simplest possible API
Method/Procedure/Function
Multi-language API
Scale developers
Message-bus-native concepts
Input/Output/Log as topics
Flexible runtime
Simple standalone applications vs system-managed applications
17
Introducing Apache Pulsar Functions
18
Apache Pulsar
19
What is Apache Pulsar?
Hyper Converged Data Platform that includes
Messaging
Durable log storage
Lightweight Processing
Open Source
20
How different is Apache Pulsar?
Ordering: guaranteed ordering
Multi-tenancy: a single cluster can support many tenants and use cases
High throughput: can reach 1.8 M messages/s in a single partition
Durability: data replicated and synced to disk
Geo-replication: out-of-the-box support for geographically distributed applications
Unified messaging model: supports both streaming and queuing in a single model
Delivery guarantees: at least once, at most once, and effectively once
Low latency: low publish latency of 5 ms at the 99th percentile
Highly scalable: can support millions of topics
21
Pulsar Architecture
Stateless Serving
[Figure: Producer and Consumer connect to three Pulsar brokers (Apache Pulsar), backed by Bookie 1 through Bookie 5 (Apache BookKeeper)]
BROKER
Clients interact only with brokers
No state is stored in brokers
BOOKIES
Apache BookKeeper as the storage
Storage is append only
Provides high performance, low latency
Durability
No data loss; fsync before acknowledgement
22
Pulsar Architecture
Separation of Storage and Serving
[Figure: Producer and Consumer connect to three Pulsar brokers (Apache Pulsar), backed by Bookie 1 through Bookie 5 (Apache BookKeeper)]
SERVING
Brokers can be added independently
Traffic can be shifted quickly across brokers
STORAGE
Bookies can be added independently
New bookies will ramp up traffic quickly
23
Segment Centric Storage
24
Flexible Messaging Model
25
Multi Tenancy
26
Multi Cluster Replication
[Figure: Topic (T1) replicated across Data Center A, Data Center B, and Data Center C, with producers P1, P2, P3, consumers C1, C2, and subscription S1]
27
Back to Pulsar Functions
28
Pulsar Functions
API
SDK-less API
import java.util.function.Function;

public class ExclamationFunction implements Function<String, String> {
    @Override
    public String apply(String input) {
        return input + "!";
    }
}
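Because the SDK-less form is just a java.util.function.Function, it can be exercised directly without a broker or any Pulsar dependency. The sketch below repeats the slide's logic under a different, hypothetical class name (ExclamationSketch) so that it is self-contained.

```java
import java.util.function.Function;

// Same shape as the slide's ExclamationFunction, renamed so this sketch stands alone.
class ExclamationSketch implements Function<String, String> {
    @Override
    public String apply(String input) {
        return input + "!";
    }

    public static void main(String[] args) {
        Function<String, String> f = new ExclamationSketch();
        System.out.println(f.apply("hello")); // hello!
    }
}
```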
29
Pulsar Functions
API
SDK API
import org.apache.pulsar.functions.api.PulsarFunction;
import org.apache.pulsar.functions.api.Context;

public class ExclamationFunction implements PulsarFunction<String, String> {
    @Override
    public String process(String input, Context context) {
        return input + "!";
    }
}
30
Pulsar Functions
Input and Output
Function is executed for every message of the input topic
Supports multiple topics as inputs
Function output goes to the output topic
Function output can be void/null
SerDe takes care of serialization/deserialization of messages
Custom SerDe can be provided by users
Integrates with Schema Registry
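A custom SerDe is, in essence, a pair of conversions between a typed value and the bytes carried in a message. The sketch below shows that idea with a locally defined, hypothetical interface (SketchSerDe) rather than the Pulsar interface itself; the Reading type and its comma-separated wire format are also assumptions made for this example, and sensor names are assumed not to contain commas.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a SerDe contract; not the Pulsar interface itself.
interface SketchSerDe<T> {
    byte[] serialize(T value);
    T deserialize(byte[] bytes);
}

// Hypothetical message type used only for this example.
class Reading {
    final String sensor;
    final double value;

    Reading(String sensor, double value) {
        this.sensor = sensor;
        this.value = value;
    }
}

// Encodes a Reading as "sensor,value" in UTF-8 and decodes it back.
class ReadingSerDe implements SketchSerDe<Reading> {
    @Override
    public byte[] serialize(Reading r) {
        return (r.sensor + "," + r.value).getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public Reading deserialize(byte[] bytes) {
        // Split on the first comma only: the sensor name, then the numeric value.
        String[] parts = new String(bytes, StandardCharsets.UTF_8).split(",", 2);
        return new Reading(parts[0], Double.parseDouble(parts[1]));
    }
}
```

A round trip (serialize, then deserialize) should return an equivalent Reading, which makes such a class easy to unit test on its own.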
31
Pulsar Functions
Processing Guarantees
ATMOST_ONCE
Message is acked to Pulsar as soon as it is received
ATLEAST_ONCE
Message is acked to Pulsar after the function completes
Default behavior: not many people want to lose data
EFFECTIVELY_ONCE
Uses Pulsar’s built-in effectively-once semantics
Controlled at runtime by the user
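The difference between the first two guarantees comes down to when the ack happens relative to running the function. The following is a minimal simulation of that ordering, not the Pulsar runtime: the class AckOrderSketch and its handle method are hypothetical, the "ack" is just a recorded step, and EFFECTIVELY_ONCE is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical simulation of ack timing; no real broker is involved.
class AckOrderSketch {
    enum Guarantee { ATMOST_ONCE, ATLEAST_ONCE }

    // Returns the sequence of steps taken for one message.
    static List<String> handle(String msg, Function<String, String> fn, Guarantee g) {
        List<String> steps = new ArrayList<>();
        if (g == Guarantee.ATMOST_ONCE) {
            steps.add("ack"); // acked on receipt: a failure in fn loses the message
        }
        try {
            fn.apply(msg);
            steps.add("processed");
        } catch (RuntimeException e) {
            steps.add("failed");
            return steps; // at-least-once: no ack was sent, so redelivery is possible
        }
        if (g == Guarantee.ATLEAST_ONCE) {
            steps.add("ack"); // acked only after the function completes
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(handle("m", s -> s, Guarantee.ATMOST_ONCE));  // [ack, processed]
        System.out.println(handle("m", s -> s, Guarantee.ATLEAST_ONCE)); // [processed, ack]
    }
}
```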
32
Pulsar Functions
Built-in State
Functions can store state in StreamStore
Framework provides a simple library around this
Supports server-side operations like counters
Simplified application development
No need to stand up an extra system
33
Pulsar Functions
WordCount Topology
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.PulsarFunction;

public class CounterFunction implements PulsarFunction<String, Void> {
    @Override
    public Void process(String input, Context context) throws Exception {
        // String.split takes a regex, so the dot is escaped to split on a literal "."
        for (String word : input.split("\\.")) {
            context.incrCounter(word, 1);
        }
        return null;
    }
}
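The counting logic can be tried locally by swapping the framework's state for an in-memory map. In this sketch, CounterContext and InMemoryContext are hypothetical stand-ins that expose only an incrCounter call; they are not the Pulsar Context type. The input is split on a literal dot, written as "\\." because String.split interprets its argument as a regular expression.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, minimal stand-in for the counter call used on the slide.
interface CounterContext {
    void incrCounter(String key, long amount);
}

// Keeps counters in a local map instead of framework-managed state.
class InMemoryContext implements CounterContext {
    final Map<String, Long> counters = new HashMap<>();

    @Override
    public void incrCounter(String key, long amount) {
        counters.merge(key, amount, Long::sum);
    }
}

class CounterSketch {
    // Increments one counter per dot-separated token in the input.
    static void process(String input, CounterContext context) {
        for (String word : input.split("\\.")) {
            context.incrCounter(word, 1);
        }
    }

    public static void main(String[] args) {
        InMemoryContext ctx = new InMemoryContext();
        process("a.b.a", ctx);
        System.out.println(ctx.counters.get("a")); // 2
        System.out.println(ctx.counters.get("b")); // 1
    }
}
```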
34
Built-in State Management
Pulsar uses BookKeeper as its stream storage
Functions can store state in BookKeeper
Framework provides the Context object for users to access state
Supports server-side operations like counters
Simplified application development
No need to stand up an extra system to develop/test/integrate/operate
35
Pulsar Functions
Running as a standalone application
bin/pulsar-admin functions localrun \
  --input persistent://sample/standalone/ns1/test_input \
  --output persistent://sample/standalone/ns1/test_result \
  --className org.mycompany.ExclamationFunction \
  --jar myjar.jar
Runs as a standalone process
Run as many instances as you want; the framework automatically balances data
Run and manage via Mesos/K8s/Nomad/your favorite tool
36
Pulsar Functions
Running inside Pulsar cluster
‘Create’ and ‘Delete’ Functions in a Pulsar cluster
Pulsar brokers run functions as threads, processes, or Docker containers
Unifies the messaging and compute clusters into one, significantly improving manageability
Ideal match for edge or small startup environments
Serverless in a jar
37
Pulsar Functions
Stepping back: Where Pulsar Functions belong
Powerful/Complicated systems have their place
Data Centers/Cloud
Complex analysis
A significant percentage of analytics/actions are mundane
ETL/Counting/Routing
Use simple tools for simple things
38
Pulsar Functions: Use Cases
Edge Computing
Sensor devices generate tons of data
We need local actions
Simple filtering, threshold detection, regex matching, etc.
Manageability is a big concern
The fewer moving parts, the better
Resource constrained
Limited scope for full-blown schedulers/job managers
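As an illustration of the kind of local action listed above, here is a minimal threshold-detection function written against java.util.function.Function. The class name ThresholdFilter, the "ALERT" output format, and the assumption that each message is a single numeric reading are all hypothetical. Returning null is used to mean "emit nothing", in line with the earlier slide noting that function output can be void/null.

```java
import java.util.function.Function;

// Hypothetical edge filter: forwards a reading only when it exceeds a threshold.
class ThresholdFilter implements Function<String, String> {
    private final double threshold;

    ThresholdFilter(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public String apply(String input) {
        double value;
        try {
            value = Double.parseDouble(input.trim());
        } catch (NumberFormatException e) {
            return null; // drop malformed readings
        }
        return value > threshold ? "ALERT " + value : null;
    }

    public static void main(String[] args) {
        ThresholdFilter filter = new ThresholdFilter(30.0);
        System.out.println(filter.apply("31.5")); // ALERT 31.5
        System.out.println(filter.apply("12"));   // null
    }
}
```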
39
Pulsar Functions: Use Cases
Model Serving
Models computed via offline analysis
Incoming requests should be classified using the model
Function is a natural representation for the classification action
Model itself can be stored in BookKeeper
40
Roadmap
More language support: Go, JavaScript, C++
Cross Functions: Function Composition API
More state operations exposed to Functions
41
Apache Pulsar in Production
3+ years
Serves 2.3 million topics
100 billion messages/day
Average latency < 5 ms
99th percentile latency 15 ms (strong durability guarantees)
Zero data loss
80+ applications
Self-service provisioning
Full-mesh cross-datacenter replication across 8+ data centers
42
Companies Using Apache Pulsar and BookKeeper
43
Conclusion
Stream-Native Compute (aka Functions) is the new paradigm in Messaging Systems
Stream-Native Storage (aka States) is the new paradigm in Storage Systems
Pulsar Functions brings lightweight computing capability into the messaging and storage system, which is the trend that streaming applications need
https://pulsar.incubator.apache.org/docs/latest/functions/quickstart/
44
Questions and Thank You!
karthik@streaml.io
45
State Storage w/ BookKeeper
The built-in state management is powered by the Table Service in BookKeeper
BP-30: Table Service
Originated as built-in metadata management within BookKeeper
Exposed for general usage, e.g. state management for Pulsar Functions
Developer Preview
Pulsar Functions in Pulsar 2.0
Direct usage in BookKeeper 4.7
46
State Storage w/ BookKeeper
Updates are written to the log streams in BookKeeper
Materialized into a key/value table view
The key/value table is indexed with RocksDB for fast lookup
The source of truth is the log streams in BookKeeper
RocksDB instances are transient key/value indexes
RocksDB instances are incrementally checkpointed and stored in BookKeeper for fast recovery
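The relationship between the log, the materialized table, and checkpoints can be sketched in a few lines of plain Java. This is an in-memory illustration only: the class LogBackedTable is hypothetical, an ArrayList stands in for the BookKeeper log stream, a HashMap stands in for the RocksDB index, and the checkpoint here is a full copy rather than the incremental checkpoint described on the slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sketch of a key/value table materialized from an update log.
class LogBackedTable {
    // Append-only log of (key, delta) updates: the source of truth.
    private final List<String[]> log = new ArrayList<>();
    // Materialized key/value view; transient, can always be rebuilt from the log.
    private Map<String, Long> view = new HashMap<>();
    // Last checkpoint: a copy of the view plus the log offset it covers.
    private Map<String, Long> checkpoint = new HashMap<>();
    private int checkpointOffset = 0;

    // Writes the update to the log first, then applies it to the view.
    void increment(String key, long delta) {
        log.add(new String[] {key, Long.toString(delta)});
        view.merge(key, delta, Long::sum);
    }

    void takeCheckpoint() {
        checkpoint = new HashMap<>(view);
        checkpointOffset = log.size();
    }

    // Simulates losing the local index: restore the checkpoint, then replay the log tail.
    void recover() {
        view = new HashMap<>(checkpoint);
        for (int i = checkpointOffset; i < log.size(); i++) {
            String[] entry = log.get(i);
            view.merge(entry[0], Long.parseLong(entry[1]), Long::sum);
        }
    }

    Long get(String key) {
        return view.get(key);
    }
}
```

Recovery only replays the entries written after the checkpoint, which is why checkpointing shortens recovery time without the index ever becoming the source of truth.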

Data Con LA 2018 - A Serverless Approach to Data Processing using Apache Pulsar by Karthik Ramasamy