Building a Streaming Microservice Architecture: with Apache Spark Structured Streaming and Friends
The document outlines a streaming microservice architecture utilizing Spark Structured Streaming, gRPC, and Protocol Buffers. It explains the benefits of Protocol Buffers for data serialization and the integration of gRPC for efficient client-server communication, enabling real-time data pipelining and robust data lineage. Key topics include the architecture's components, strict data typing, and end-to-end use cases for improved data handling in various programming languages.
Scott Haines
Senior Principal Software Engineer
Introductions: About Me
Scott Haines: Senior Principal Software Engineer @newfront
▪ I work at Twilio
▪ Over 10 years working on Streaming Architectures
▪ Helped bring a Streaming-First Spark Architecture to Voice & Voice Insights
▪ Leads Spark Office Hours @ Twilio
▪ Loves Distributed Systems
Agenda
▪ The Big Picture: what the architecture looks like
▪ Protocol Buffers: what they are and why they rule
▪ gRPC / Protocol Streams: versioned data lineage as a service
▪ How this fits into Spark: Structured Streaming with Protobuf support
Protocol Buffers: Why use them?
▪ Strict Types
  ▪ Enforce structure at compile time
  ▪ Similar to StructType in Apache Spark
  ▪ Interoperable with Spark via an ExpressionEncoder extension
▪ Versioning the API / Data Pipeline
  ▪ Compiled protobuf (*.proto) can be released like normal code
▪ Interoperable
  ▪ Pick your favorite programming language, compile, and release
  ▪ Supports Java, Scala, C++, Go, Obj-C, Node.js, Python, and more
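To make the strict-typing point concrete, here is a minimal message definition; the ClickEvent name and its fields are hypothetical illustrations, not taken from the talk:

  // click_event.proto (hypothetical example)
  syntax = "proto3";
  package ads.v1;

  // Strictly typed message: every field has a declared type and a
  // stable field number, which is what makes versioned releases safe.
  message ClickEvent {
    string ad_id     = 1; // unique identifier of the ad
    string user_id   = 2; // who clicked
    int64  timestamp = 3; // epoch millis of the click
  }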
Protocol Buffers: Why use them? (continued)
▪ CodeGen
  ▪ Automatically generate Builder classes
  ▪ Being lazy is okay!
▪ Optimized
  ▪ Messages are optimized and ship with their own serialization/deserialization mechanics (SerDe)
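With ScalaPB (the Scala protobuf code generator the later Spark slides rely on), the generated class for the hypothetical ClickEvent above can be constructed and round-tripped with no hand-written SerDe. A minimal sketch, assuming ScalaPB's default package naming:

  // Assumes ScalaPB generated ads.v1.click_event.ClickEvent
  // from the hypothetical click_event.proto above.
  import ads.v1.click_event.ClickEvent

  val event = ClickEvent(
    adId = "ad-123",
    userId = "u-456",
    timestamp = System.currentTimeMillis())

  // Generated messages ship with their own SerDe:
  val bytes: Array[Byte] = event.toByteArray           // serialize
  val parsed: ClickEvent = ClickEvent.parseFrom(bytes) // deserialize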
gRPC: What is it?
▪ High Performance
  ▪ Compact binary exchange format
  ▪ Make API calls to the server as if they were local to the client
▪ Cross Language / Cross Platform
  ▪ Autogenerate API definitions for idiomatic clients and servers: just implement the interfaces
▪ Bi-Directional Streaming
  ▪ Pluggable support for streaming over HTTP/2 transport
[Diagram: a gRPC client connects to multiple gRPC servers over HTTP/2]
gRPC: How it works
▪ Define Messages
  ▪ What kind of data are you sending?
  ▪ Example: Click Tracking / Impression Tracking
  ▪ What is necessary for the public interface?
  ▪ Example: AdImpression and Response (see the sketch below)
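A hedged sketch of what those public-interface messages could look like; the names and fields are illustrative (TrackedAd is borrowed from the architecture slide later in the deck), not the talk's actual definitions:

  // service messages (hypothetical example)
  syntax = "proto3";
  package ads.v1;

  // The request the client sends for click/impression tracking.
  message TrackedAd {
    string ad_id         = 1;
    string campaign_id   = 2;
    int64  impression_ts = 3;
  }

  // The acknowledgement returned by the server.
  message AdTrackResponse {
    bool   accepted = 1;
    string message  = 2;
  }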
gRPC: How it works
▪ Service Definition
  ▪ Compile your rpc definition to generate service interfaces
  ▪ Uses the same protobuf definition (service.proto) as your client/server request and response objects
  ▪ Can be used to create a binding service contract within your organization or publicly
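Continuing the hypothetical example, the service definition lives alongside the request/response messages in service.proto; the unary and bi-directional streaming rpcs below are illustrative:

  // service.proto (hypothetical, continuing the example above)
  syntax = "proto3";
  package ads.v1;

  service ClickTrackService {
    // Unary call: one TrackedAd in, one AdTrackResponse out.
    rpc AdTrack (TrackedAd) returns (AdTrackResponse);

    // Bi-directional streaming over HTTP/2: a stream of impressions
    // in, a stream of acknowledgements out.
    rpc AdTrackStream (stream TrackedAd) returns (stream AdTrackResponse);
  }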
gRPC: How it works
▪ Implement the Service
  ▪ Compilation of the service auto-generates your interfaces
  ▪ Just implement the service contracts
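With ScalaPB's gRPC codegen, implementing the contract means filling in the generated trait. A sketch under the same hypothetical definitions; the generated names (ClickTrackServiceGrpc and its inner trait) follow ScalaPB conventions but may differ with other codegen settings:

  import scala.concurrent.Future
  import io.grpc.stub.StreamObserver
  import ads.v1.service.{TrackedAd, AdTrackResponse, ClickTrackServiceGrpc}

  // Only the business logic is hand-written; transport, SerDe, and
  // HTTP/2 plumbing all come from the generated code.
  class ClickTrackServiceImpl extends ClickTrackServiceGrpc.ClickTrackService {

    // Unary contract.
    override def adTrack(request: TrackedAd): Future[AdTrackResponse] =
      Future.successful(
        AdTrackResponse(accepted = true, message = s"tracked ${request.adId}"))

    // Bi-directional streaming contract.
    override def adTrackStream(
        responses: StreamObserver[AdTrackResponse]): StreamObserver[TrackedAd] =
      new StreamObserver[TrackedAd] {
        def onNext(ad: TrackedAd): Unit =
          responses.onNext(AdTrackResponse(accepted = true, message = ad.adId))
        def onError(t: Throwable): Unit = responses.onError(t)
        def onCompleted(): Unit = responses.onCompleted()
      }
  }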
gRPC: How it works
▪ Protocol Streams
  ▪ Messages (protobuf) are emitted to Kafka topic(s) from the server layer
  ▪ Protocol Streams are now available from the Kafka topics bound to a given service / collection of messages
  ▪ Sets up Spark for the hand-off (sketched below)
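The hand-off can be as simple as writing each message's own binary encoding to the topic. A sketch, assuming the hypothetical TrackedAd message and the ads.click.stream topic named on the architecture slide:

  import java.util.Properties
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
  import ads.v1.service.TrackedAd

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer")

  val producer = new KafkaProducer[String, Array[Byte]](props)

  // Emit the protobuf's own binary encoding; downstream consumers
  // deserialize with TrackedAd.parseFrom.
  def emit(ad: TrackedAd): Unit =
    producer.send(new ProducerRecord("ads.click.stream", ad.adId, ad.toByteArray))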
gRPC: System Architecture
[Diagram: a gRPC client calls service.adTrack(trackedAd) over HTTP/2; the gRPC servers handle it via ClickTrackService.adTrack(trackedAd) and emit the protobuf records to the Kafka brokers on topic ads.click.stream]
Structured Streaming with Protobuf: From Protocol Buffers to StructType through ExpressionEncoders
▪ Expression Encoding
  ▪ Natively interop with protobuf in Apache Spark
  ▪ Protobuf to case class conversion comes from scalapb
  ▪ Product encoding comes for free via import sparkSession.implicits._
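Because ScalaPB generates plain Scala case classes (Products), the slide's point is that Spark's implicit product encoders can pick them up directly. A sketch with the hypothetical TrackedAd; note that, depending on your ScalaPB version and options, you may instead need the sparksql-scalapb module (import scalapb.spark.Implicits._) rather than the plain product encoder:

  import org.apache.spark.sql.{Dataset, SparkSession}
  import ads.v1.service.TrackedAd

  val spark = SparkSession.builder()
    .appName("protobuf-encoding")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._ // product ExpressionEncoders, per the slide

  // The generated case class encodes straight into a typed Dataset;
  // its StructType is derived from the protobuf field types.
  val ads: Dataset[TrackedAd] = Seq(
    TrackedAd(adId = "ad-1", campaignId = "c-9", impressionTs = 1L)).toDS()

  ads.printSchema() // StructType mirroring the .proto definition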
Structured Streaming with Protobuf
▪ Native is Better
  ▪ Strict, native Kafka to DataFrame conversion with no need for transformation to intermediary types (sketched below)
  ▪ Mutations and joins can be done across the DataFrame or Dataset APIs
  ▪ Create real-time data pipelines, machine learning pipelines, and more
  ▪ Rest at night knowing the pipelines are safe!
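A sketch of the native hand-off, assuming the ads.click.stream topic carries the hypothetical TrackedAd bytes produced earlier:

  import org.apache.spark.sql.{Dataset, SparkSession}
  import ads.v1.service.TrackedAd

  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._

  // Kafka bytes straight to the typed Dataset: no JSON/Avro hop.
  val stream: Dataset[TrackedAd] = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "ads.click.stream")
    .load()
    .select($"value").as[Array[Byte]]
    .map(bytes => TrackedAd.parseFrom(bytes))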
Structured Streaming with Protobuf
▪ Strict Data Writer
  ▪ Compiled / versioned protobuf can be used to strictly enforce the format of your writers, too
  ▪ Use protobuf to define the StructType used in your conversions to Parquet (you must abide by Parquet nesting rules); see the sketch below
  ▪ Declarative input/output means streaming applications don't go down due to incompatible data streams
  ▪ Can also be used with Delta so that the version of the table schema lines up with the compiled protobuf
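A sketch of a strict writer: the output schema is derived from the protobuf-generated case class rather than declared by hand, so the Parquet/Delta schema versions along with the compiled .proto (the paths are illustrative):

  import org.apache.spark.sql.Encoders
  import org.apache.spark.sql.types.StructType
  import ads.v1.service.TrackedAd

  // Derived from the compiled protobuf, never hand-written:
  val adSchema: StructType = Encoders.product[TrackedAd].schema

  stream // the typed Dataset[TrackedAd] from the previous sketch
    .writeStream
    .format("parquet") // or "delta"
    .option("path", "/data/ads/clicks")
    .option("checkpointLocation", "/checkpoints/ads-clicks")
    .start()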
Structured Streaming with Protobuf
▪ Real-World Use Case: Streaming Transformation Pipeline
  ▪ Close-of-books data lineage job
  ▪ Uses protobuf end to end
  ▪ Enables teams to move quickly, with guarantees about the data being published and at what frequency
  ▪ Can be emitted at different speeds to different locations based on configuration (see the trigger sketch below)
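The "different speeds to different locations" point maps naturally onto Structured Streaming triggers. A hedged sketch in which the destinations and intervals come from hypothetical configuration values, reusing the stream Dataset from the earlier sketches:

  import org.apache.spark.sql.streaming.Trigger

  // Hypothetical configuration: (destination path, emit interval) pairs.
  val sinks = Seq(
    ("/data/ads/close-of-books", "1 hour"),
    ("/data/ads/realtime",       "30 seconds"))

  // One query per destination, each publishing on its own cadence.
  sinks.foreach { case (path, interval) =>
    stream.writeStream
      .format("parquet")
      .option("path", path)
      .option("checkpointLocation", s"$path/_checkpoints")
      .trigger(Trigger.ProcessingTime(interval))
      .start()
  }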
What We Learned
▪ Protobuf
  ▪ Language-agnostic structured data
  ▪ Compile-time guarantees
  ▪ Lightning-fast serialization/deserialization
▪ gRPC
  ▪ Language-agnostic binary services
  ▪ Low latency
  ▪ Compile-time guarantees
  ▪ Smart framework
▪ Kafka
  ▪ Highly available
  ▪ Native connector for Spark
  ▪ Topic-based binary protobuf store
  ▪ Use it to pass records to one or more downstream services
▪ Structured Streaming
  ▪ Handles data reliably
  ▪ Protobuf to Dataset / DataFrames is awesome
  ▪ Parquet / Delta plays nice as a columnar data exchange format