Building a Streaming Microservice Architecture: With Spark Structured Streaming and Friends
Scott Haines
Senior Principal Software Engineer
Introductions
▪ I work at Twilio
▪ Over 10 years working on Streaming Architectures
▪ Helped bring a Streaming-First Spark Architecture to Voice & Voice Insights
▪ Lead Spark Office Hours @ Twilio
▪ Love Distributed Systems
About Me
Scott Haines: Senior Principal Software Engineer @newfront
Agenda
The Big Picture: What the Architecture looks like
Protocol Buffers: What they are. Why they rule!
GRPC / Protocol Streams: Versioned Data Lineage as a Service
How this fits into Spark: Structured Streaming with Protobuf support
The Big Picture
Streaming Microservice Architecture
[Diagram: a GRPC Client calls GRPC Servers over HTTP/2 (1-3); the Servers publish to Kafka Brokers (4-5); a Spark Application consumes from Kafka (6) and writes to HDFS / S3 (7-8)]
Streaming Microservice Architecture
[Diagram: a GRPC Server feeds Kafka Topics; Spark Applications consume each Topic, publish to downstream Kafka Topics, and materialize Data Tables]
Protocol Buffers aka protobuf
Protocol Buffers
▪ Strict Types
▪ Enforce structure at compile time
▪ Similar to StructType in Apache Spark
▪ Interoperable with Spark via ExpressionEncoding extension
▪ Versioning API / Data Pipeline
▪ Compiled protobuf (*.proto) can be released like normal code
▪ Interoperable
▪ Pick your favorite programming language, then compile and release.
▪ Supports Java, Scala, C++, Go, Objective-C, Node.js, Python and more
Why use them?
Protocol Buffers
▪ Code Gen
▪ Automatically generate Builder classes
▪ Being lazy is okay!
▪ Optimized
▪ Messages are optimized and ship with their own Serialization/Deserialization mechanics (SerDe)
Why use them?
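As a concrete sketch of the code-gen and built-in SerDe bullets above: with ScalaPB, `protoc` emits a case class per message (the Java generator emits Builder classes instead), and every message carries its own binary serialization. The `TrackedAd` message and its package here are hypothetical, standing in for whatever your compiled `.proto` defines.

```scala
// Hypothetical scalapb-generated case class for a TrackedAd message.
import com.acme.ads.tracking.TrackedAd // hypothetical generated package

val ad = TrackedAd(adId = "ad-123", campaignId = "cmp-9", clicked = true)

// Built-in SerDe: compact binary wire format, no hand-written codec
val bytes: Array[Byte]  = ad.toByteArray
val roundTripped        = TrackedAd.parseFrom(bytes)
assert(roundTripped == ad)
```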
GRPC and Protocol Streams
gRPC
▪ High Performance
▪ Compact Binary Exchange Format
▪ Make API calls to the Server as if they were local to the Client
▪ Cross Language/Cross Platform
▪ Autogenerate API definitions for idiomatic client and server – just implement the interfaces
▪ Bi-Directional Streaming
▪ Pluggable support for streaming with HTTP/2 transport
What is it?
[Diagram: a GRPC Client calls GRPC Servers over HTTP/2]
GRPC Example: AdTracking
GRPC
▪ Define Messages
▪ What kind of Data are you sending?
▪ Example: Click Tracking / Impression Tracking
▪ What is necessary for the public interface?
▪ Example: AdImpression and Response
How it works?
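The message half of the AdTracking example might be defined like this. The field names and numbers are hypothetical; only the message names come from the slide.

```protobuf
syntax = "proto3";

package ads.tracking;

// Hypothetical request message for click/impression tracking
message AdImpression {
  string ad_id       = 1;
  string campaign_id = 2;
  int64  occurred_at = 3; // epoch millis
}

// Hypothetical response message acknowledging the tracked event
message AdTrackResponse {
  bool   accepted = 1;
  string message  = 2;
}
```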
GRPC
▪ Service Definition
▪ Compile your rpc definition to generate Service Interfaces
▪ Uses the same protobuf definition (service.proto) as your Client/Server request and response objects
▪ Can be used to create a binding Service Contract within your organization or publicly
How it works?
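A minimal sketch of what that service.proto service block could look like, assuming `AdImpression` and `AdTrackResponse` messages are defined in the same file (the rpc names mirror the `ClickTrackService.adTrack` call shown later in the architecture slide):

```protobuf
// Hypothetical service definition living alongside the messages
service ClickTrackService {
  // Unary call: one impression in, one ack out
  rpc AdTrack (AdImpression) returns (AdTrackResponse);

  // Bi-directional streaming over the HTTP/2 transport
  rpc AdTrackStream (stream AdImpression) returns (stream AdTrackResponse);
}
```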
GRPC
▪ Implement the Service
▪ Compilation of the Service auto-generates your interfaces.
▪ Just implement the service contracts.
How it works?
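A sketch of the "just implement the contracts" step, assuming scalapb-grpc generated a `ClickTrackServiceGrpc.ClickTrackService` trait from a hypothetical `ClickTrackService` definition in service.proto (package and type names are illustrative):

```scala
// Hypothetical: implement the trait generated from service.proto
import scala.concurrent.Future
import ads.tracking.{AdImpression, AdTrackResponse, ClickTrackServiceGrpc}

class ClickTrackServiceImpl extends ClickTrackServiceGrpc.ClickTrackService {

  // The only code you write is the business logic behind the contract;
  // transport, SerDe, and routing are handled by the generated layer.
  override def adTrack(request: AdImpression): Future[AdTrackResponse] = {
    // e.g. validate / enrich, then emit downstream (Kafka) before acking
    Future.successful(AdTrackResponse(accepted = true))
  }
}
```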
GRPC
▪ Protocol Streams
▪ Messages (protobuf) are emitted to Kafka topic(s) from the Server Layer
▪ Protocol Streams are now available from the Kafka Topics bound to a given Service / Collection of Messages
▪ Sets up Spark for the Hand-Off
How it works?
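The emit step above can be sketched with the plain Kafka producer API: the server serializes each protobuf message to bytes and publishes it to the topic (the `ads.click.stream` topic name comes from the architecture slide; the `AdImpression` type and broker address are assumptions):

```scala
// Hypothetical hand-off: serialize the protobuf and emit it to the topic
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}
import ads.tracking.AdImpression

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")

// Keys as strings, values as raw protobuf bytes
val producer = new KafkaProducer[String, Array[Byte]](
  props, new StringSerializer, new ByteArraySerializer)

def emit(ad: AdImpression): Unit =
  producer.send(new ProducerRecord("ads.click.stream", ad.adId, ad.toByteArray))
```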
GRPC
System Architecture
[Diagram: a GRPC Client calls GRPC Servers over HTTP/2; the Servers publish to Kafka Brokers]
Topic: ads.click.stream
Client: service.adTrack(trackedAd)
Server: ClickTrackService.adTrack(trackedAd)
Structuring Protocol Streams:
with Structured Streaming
and protobuf
Structured Streaming with Protobuf
▪ Expression Encoding
▪ Natively interop with Protobuf in Apache Spark.
▪ Protobuf to Case Class conversion from scalapb.
▪ Product encoding comes for free via import sparkSession.implicits._
From Protocol Buffer to StructType through ExpressionEncoders
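A sketch of that hand-off in Structured Streaming: read the raw bytes from the topic and parse them with the scalapb-generated companion. This assumes a `spark` session, the hypothetical `AdImpression` message from earlier, and an Encoder for the generated case class in scope (e.g. via the sparksql-scalapb helpers, since the implicit Product encoders do not always cover generated messages out of the box):

```scala
// Hypothetical sketch: Kafka -> protobuf bytes -> typed Dataset
import spark.implicits._ // Product encoders, as on the slide
import ads.tracking.AdImpression

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "ads.click.stream")
  .load()

// Each Kafka record value is the compact protobuf binary payload
val impressions = raw
  .select($"value".as[Array[Byte]])
  .map(bytes => AdImpression.parseFrom(bytes))
```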
Structured Streaming with Protobuf
▪ Native is Better
▪ Strict native Kafka to DataFrame conversion with no need for transformation to intermediary types
▪ Mutations and Joins can be done across the DataFrame or Dataset APIs.
▪ Create Real-Time Data Pipelines, Machine Learning Pipelines and More.
▪ Rest at night knowing the pipelines are safe!
From Protocol Buffer to StructType through ExpressionEncoders
Structured Streaming with Protobuf
▪ Strict Data Writer
▪ Compiled / Versioned Protobuf can be used to strictly enforce the format of your Writers as well
▪ Use Protobuf to define the StructType that can be used in your conversions to *Parquet (* must abide by parquet nesting rules)
▪ Declarative Input / Output means that Streaming Applications don’t go down due to incompatible Data Streams
▪ Can also be used with Delta so that the version of the schema lines up with the compiled Protobuf.
From Protocol Buffer to StructType through ExpressionEncoders
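The strict-writer side is then just a sink on the typed stream: the schema that lands on disk is exactly the one the compiled protobuf dictates. This assumes `impressions` is a typed Dataset parsed from the topic as in the read-side sketch; the paths are placeholders:

```scala
// Hypothetical: the on-disk schema is derived from the compiled protobuf,
// not a hand-maintained StructType that can drift
val query = impressions.writeStream
  .format("parquet") // or "delta" to version the schema with the proto
  .option("path", "/data/ads/click_stream")
  .option("checkpointLocation", "/checkpoints/ads_click_stream")
  .start()
```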
Structured Streaming with Protobuf
▪ Real World Use Case
▪ Close of Books Data Lineage Job
▪ Uses End to End Protobuf
▪ Enables teams to move quickly, with guarantees regarding the Data being published and at what Frequency
▪ Can be emitted at different speeds to different locations based on configuration
Example: Streaming Transformation Pipeline
Streaming Microservice Architecture
[Diagram (recap): a GRPC Client calls GRPC Servers over HTTP/2 (1-3); the Servers publish to Kafka Brokers (4-5); a Spark Application consumes from Kafka (6) and writes to HDFS / S3 (7-8)]
Recap
What We Learned
Protobuf
▪ Language Agnostic Structured Data
▪ Compile Time Guarantees
▪ Lightning Fast Serialization/Deserialization

GRPC
▪ Language Agnostic Binary Services
▪ Low-Latency
▪ Compile Time Guarantees
▪ Smart Framework

Kafka
▪ Highly Available
▪ Native Connector for Spark
▪ Topic Based Binary Protobuf Store
▪ Use to Pass Records to one or more Downstream Services

Structured Streaming
▪ Handle Data Reliably
▪ Protobuf to Dataset / DataFrames is awesome
▪ Parquet / Delta plays nice as a Columnar Data Exchange format
Thanks @newfrontcreative
@newfront
Feedback
Your feedback is important to us. Don’t forget to rate and review the sessions.