Kubernetes + Operator + PaaSTA =
Flink@Yelp
Oct 9, 2019
Yelp’s Mission
Connecting people with great local businesses
WHAT YOU’LL SEE
What you’ll see
What Flink at Yelp looks like
What Yelp uses Flink for and what using Flink at Yelp looks like
How Kubernetes can power Flink
How Kubernetes and Operators can be used to power Flink cluster deployment and operations
Why platform integration matters
Why integrating Flink with Yelp’s platform as a service (PaaSTA) is the key to unlocking value for users
Flink@Yelp
FLINK@YELP
Powering
Data Enrichment and Transformation as a Service
StreamSQL manipulations and multi-stream unwindowed joins as a service
Real-time Notifications
Customized push notifications to suggest relevant businesses nearby
User Activity Sessions
Multi-platform user activity sessions built out of event logs
Powering
Connectors
The scale
~10 apps
~50 clusters
~1000 jobs
The Status Quo
THE STATUS QUO
Flink on AWS EMR
Meh.
Both complex and slow
Running a dockerized Puppet monolith, with a 15-minute boot time and a dependency on AWS for Flink updates
Still pretty manual
Each cluster needs trained operators to manually deploy new versions or scale up resources
Just different
A different UX and infrastructure from the rest of Yelp led to a high barrier to entry and knowledge impedance
Meet Kubernetes
MEET KUBERNETES
Hello, I’m...
an open-source system for automating deployment, scaling, and management of containerized applications. (The Internet)
I like...
Horizontal scaling
Scale applications up and down with a simple command or automatically based on CPU usage
Self-healing systems
Restart containers that fail, reschedule them when nodes die, and support user-defined health checks
Powerful primitives
Pods, ReplicaSets, Services, Jobs and friends can be used to model complex applications and workflows
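For the CPU-based horizontal scaling mentioned above, upstream Kubernetes documents a proportional rule for its Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The sketch below shows only that core rule. The function name, the bounds and the numbers are illustrative assumptions, and the real autoscaler adds a tolerance band and stabilization behavior that are left out here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_cpu_utilization: float,
                     target_cpu_utilization: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Core proportional scaling rule, clamped to configured bounds."""
    if target_cpu_utilization <= 0:
        raise ValueError("target utilization must be positive")
    raw = math.ceil(
        current_replicas * current_cpu_utilization / target_cpu_utilization
    )
    # Never go below the minimum or above the maximum replica count.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Three replicas running at 90% CPU against a 60% target.
    print(desired_replicas(3, 90.0, 60.0))
```

With three replicas at 90% CPU and a 60% target, the rule asks for more replicas; with load well under the target, it asks for fewer.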
My hobbies are...
Automatic bin packing
Place containers based on their requirements and constraints, to drive up utilization and save resources
Service discovery and load balancing
Give Pods their own IP addresses and a single DNS name for a set of Pods, and load-balance across them
Storage orchestration
Automatically mount the storage system of your choice and maintain state across application restarts
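To make the bin-packing idea above concrete, here is a toy first-fit-decreasing heuristic. It is not the Kubernetes scheduler's algorithm, which filters and scores nodes using many more signals. The pod names, the CPU requests and the single-resource model are assumptions chosen for illustration.

```python
from typing import Dict, List


def first_fit_decreasing(pod_cpus: Dict[str, float],
                         node_capacity: float) -> List[List[str]]:
    """Pack pods onto equally sized nodes, largest CPU request first."""
    nodes: List[List[str]] = []  # pod names placed on each node
    free: List[float] = []       # remaining CPU on each node
    ordered = sorted(pod_cpus.items(), key=lambda kv: kv[1], reverse=True)
    for name, cpu in ordered:
        if cpu > node_capacity:
            raise ValueError(f"{name} does not fit on any node")
        for i, remaining in enumerate(free):
            if cpu <= remaining:
                # The first node with enough room takes the pod.
                nodes[i].append(name)
                free[i] -= cpu
                break
        else:
            # No existing node has room, so start a new one.
            nodes.append([name])
            free.append(node_capacity - cpu)
    return nodes


if __name__ == "__main__":
    requests = {"jm": 1.0, "tm-0": 2.0, "tm-1": 2.0, "tm-2": 2.0}
    print(first_fit_decreasing(requests, node_capacity=4.0))
```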
Assembling Flink Clusters
ASSEMBLING FLINK CLUSTERS
Job Manager is a Deployment of a Pod
Pod
Co-located group of containers with shared storage, network and a spec for how to run the containers
Deployments
Provide declarative updates for Pods and ReplicaSets to automate container deployments and rollbacks
Task Managers are a Deployment of a ReplicaSet
ReplicaSets
Maintain a stable set of identical Pods running at any given time
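The Job Manager and Task Manager mapping above can be sketched as two `apps/v1` Deployment manifests built as plain Python dictionaries. This is a minimal illustration, not Yelp's actual manifests: the label scheme, the resource names and the `flink:latest` image are assumptions, and the `args` value relies on the public Flink Docker image accepting `jobmanager` or `taskmanager` as its command.

```python
from typing import Any, Dict


def flink_deployment(cluster: str, component: str, replicas: int,
                     image: str = "flink:latest") -> Dict[str, Any]:
    """Build a Deployment manifest for one Flink component."""
    # Hypothetical label scheme used to tie Pods to their Deployment.
    labels = {"app": "flink", "cluster": cluster, "component": component}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{cluster}-{component}", "labels": dict(labels)},
        "spec": {
            # The Deployment's ReplicaSet keeps this many Pods running.
            "replicas": replicas,
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [{
                        "name": component,
                        "image": image,
                        "args": [component],
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    job_manager = flink_deployment("demo", "jobmanager", replicas=1)
    task_managers = flink_deployment("demo", "taskmanager", replicas=3)
    print(job_manager["metadata"]["name"], task_managers["spec"]["replicas"])
```

Scaling the Task Managers then amounts to changing `replicas` and letting the Deployment converge.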
Static IPs or DNS are replaced by a Service and a Proxy
Service
Exposes an application running on a set of Pods as a network service regardless of their ephemeral IPs
Kube-proxy
Network proxy running on each node, reflecting Services and doing port forwarding and round-robin load balancing
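A sketch of how a Service can stand in for static addresses: the Service selects the Job Manager Pods by label, and Task Managers reach it through the Service's DNS name via Flink's `jobmanager.rpc.address` setting. The names and labels are assumptions. Ports 6123 and 8081 are Flink's default RPC and web UI ports, and a real deployment may use different ones.

```python
from typing import Any, Dict


def jobmanager_service(cluster: str) -> Dict[str, Any]:
    """Build a Service that fronts the Job Manager Pods of one cluster."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"{cluster}-jobmanager"},
        "spec": {
            # Hypothetical labels; they must match the Job Manager Pods.
            "selector": {
                "app": "flink",
                "cluster": cluster,
                "component": "jobmanager",
            },
            "ports": [
                {"name": "rpc", "port": 6123},
                {"name": "ui", "port": 8081},
            ],
        },
    }


def taskmanager_flink_conf(cluster: str) -> Dict[str, str]:
    """Point Task Managers at the Service name instead of a Pod IP."""
    service = jobmanager_service(cluster)
    return {"jobmanager.rpc.address": service["metadata"]["name"]}


if __name__ == "__main__":
    print(taskmanager_flink_conf("demo"))
```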
Flink jobs are deployed by the Supervisor
Flink Supervisor
Yelp’s in-house daemon responsible for deployment, state management and monitoring of Flink jobs on EMR
Cluster shutdown is signaled via a Job
Jobs
Create Pods and ensure that a specified number of them successfully terminate.
Kubernetes Operators
KUBERNETES OPERATORS
Operators are...
software extensions to Kubernetes that make use of custom resources to manage applications and their components. (The Internet)
Human VS K8s
Human Operator | Kubernetes Operator
manages a service or a set of services | manages a service or a set of services
has deep knowledge of how the system is expected to behave | has deep knowledge of how the system is expected to behave
knows how to deploy it | knows how to deploy it
knows how to react if there are problems | knows how to react if there are problems
uses automation for repetitive tasks | automates repetitive tasks
can only manage a limited number of instances | can manage a very high number of instances
Flink Custom Resource
Declarative model
Models the configuration and the deployment of a Flink cluster
State representation
Used by the operator to keep track of the state of any Flink cluster
Labels and Annotations
Used for selecting the components to update or to signal that the user requested a shutdown
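The three roles of the custom resource listed above can be pictured as a single reconcile step that compares the declared state with what is observed and returns the actions to take. This is a toy sketch and not Yelp's operator, whose code is not shown in these slides. The annotation key, the resource layout and the action strings are all hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical annotation a user could set to request a shutdown.
SHUTDOWN_ANNOTATION = "example.com/shutdown-requested"


def reconcile(resource: Dict[str, Any],
              observed_taskmanagers: int) -> List[str]:
    """Return the actions needed to move observed state to desired state."""
    annotations = resource.get("metadata", {}).get("annotations", {})
    if annotations.get(SHUTDOWN_ANNOTATION) == "true":
        # A shutdown request overrides any scaling decision.
        return ["delete-cluster"]
    desired = resource["spec"]["taskmanager"]["instances"]
    if observed_taskmanagers < desired:
        return [f"scale-up:{desired - observed_taskmanagers}"]
    if observed_taskmanagers > desired:
        return [f"scale-down:{observed_taskmanagers - desired}"]
    # Desired and observed state already match.
    return []


if __name__ == "__main__":
    cluster = {
        "metadata": {"annotations": {}},
        "spec": {"taskmanager": {"instances": 3}},
    }
    print(reconcile(cluster, observed_taskmanagers=1))
```

An operator would run a step like this in a loop, or on every change event, so that the cluster keeps converging on what the resource declares.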
Flink Dashboard is accessible via an Ingress rule
Ingress
Exposes HTTP and HTTPS routes from outside the cluster to services within the cluster
Ingress Controller
Ingresses and ingress rules are managed by their own “operator”
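An Ingress rule for the dashboard might look like the manifest built below, which routes a hostname to the Job Manager Service on Flink's default web UI port, 8081. The host, the resource names and the use of the `networking.k8s.io/v1` API version are assumptions for illustration. Clusters from the time of this talk would have used an older Ingress API version with a slightly different backend layout.

```python
from typing import Any, Dict


def dashboard_ingress(cluster: str, host: str) -> Dict[str, Any]:
    """Build an Ingress that routes a hostname to the Flink dashboard."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": f"{cluster}-dashboard"},
        "spec": {
            "rules": [{
                "host": host,
                "http": {
                    "paths": [{
                        "path": "/",
                        "pathType": "Prefix",
                        "backend": {
                            "service": {
                                # Hypothetical Service name for the Job Manager.
                                "name": f"{cluster}-jobmanager",
                                "port": {"number": 8081},
                            },
                        },
                    }],
                },
            }],
        },
    }


if __name__ == "__main__":
    ingress = dashboard_ingress("demo", "flink-demo.example.com")
    print(ingress["spec"]["rules"][0]["host"])
```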
Yelp PaaSTA
YELP PAASTA
PaaSTA is...
a highly-available, distributed system for building, deploying, and running services using containers and Apache Mesos (now Kubernetes). (Yelp)
Why integrate?
Consistent interface
Every PaaSTA user knows how to interact with any service regardless of its nature
Infrastructure as a Service
Whether it is a web server, a Cassandra cluster or a Flink job, to the user everything is a service
Platform engineers are users too
Shared infrastructure and tools are exposed as services, libraries and CLIs to platform developers
main:
  job_type: stateful
  checkpoint_interval_ms: 30000
  deploy_group: prod
  taskmanager:
    cpus: 2.0
    mem: 10G
    instances: 3
  checkpoint_path: s3://flink-state/service/main/checkpoints
  savepoint_path: s3://flink-state/service/main/savepoints
  flink_conf:
    taskmanager.network.detailed-metrics: "true"
    env.java.opts.taskmanager: "-XX:+UseConcMarkSweepGC"
Custom Resource Definition
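One way to picture how an instance configuration like the one above could become a Flink custom resource is the mapping sketched below. The slides do not show the real translation PaaSTA performs, so the API group, the kind and the field layout are hypothetical. The only Flink-specific names used are the `state.checkpoints.dir` and `state.savepoints.dir` configuration keys.

```python
from typing import Any, Dict

# The instance configuration from the slide, written as a Python dict.
PAASTA_INSTANCE: Dict[str, Any] = {
    "job_type": "stateful",
    "checkpoint_interval_ms": 30000,
    "deploy_group": "prod",
    "taskmanager": {"cpus": 2.0, "mem": "10G", "instances": 3},
    "checkpoint_path": "s3://flink-state/service/main/checkpoints",
    "savepoint_path": "s3://flink-state/service/main/savepoints",
    "flink_conf": {
        "taskmanager.network.detailed-metrics": "true",
        "env.java.opts.taskmanager": "-XX:+UseConcMarkSweepGC",
    },
}


def to_custom_resource(service: str, instance: str,
                       config: Dict[str, Any]) -> Dict[str, Any]:
    """Translate one instance configuration into a custom resource."""
    flink_conf = dict(config.get("flink_conf", {}))
    # Surface the state locations as regular Flink configuration keys.
    flink_conf["state.checkpoints.dir"] = config["checkpoint_path"]
    flink_conf["state.savepoints.dir"] = config["savepoint_path"]
    return {
        "apiVersion": "flink.example.com/v1alpha1",  # hypothetical group
        "kind": "FlinkCluster",                      # hypothetical kind
        "metadata": {"name": f"{service}-{instance}"},
        "spec": {
            "deploy_group": config["deploy_group"],
            "taskmanager": dict(config["taskmanager"]),
            "flink_conf": flink_conf,
        },
    }


if __name__ == "__main__":
    resource = to_custom_resource("service", "main", PAASTA_INSTANCE)
    print(resource["metadata"]["name"], resource["spec"]["taskmanager"])
```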
User Interaction
Check status
paasta status -s service -i instance -r region
Read logs
paasta logs -s service -i instance -n 100
Deploy a new version
git commit && git push origin master
The Future
THE FUTURE
Python on Beam on Flink on Kubernetes
Pipeline Builder
What’s next
Job Oriented Deployment
More isolation, faster restarts and simpler deployment by running a single job per Flink cluster
Reactive Container Mode and Autoscaling
Flink will automatically react to new resources available in K8s by rescaling the job (FLINK-10407)
Thinner Supervisor
Move savepoint, job lifecycle and configuration management from the Supervisor to the Operator
Should I do it?
SHOULD I DO IT?
Let’s do it!
O(1) people for O(N) clusters
A K8s operator allows you to scale up your number of Flink clusters without adding more human operators
Operators to codify knowledge
Codifying operational knowledge is easier than passing it all down to new hires
A catalyst for users
Once integrated with your platform, users don’t have to learn how to deploy or configure a Flink job anymore
Or maybe not
The Kubernetes Tax
Embedding Kubernetes into your platform requires a pretty solid effort, if you haven’t done it yet
(Build ∨ Buy) → Time
It takes some time to write your own operator or to fit an existing one into your platform
There is always the cloud
Cloud providers are starting to offer managed platforms based on Kubernetes operators
www.yelp.com/careers/
We're Hiring!
@YelpEngineering
fb.com/YelpEngineers
engineeringblog.yelp.com
github.com/yelp
Questions/Suggestions?
antonio@yelp.com
Thank you.

Kubernetes + Operator + PaaSTA = Flink @ Yelp - Antonio Verardi, Yelp
