Data observability
Agenda
Here’s our next 30 minutes
- Intro
- Why?
- How?
- Problems?
Who am I?
Anastasia Khlebnikova
Senior backend/data engineer at Spotify.
What do I do?
Part of the Data and Insights tribe at Spotify
Our team owns: one of the biggest services at Spotify (~1M rps) and
one of the biggest pipelines at Spotify: anonymization of event
delivery
Data observability. WHY?
Where is the data coming from?
[Diagram: Event Delivery System → Pseudonymization → Cloud Storage. Pseudonymization pipelines run every hour for every event type, at 8 million events per second.]
A bit of scale
● 8,000,000 events per second
● The largest event types reach around 8 billion events per hour
● Over 400 unique event types, published in separate datasets
● 500 TB of data a day
● We used to own the largest Hadoop cluster in Europe
How does it feel to be on-call?
If you need to hot-fix something in production, it is like changing a flat tire on a car going 200 km/h down the highway without stopping it! The longer your system is stopped, the longer it takes to catch up, and the time to catch up for downstream consumers grows exponentially.
Who needs that much data?
Once delivered, events are processed by numerous data jobs running at Spotify. The delivered data serves many different use cases: producing music recommendations, analysing our A/B tests, or analysing our client crashes. Most importantly, it is used to calculate the royalties paid to artists based on generated streams.
Data observability. HOW?
Make sure your data is discoverable
Annotating your data is the key to avoiding piles of mess!!!
Upsides:
➢ Other people can use/find your data
➢ Sensitive data in the dataset? Encrypt based on annotations
➢ Easy mapping in your code like schema <-> case class
➢ Easier to find which key to join on
Downsides:
➢ you have to do it. Once
Schema Example
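A minimal sketch of what such an annotated schema might look like, assuming a hypothetical Scala case-class mapping; the annotation names and fields below are illustrative, not Spotify's actual annotation system:

```scala
import scala.annotation.StaticAnnotation

// Illustrative only: a hypothetical annotated event schema mapped onto a Scala case class
// (schema <-> case class). Field-level annotations are what let tooling discover the
// dataset, encrypt sensitive fields, and point out which key to join on.
final class sensitive extends StaticAnnotation // pseudonymize/encrypt this field
final class joinKey   extends StaticAnnotation // preferred key to join this dataset on

final case class StreamEvent(
  @joinKey   userId: String,    // pseudonymized user identifier
  @sensitive ipAddress: String, // sensitive: encrypted based on the annotation
  trackId: String,
  timestamp: Long               // epoch millis, drives hourly partitioning
)
```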
Monitor your pipelines. Execution time
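A minimal sketch of execution-time monitoring, assuming hypothetical `runPipeline` and `alert` hooks rather than any specific scheduler: time the run and alert when it drifts past the expected duration.

```scala
import java.time.Duration

// Illustrative sketch: time a pipeline run and alert when it exceeds the expected duration.
// `runPipeline` and `alert` are stand-ins for whatever your scheduler/monitoring provides.
object ExecutionTimeMonitor {
  def timed[A](expected: Duration)(runPipeline: => A)(alert: Duration => Unit): A = {
    val start  = System.nanoTime()
    val result = runPipeline
    val took   = Duration.ofNanos(System.nanoTime() - start)
    if (took.compareTo(expected) > 0) alert(took) // e.g. page on-call or mark the run on a dashboard
    result
  }
}
```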
Monitor your pipelines. Count it!
Never produce corrupt data! Implement as many sanity checks as possible.
Example: your pipeline encrypts each row in the dataset based on its user_id, and falls back to a random key otherwise (impossible to decrypt).
Count it: count the percentage of rows where the user_id could not be found or parsed, and alert if it increases by more than, say, 10%.
Sanity check your data and alert!
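A minimal sketch of such a counter, in plain Scala over an in-memory collection; in a real Dataflow/Scio job these would be pipeline counters, and the `Row` shape and 10% default are assumptions.

```scala
// Illustrative sketch: count the share of rows whose user_id could not be found or parsed,
// and alert above a threshold. In a real pipeline these would be counters on the job
// rather than an in-memory collection.
final case class Row(userId: Option[String], payload: String)

object SanityCheck {
  def unparsedRatio(rows: Seq[Row]): Double =
    if (rows.isEmpty) 0.0
    else rows.count(_.userId.isEmpty).toDouble / rows.size

  def checkAndAlert(rows: Seq[Row], threshold: Double = 0.10)(alert: Double => Unit): Unit = {
    val ratio = unparsedRatio(rows)
    if (ratio > threshold) alert(ratio) // e.g. fail the job before publishing the partition
  }
}
```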
Monitor your pipelines. Money
● GCP Dataflow
● GCP BigQuery
● GCP Cloud Storage
Monitor your pipelines. Money. Real incident
[Chart: spend ($) over time, with a visible spike during the incident.]
Monitor your pipelines. Money. Per System
Taking GDPR as an example: how much does it cost to serve a “Download your data” request, and how much should go into it?
What to monitor:
● How much do we pay for every request?
● Of that cost, what is the cost of every pipeline that contributes to it?
● How many requests do people actually open?
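A back-of-the-envelope sketch of the first two questions, with entirely made-up pipeline names and cost numbers: sum the contributing costs and divide by the requests actually opened.

```scala
// Back-of-the-envelope sketch with made-up numbers: the monthly cost of every pipeline
// that contributes to serving a "Download your data" request, divided by the number of
// requests people actually opened that month.
object CostPerRequest {
  val contributingCosts: Map[String, Double] = Map( // hypothetical monthly costs in $
    "dataflow-export-job" -> 1200.0,
    "bigquery-scans"      -> 800.0,
    "gcs-storage"         -> 400.0
  )

  def costPerRequest(requestsOpened: Long): Double =
    contributingCosts.values.sum / requestsOpened.max(1)

  def main(args: Array[String]): Unit =
    println(f"Cost per request: $$${costPerRequest(requestsOpened = 500)}%.2f")
}
```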
Set up retention! Storage is ⅓ of the cost
● Set up a default retention: remove partitions after the expiration date
● Profile the storage: can cold storage be used (cheap to store, expensive to access)?
● It adds up: multi-regional vs regional buckets. Where is the data accessed from?
● How is the data used: BigQuery or pipelines?
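A minimal sketch of the first bullet, assuming hourly partitions identified by their start timestamp; in practice this is usually a GCS lifecycle rule or BigQuery partition expiration rather than custom code.

```scala
import java.time.{Duration, Instant}

// Sketch only: apply a default retention to hourly partitions by listing the partitions
// whose start timestamp is older than the expiration window.
object Retention {
  def expiredPartitions(
      partitionStarts: Seq[Instant], // start timestamp of each hourly partition
      retention: Duration,           // e.g. Duration.ofDays(90)
      now: Instant = Instant.now()
  ): Seq[Instant] =
    partitionStarts.filter(_.isBefore(now.minus(retention)))
}
```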
Monitor your pipelines. Alerts on failures
SLAs for your partitions
We have a concept of low, normal and high priority for events, which gives us different SLAs for different events depending on importance (6h, 24h, 72h). Thanks to that we know which events to recover first when shit hits the fan. It also made our lives better: normal-priority events will not alert during nights, and low-priority events will not alert during weekends.
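A minimal sketch of that paging policy: the priority-to-SLA mapping follows the slide, while the night and weekend windows are assumptions.

```scala
import java.time.{DayOfWeek, Duration, ZonedDateTime}

// Illustrative encoding of the paging policy above: each priority has its own SLA,
// normal-priority breaches do not page at night, low-priority breaches do not page
// on weekends. The night hours are an assumption; the SLAs follow the slide.
sealed trait Priority { def sla: Duration }
case object High   extends Priority { val sla: Duration = Duration.ofHours(6)  }
case object Normal extends Priority { val sla: Duration = Duration.ofHours(24) }
case object Low    extends Priority { val sla: Duration = Duration.ofHours(72) }

object SlaAlerting {
  private def isNight(now: ZonedDateTime): Boolean   = now.getHour < 8 || now.getHour >= 22
  private def isWeekend(now: ZonedDateTime): Boolean =
    now.getDayOfWeek == DayOfWeek.SATURDAY || now.getDayOfWeek == DayOfWeek.SUNDAY

  /** Page right now only if the partition's delay has breached its priority's SLA. */
  def shouldPage(priority: Priority, delay: Duration, now: ZonedDateTime): Boolean = {
    val breached = delay.compareTo(priority.sla) > 0
    breached && (priority match {
      case High   => true            // always page
      case Normal => !isNight(now)   // wait until morning
      case Low    => !isWeekend(now) // wait until Monday
    })
  }
}
```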
Dashboards!
Does your infra lose the BCD?
Business-critical data: royalty calculations, user accounts, ads, etc.
➔ High SLO
➔ “Special treatment” when recovering from an incident
➔ ...and special observability, since the volume of “BCD” events is limited
How to prove that no data is lost?
[Diagram: SDK → Service → Receiver Service → Pub/Sub → job that makes hourly partitions, dedups and anonymizes → hourly partitions.]
Who is watching the watcher
[Diagram: the same delivery path as above, plus a streaming job counting NACKed and rejected events into a Counting Service, whose per-hour totals are compared against the hourly partitions.]
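A minimal sketch of the final “compare” step, assuming both the counting service and the hourly partitions can be summarized as per-hour counts; names and the tolerance are illustrative.

```scala
// Sketch of the completeness check: the independent counting service reports how many
// events were accepted per hour (minus NACKed/rejected ones), and we check that each
// hourly partition actually contains that many rows.
object WatchTheWatcher {
  final case class HourlyGap(expected: Long, delivered: Long) {
    def missing: Long = expected - delivered
  }

  /** Hours whose delivered row count falls short of the independently counted total. */
  def incompleteHours(
      countingService: Map[String, Long], // hour -> events the counting service saw accepted
      partitions: Map[String, Long],      // hour -> rows present in the hourly partition
      tolerance: Long = 0L
  ): Map[String, HourlyGap] =
    countingService.flatMap { case (hour, expected) =>
      val delivered = partitions.getOrElse(hour, 0L)
      if (expected - delivered > tolerance) Some(hour -> HourlyGap(expected, delivered))
      else None
    }
}
```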
Data observability. Bottom line
Why bother?
We use a lot of data, and processing and storing it is EXPENSIVE. How much profit does it bring, though?
Thank you!
