Data observability
Agenda
Here’s our next 30 minutes
- Intro
- Why?
- How?
- Problems?
Who am I?
Anastasia Khlebnikova
Senior backend/data engineer at Spotify.
What do I do?
Part of the Data and Insights tribe at Spotify
Our team owns: one of the biggest services at Spotify (~1M rps) and
one of the biggest pipelines at Spotify: anonymization of event
delivery
Data observability. WHY?
Where is the data coming from?
[Diagram: Event Delivery System → Pseudonymization → Cloud Storage. Pseudonymization pipelines run every hour for every event type, at 8 million events per second.]
A bit of scale
● 8,000,000 events per second
● The largest event types reach around 8 billion events per hour
● Over 400 unique event types, published in separate datasets
● 500 TB of data a day
● We used to own the largest Hadoop cluster in Europe
How does it feel to be on-call?
If you need to hot-fix something in production, it is like changing a flat tire on a car going 200 km/h down the highway without stopping it! The longer your system is stopped, the longer it takes to catch up, and the time to catch up for downstream consumers grows exponentially.
Who needs that much data?
Once delivered, events are processed by numerous data jobs running at Spotify. The delivered data serves many different use cases: producing music recommendations, analysing our A/B tests, or analysing our client crashes. Most importantly, it is used to calculate the royalties paid to artists based on generated streams.
Data observability. HOW?
Make sure your data is discoverable
Annotating your data is the key to avoiding piles of mess!!!
Upsides:
➢ Other people can use/find your data
➢ Sensitive data in the dataset? Encrypt based on annotations
➢ Easy mapping in your code like schema <-> case class
➢ Easier to find which key to join on
Downsides:
➢ you have to do it. Once
Schema Example
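A minimal sketch of what such an annotated schema might look like, assuming a hypothetical Scala case-class mapping; the annotation names and fields below are illustrative, not Spotify's actual annotation system:

```scala
import scala.annotation.StaticAnnotation

// Illustrative only: a hypothetical annotated event schema mapped onto a Scala case class
// (schema <-> case class). Field-level annotations are what let tooling discover the
// dataset, encrypt sensitive fields, and point out which key to join on.
final class sensitive extends StaticAnnotation // pseudonymize/encrypt this field
final class joinKey   extends StaticAnnotation // preferred key to join this dataset on

final case class StreamEvent(
  @joinKey   userId: String,    // pseudonymized user identifier
  @sensitive ipAddress: String, // sensitive: encrypted based on the annotation
  trackId: String,
  timestamp: Long               // epoch millis, drives hourly partitioning
)
```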
Monitor your pipelines. Execution time
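A minimal sketch of execution-time monitoring, assuming hypothetical `runPipeline` and `alert` hooks rather than any specific scheduler: time the run and alert when it drifts past the expected duration.

```scala
import java.time.Duration

// Illustrative sketch: time a pipeline run and alert when it exceeds the expected duration.
// `runPipeline` and `alert` are stand-ins for whatever your scheduler/monitoring provides.
object ExecutionTimeMonitor {
  def timed[A](expected: Duration)(runPipeline: => A)(alert: Duration => Unit): A = {
    val start  = System.nanoTime()
    val result = runPipeline
    val took   = Duration.ofNanos(System.nanoTime() - start)
    if (took.compareTo(expected) > 0) alert(took) // e.g. page on-call or mark the run on a dashboard
    result
  }
}
```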
Monitor your pipelines. Count it!
Never produce corrupt data! Implement as many sanity checks as possible.
Example: your pipeline encrypts each row in the dataset based on its user_id, and falls back to a random key otherwise (impossible to decrypt).
Count it: count the percentage of rows where the user_id could not be found or parsed, and alert if it increases by more than, say, 10%.
Sanity check your data and alert!
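A minimal sketch of such a counter, in plain Scala over an in-memory collection; in a real Dataflow/Scio job these would be pipeline counters, and the `Row` shape and 10% default are assumptions.

```scala
// Illustrative sketch: count the share of rows whose user_id could not be found or parsed,
// and alert above a threshold. In a real pipeline these would be counters on the job
// rather than an in-memory collection.
final case class Row(userId: Option[String], payload: String)

object SanityCheck {
  def unparsedRatio(rows: Seq[Row]): Double =
    if (rows.isEmpty) 0.0
    else rows.count(_.userId.isEmpty).toDouble / rows.size

  def checkAndAlert(rows: Seq[Row], threshold: Double = 0.10)(alert: Double => Unit): Unit = {
    val ratio = unparsedRatio(rows)
    if (ratio > threshold) alert(ratio) // e.g. fail the job before publishing the partition
  }
}
```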
Monitor your pipelines. Money
● GCP Dataflow
● GCP BigQuery
● GCP Cloud Storage
Monitor your pipelines. Money. Real incident
[Chart: spend ($) over time, with a visible spike during the incident.]
Monitor your pipelines. Money. Per System
Taking GDPR as an example: how much does it cost to serve a “Download your data” request, and how much should go into it?
What to monitor:
● How much do we pay for every request?
● Of that cost, what is the cost of every pipeline that contributes to it?
● How many requests do people actually open?
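A back-of-the-envelope sketch of the first two questions, with entirely made-up pipeline names and cost numbers: sum the contributing costs and divide by the requests actually opened.

```scala
// Back-of-the-envelope sketch with made-up numbers: the monthly cost of every pipeline
// that contributes to serving a "Download your data" request, divided by the number of
// requests people actually opened that month.
object CostPerRequest {
  val contributingCosts: Map[String, Double] = Map( // hypothetical monthly costs in $
    "dataflow-export-job" -> 1200.0,
    "bigquery-scans"      -> 800.0,
    "gcs-storage"         -> 400.0
  )

  def costPerRequest(requestsOpened: Long): Double =
    contributingCosts.values.sum / requestsOpened.max(1)

  def main(args: Array[String]): Unit =
    println(f"Cost per request: $$${costPerRequest(requestsOpened = 500)}%.2f")
}
```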
Set up retention! Storage is ⅓ of the cost
● Set up a default retention: remove partitions after the expiration date
● Profile the storage: can cold storage be used (cheap to store, expensive to access)?
● It adds up: multi-regional vs regional buckets. Where is the data accessed from?
● How is the data used: BigQuery or pipelines?
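A minimal sketch of the first bullet, assuming hourly partitions identified by their start timestamp; in practice this is usually a GCS lifecycle rule or BigQuery partition expiration rather than custom code.

```scala
import java.time.{Duration, Instant}

// Sketch only: apply a default retention to hourly partitions by listing the partitions
// whose start timestamp is older than the expiration window.
object Retention {
  def expiredPartitions(
      partitionStarts: Seq[Instant], // start timestamp of each hourly partition
      retention: Duration,           // e.g. Duration.ofDays(90)
      now: Instant = Instant.now()
  ): Seq[Instant] =
    partitionStarts.filter(_.isBefore(now.minus(retention)))
}
```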
Monitor your pipelines. Alerts on failures
SLAs for your partitions
We have a concept of low, normal and high priority for events, which gives us different SLAs for different events depending on importance (6h, 24h, 72h). Thanks to that we know which events to recover first when shit hits the fan. It also made our lives better: normal-priority events will not alert during nights, and low-priority events will not alert during weekends.
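A minimal sketch of that paging policy: the priority-to-SLA mapping follows the slide, while the night and weekend windows are assumptions.

```scala
import java.time.{DayOfWeek, Duration, ZonedDateTime}

// Illustrative encoding of the paging policy above: each priority has its own SLA,
// normal-priority breaches do not page at night, low-priority breaches do not page
// on weekends. The night hours are an assumption; the SLAs follow the slide.
sealed trait Priority { def sla: Duration }
case object High   extends Priority { val sla: Duration = Duration.ofHours(6)  }
case object Normal extends Priority { val sla: Duration = Duration.ofHours(24) }
case object Low    extends Priority { val sla: Duration = Duration.ofHours(72) }

object SlaAlerting {
  private def isNight(now: ZonedDateTime): Boolean   = now.getHour < 8 || now.getHour >= 22
  private def isWeekend(now: ZonedDateTime): Boolean =
    now.getDayOfWeek == DayOfWeek.SATURDAY || now.getDayOfWeek == DayOfWeek.SUNDAY

  /** Page right now only if the partition's delay has breached its priority's SLA. */
  def shouldPage(priority: Priority, delay: Duration, now: ZonedDateTime): Boolean = {
    val breached = delay.compareTo(priority.sla) > 0
    breached && (priority match {
      case High   => true            // always page
      case Normal => !isNight(now)   // wait until morning
      case Low    => !isWeekend(now) // wait until Monday
    })
  }
}
```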
Dashboards!
Does your infra lose the BCD?
Business-critical data: royalty calculations, user accounts, ads, etc.
➔ High SLO
➔ “Special treatment” when recovering from an incident
➔ ...and special observability, since the volume of “BCD” events is limited
How to prove that no data is lost?
[Diagram: SDK → Service → Receiver Service → Pub/Sub → job that makes hourly partitions, dedups and anonymizes → hourly partitions.]
Who is watching the watcher
[Diagram: the same delivery path as above, plus a streaming job counting NACKed and rejected events into a Counting Service, whose per-hour totals are compared against the hourly partitions.]
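A minimal sketch of the final “compare” step, assuming both the counting service and the hourly partitions can be summarized as per-hour counts; names and the tolerance are illustrative.

```scala
// Sketch of the completeness check: the independent counting service reports how many
// events were accepted per hour (minus NACKed/rejected ones), and we check that each
// hourly partition actually contains that many rows.
object WatchTheWatcher {
  final case class HourlyGap(expected: Long, delivered: Long) {
    def missing: Long = expected - delivered
  }

  /** Hours whose delivered row count falls short of the independently counted total. */
  def incompleteHours(
      countingService: Map[String, Long], // hour -> events the counting service saw accepted
      partitions: Map[String, Long],      // hour -> rows present in the hourly partition
      tolerance: Long = 0L
  ): Map[String, HourlyGap] =
    countingService.flatMap { case (hour, expected) =>
      val delivered = partitions.getOrElse(hour, 0L)
      if (expected - delivered > tolerance) Some(hour -> HourlyGap(expected, delivered))
      else None
    }
}
```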
Data observability. Bottom line
Why bother?
We use a lot of data, and processing and storing it is EXPENSIVE. How much profit does it bring, though?
Thank you!
