RedisConf18 - Techniques for Synchronizing In-Memory Caches with Redis
The document describes techniques for synchronizing in-memory caches across multiple web servers using Redis. It discusses the problems with traditional in-process caches, such as data lag and cross-server inconsistency. The proposed solution uses Redis as the source of truth, with each web server maintaining an in-process cache. Keys are partitioned into hash slots, and updates are announced by publishing the 16-bit slot ID via Redis pub/sub. When a key is requested, the server checks whether its cached copy is stale by comparing timestamps. This approach maintains consistency while minimizing network usage.
Cache Data Lag
Data stored in the in-process caches lags behind the source of truth.
• Bad user experience
• The usual solution is to shorten cache expiration times, but that's just a trade-off
• Doesn't eliminate the problem, only reduces the length of the lag
• Shorter cache times mean more database hits
• What would be great is a push notification from the source of truth
• But it's not straightforward to implement push notifications from a SQL backend
• Easy to flood the network with sync traffic
Cross-Server Cache Data Inconsistency
Over time the absolute expiration time of any specific key falls out of sync, meaning the data changes depending on which box serves the request.
• Even worse user experience
• Another non-trivial issue to solve
• Need to implement two-way communication between all nodes
• Difficult to resolve who wins when multiple nodes update the same value at the same time
• Easy to flood network with sync messages
Cache Stampede
If every server has its own copy of cached data, then every server needs to refresh it, too
• Bad at process start-up, or if multiple servers have close expiration times
• Really bad during pooled deploys when an entire pool comes up
• Problem continues to grow as the number of servers increases
• Easy to overload the back end with requests (are you noticing a trend here?)
One Solution: Shared Redis Cache
[Diagram: Web Servers #1 through #4 all reading from and writing to a single shared Redis cache, backed by the SQL database]
Shared Redis cache
Classic Redis use case with lots of advantages:
• Solves the data consistency problem completely.
• Reduces cache data lag with a write-through cache implementation (but watch out for DBAs with ad-hoc scripts!)
• No more cache stampede at the database (only need to update Redis once, regardless of how many clients)
But…
• Now we have a TCP roundtrip per cache access
• While Redis is incredibly fast, local RAM access is still many times faster than network I/O plus a deserialization step
Oh, and By the Way…
1. Solve the data consistency issue
2. Eliminate any data lag between the in-process caches and Redis
3. Don't blow up the network!
We haven't been able to solve these problems effectively in the past, but what about now that Redis is part of the infrastructure?
Slowly the Pieces Fall Into Place…
First piece to fall into place is Redis Pub/Sub for inter-server communications
• Trivial to implement
• Just plain works
• Can be used both to synchronize nodes (maintain consistency) and to push changes (minimize lag) to the client
• But what data to send without saturating the network?
Solving the Data Consistency Problem
Approach #1: Broadcast all data changes to all nodes
• Yes, you will blow up the network
• Low efficiency: all nodes receive changed data even if they'll never use it
• Lots of challenges around ensuring all nodes have identical data when multiple nodes update the same key at near-identical times
Approach #2: Broadcast the key that changed to all nodes
• Less network traffic than sending key/value pairs
• Solves the consistency issue of broadcasting values, because we're just telling nodes to hit Redis the next time the key is accessed
• But Redis' flexibility works against us here: keys can be up to 512MB, opening up the possibility of blowing up the network just broadcasting keys
Approach #3: Instead of broadcasting keys to all nodes, why not partition keys into 16,384 buckets and just broadcast the 16-bit bucket ID?
• Inspired by the Redis Cluster hash slot implementation
• Short, fixed-size synchronization messages, regardless of key size
• No value synchronization issues; just tell each client to hit Redis the next time a key matching the hash slot is requested
• Now, just need to implement ;-) (a sketch of the slot calculation follows)
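The deck doesn't show the hashing code itself, but the Redis Cluster scheme it borrows is public: CRC16 of the key, modulo 16384. A minimal Python sketch of that calculation (function names are mine, and Redis Cluster's {hash tag} handling is omitted):

    def crc16(data: bytes) -> int:
        """CRC16-CCITT (XMODEM), the variant Redis Cluster uses for keys."""
        crc = 0
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
                crc &= 0xFFFF
        return crc

    def hash_slot(key: str) -> int:
        """Map a key of any length to one of 16,384 buckets (a 16-bit ID)."""
        return crc16(key.encode("utf-8")) % 16384

However long the key, the sync message is just hash_slot(key), a number between 0 and 16,383.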
Still, there’s alot of things to solve:
• Since we’ve grouped keys together by their hash slot, when
we need to evict a key we actually will evict all keys sharing
the same hash slot
• An obvious solution would be to evict all values from the
local cache whose key falls in the same hash slot
• But that’s not practical, would have to scan all cache keys and
calculate their hash slot
Implementation Challenges
Paylocity's Solution
The approach Paylocity arrived at has three main features:
• A dictionary of hash slots and the timestamp when a key in that hash slot was last updated (the lastUpdated dictionary)
• Items written to the in-process cache include the key's hash slot and the timestamp when the object was written to the in-process cache
• Whenever a value is updated, a sync message containing the updated hash slot is published via Redis pub/sub
Example contents of the two structures:

In-Process Cache

Key                HashSlot  Timestamp  Value
App:Employee:1736  14587     150938476  <object>
App:Employee:2367  1228      163827634  <object>
App:Employee:3123  9036      180985776  <object>
App:Employee:4273  1231      179872198  <object>

lastUpdated Dictionary

HashSlot  Timestamp
1227      173658476
1228      163827634
1229      163928374
1230      180028372
⋮         ⋮
More Implementation Details
Additionally, a Redis pub/sub message handler listens for synchronization messages (a sketch follows)
• Whenever a sync message is received, the hash slot entry in the lastUpdated dictionary is updated with the current timestamp
• When retrieving data from the in-process cache, compare the timestamp in the cache entry with the timestamp in the lastUpdated dictionary. If the lastUpdated timestamp is greater than the cache entry's, the entry is out-of-date and should be discarded.
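A minimal sketch of that handler and staleness check, continuing the structures above. monotonic_ticks() is my stand-in for the Stopwatch shown later in the deck, and redis-py delivers pub/sub messages as dicts with a "data" field:

    import time

    def monotonic_ticks() -> int:
        # Stand-in for a Stopwatch: a monotonically increasing tick count.
        return time.monotonic_ns()

    def handle_sync_message(message: dict) -> None:
        """Pub/sub callback: mark the announced hash slot as just-updated."""
        slot = int(message["data"])
        with last_updated_lock:
            last_updated[slot] = monotonic_ticks()

    def is_stale(entry: CacheEntry) -> bool:
        """Stale if the slot saw an update after this entry was written."""
        with last_updated_lock:
            slot_ts = last_updated.get(entry.hash_slot, 0)
        return slot_ts > entry.timestamp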
Here’s the flow in the end:
More Implementation Details
Reading a value from the cache
1. Calculate the key's hash slot.
2. Does the key exist in the in-process cache?
• No: go to step 4.
• Yes: read the entry from the in-process cache and read the hash slot's timestamp from the lastUpdated dictionary.
3. Is the lastUpdated dictionary timestamp greater than the cache entry timestamp?
• No: the entry is current; return the value to the client.
• Yes: the entry is stale; go to step 4.
4. Read the value from Redis, update the timestamp in the lastUpdated dictionary, write the entry to the in-process cache, and return the value to the client.
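The read path above as code, continuing the earlier sketches (redis-py for the Redis access; serialization is elided):

    import redis

    r = redis.Redis()  # connection details are deployment-specific

    def cache_get(key: str):
        entry = in_process_cache.get(key)
        if entry is not None and not is_stale(entry):
            return entry.value              # local hit: no network roundtrip

        # Miss or stale entry: Redis is the source of truth.
        slot = hash_slot(key)
        ts = monotonic_ticks()              # grab the timestamp before reading
        value = r.get(key)
        with last_updated_lock:
            # setdefault so a sync message received mid-read isn't clobbered;
            # the deck doesn't spell out this detail, so it is my choice.
            last_updated.setdefault(slot, ts)
        in_process_cache[key] = CacheEntry(slot, ts, value)
        return value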
Adding a value to the cache
1. Calculate the key's hash slot.
2. Get the current timestamp.
3. Does the hash slot exist in the lastUpdated dictionary? If not, add the timestamp to the lastUpdated dictionary.
4. Write the entry (with timestamp and hash slot) to the in-process cache.
5. Write the key/value to Redis.
6. Publish an update message to all clients.
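And the write path, again as a sketch. The channel name is hypothetical; note that Redis is updated before the message is published, an ordering the upcoming slides justify:

    SYNC_CHANNEL = "cache:sync"   # hypothetical channel name

    def cache_add(key: str, value) -> None:
        slot = hash_slot(key)
        ts = monotonic_ticks()            # timestamp taken before any write
        with last_updated_lock:
            last_updated.setdefault(slot, ts)   # added only if slot is missing
        in_process_cache[key] = CacheEntry(slot, ts, value)
        r.set(key, value)                 # update the source of truth first...
        r.publish(SYNC_CHANNEL, slot)     # ...then announce which slot changed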
Still Some Timing Issues Remain
Most scenarios are solved by broadcasting only invalidations (hash slots, not values) and using Redis as the single source of truth for the cache, but not all:
• Trickiest situation occurs when a node receives an invalidation message just after it writes to Redis. Who actually won?
• Must prevent a state where a local node believes it has the correct data in its in-process cache, but actually doesn't
• Could implement a master clock, but that introduces a bottleneck as well as a single point of failure
• Could use a distributed lock algorithm like Redlock
Order of Operation
Instead of distributed locks or a master clock, exploit order of operations
• Leverage the fact that Redis is the source of truth for this cache
• Deceptively simple, handles high concurrency
• Update Redis, then publish the sync message
• No possibility of a client being notified before the Redis value was updated
• Always grab the current timestamp before writing to Redis, the in-process cache, or the lastUpdated dictionary
• Eliminates the possibility that we store a timestamp that's more recent than the actual time we wrote the value
Still, one corner case exists…
One Last Corner Case…
[Sequence diagram: between two increments of the Stopwatch timestamp, the app calls Add(key, value) on the RedisMultilevelCacheClient, the client calls GetTimestamp(), a HandleSyncMessage() arrives from a competing writer, and the client updates Redis, the in-process cache, and the expiration dictionary, all within a single timer tick]
…With an Ultimately Simple Solution
The underlying problem is that we're using an incrementing timestamp to determine the order in which operations occurred
• Can't measure something smaller than the resolution of the measuring device!
• Easier to visualize if you imagine the timer resolution to be a minute
• In practice, not an issue, because it would require a Redis write, sync message publish, and sync message handling within ~300 nanoseconds
• But still, don't want to leave known issues open
In the end, just need to subtract 1 from the timestamp obtained at the start of the operation (!!!)
• This effectively forces the client to immediately re-read the value from Redis in cases where the timer resolution prevents us from actually determining which operation occurred first (see the sketch below)
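In code terms, continuing the earlier write-path sketch, the fix is a single subtraction:

    def cache_add(key: str, value) -> None:
        slot = hash_slot(key)
        # Back-date by one tick: if a competing write and its sync message
        # land within the same timer tick, this entry compares as stale and
        # the next read falls through to Redis rather than trusting it.
        ts = monotonic_ticks() - 1
        with last_updated_lock:
            last_updated.setdefault(slot, ts)
        in_process_cache[key] = CacheEntry(slot, ts, value)
        r.set(key, value)
        r.publish(SYNC_CHANNEL, slot)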
And a Few Final Optimizations
• Early testing revealed many more hits to Redis than predicted
• Root cause was that clients were processing the update pub/sub message they themselves published, causing them to re-read the value they had just written to Redis
• Solution was to have the pub/sub handler ignore messages that originated from the same cache provider instance
• Updating Redis consists of executing two Redis commands, one to update the value and a second to notify other clients of the change
• But we don't want to incur two TCP roundtrips
• Lua scripting to the rescue! (see the sketch below)
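A sketch of both optimizations with redis-py (the instance ID, channel, and message format are mine, not the deck's). The Lua script folds the SET and the PUBLISH into a single roundtrip, and a per-instance ID lets the handler drop its own messages:

    import json
    import uuid

    INSTANCE_ID = str(uuid.uuid4())   # identifies this cache provider instance

    # One network roundtrip instead of two: SET the value and PUBLISH the
    # notification inside a single server-side Lua script.
    set_and_publish = r.register_script("""
        redis.call('SET', KEYS[1], ARGV[1])
        redis.call('PUBLISH', ARGV[2], ARGV[3])
    """)

    def publish_update(key: str, value, slot: int) -> None:
        payload = json.dumps({"slot": slot, "sender": INSTANCE_ID})
        set_and_publish(keys=[key], args=[value, SYNC_CHANNEL, payload])

    def handle_sync_message(message: dict) -> None:   # revised handler
        payload = json.loads(message["data"])
        if payload["sender"] == INSTANCE_ID:
            return                    # our own write; the local copy is fresh
        with last_updated_lock:
            last_updated[int(payload["slot"])] = monotonic_ticks()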
Unresolved Concerns and Future Plans
• Potential for cache thrashing, since groups of keys are evicted by hash slot
• So far no issues at Paylocity with this
• Could expand the number of hash slots to better separate keys
• Currently we don't handle a lost sync message well
• The node essentially operates as an independent in-process cache
• Leverage Redis Pub/Sub to collect and publish client hit/miss metrics
• Not too difficult to get this data into the ELK stack
• Implement XFETCH to optimize cache reloads
• Support more Redis data types!