How a Cache Becomes a Database (And Why You Really Don’t Want This to Happen)
Are you on the right road? Adding feature after feature can create a situation where the only thing reliably stored is technical debt.
BY David Rolfe
There’s an old Irish joke in which a traveler asks a local for directions, and the response is “I wouldn’t start from here if I were you!” Sometimes a series of entirely reasonable decisions leads to an entirely unreasonable situation, where the best course of action turns out to be starting again, from somewhere else.
Here at Volt Active Data we see a clear anti-pattern emerging: people struggling to use products that evolved from caches as highly available enterprise key-value stores for volatile data. One day they wake up and realize that they have an expensive and unmanageable monster on their hands.
Like many things in life, the key question you need to ask yourself is:
“Where does this journey end?”
If you are using an evolved cache and are happy, then that’s great! But if you need to make changes, we would argue that you are much better off thinking about where you actually want to end up, and then working backwards, than you are repeatedly adding ‘one more thing’ to your cache. Whether Volt is something you should consider depends on your circumstances, and while we’re happy to talk to you, we’re not going to blindly recommend ourselves to you. Below I describe the journey I’ve seen people go on.
SIDEBAR: The Cache-Turned-Database Issue Explained
Caching is easy with large, nearly static data sets, but very difficult with volatile (i.e., rapidly changing) data. Companies typically start out with a simple, harmless cache, then slowly add more functionality until the cache ends up looking a lot like a database, because it is a database. Sometimes the result fails to meet the new need, and sometimes it meets the need only by creating technical debt. Once a product or API is in widespread use, radical, non-additive change is really hard to do.
JSR107, aka “javax cache”, is a standard for cache implementations. A number of products implement or shadow JSR107, such as:
Hazelcast
GridGain
Oracle Coherence
Terracotta Ehcache
Infinispan
If you read the list above you will see that not all of these are caches. In fact, many of these products appear to be solving a market need for something that looks like a cache but is really something else. While you don’t need JSR107 to build a cache, any cache you do create will hit the same issues if you follow the same evolutionary path. For the purposes of this article, we’ll use JSR107 as a basis for discussion.
1. You start out with a simple cache
Imagine you started with a basic cache with support for ‘Get’ and ‘Put’ operations. You needed to store data in local RAM so you could access it quickly, but didn’t want to mess with low-level code. Why couldn’t you just use an abstract cache that loaded data lazily? You could, so you implemented one.
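As a rough illustration of this starting point, here is a minimal sketch of a lazily loading cache. It is written in Python for brevity and is not the JSR107 API; the class name LazyCache and the slow_lookup function standing in for the back-end system are assumptions made for the example.

```python
from typing import Callable, Dict, Generic, List, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class LazyCache(Generic[K, V]):
    """Minimal in-memory cache that loads a missing entry on first access."""

    def __init__(self, loader: Callable[[K], V]) -> None:
        self._loader = loader  # hypothetical call into the slow back-end system
        self._entries: Dict[K, V] = {}

    def get(self, key: K) -> V:
        # Only go to the back end on a miss; later reads are served from RAM.
        if key not in self._entries:
            self._entries[key] = self._loader(key)
        return self._entries[key]

    def put(self, key: K, value: V) -> None:
        self._entries[key] = value


backend_calls: List[str] = []


def slow_lookup(key: str) -> str:
    """Stand-in for an expensive back-end read; records each call it receives."""
    backend_calls.append(key)
    return key.upper()


lazy_cache = LazyCache(slow_lookup)
lazy_cache.get("alice")
lazy_cache.get("alice")  # second read does not touch the back end
```

At this stage the cache holds nothing the back end does not also hold, so losing it costs time but not data.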
2. You add local storage for when you need to restart
Your basic cache worked really well, and became an essential part of the system, as the back-end system couldn’t support the read workload. This created a problem: if you needed to restart such a cache while the system was running, it would take ages for the cache to refill, and you’d see performance problems in the meantime.
The obvious solution is a cache that stores its contents on a local disk, which means it can recover gracefully from a restart. This is why JSR107 has methods like ‘close’ and ‘clear’.
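One way to picture this step is a cache that writes a snapshot to a local file when it is closed and reloads it on startup. The sketch below is an assumption-laden illustration, not how any JSR107 product does it: the PersistentCache class, the JSON snapshot format, and the file name are all made up for the example, and keys must be strings because of JSON.

```python
import json
import os
import tempfile


class PersistentCache:
    """Cache that snapshots its entries to a local file so a restart starts warm."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._entries = {}
        # Reload the previous snapshot, if there is one.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as snapshot_file:
                self._entries = json.load(snapshot_file)

    def get(self, key: str, default=None):
        return self._entries.get(key, default)

    def put(self, key: str, value) -> None:
        self._entries[key] = value

    def close(self) -> None:
        # Write to a temporary file and rename it, so a crash in the middle
        # of the write cannot leave a half-written snapshot behind.
        tmp_path = self._path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as snapshot_file:
            json.dump(self._entries, snapshot_file)
        os.replace(tmp_path, self._path)


snapshot_path = os.path.join(tempfile.mkdtemp(), "cache.json")
before_restart = PersistentCache(snapshot_path)
before_restart.put("user:1", {"name": "Ada"})
before_restart.close()

after_restart = PersistentCache(snapshot_path)  # simulates the restarted process
```

Note that this sketch only saves on a clean close; anything written since the last snapshot is lost in a crash, which is exactly the kind of gap that invites the next feature request.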
3. You need to persist changes, so you turn it into a “write through” cache
The next ‘ask’ is a ‘write through’ capability. You need to be able to change the cache and then flush the changes to a back-end system, because if you update the back-end system directly, anyone reading the cache gets stale information until you reload it. At this stage, your cache is already starting to look a bit like a database, especially if your “write through” functionality works even when the back-end system is down.
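The write-through idea, including the case where the back end is down, could be sketched roughly as follows. Everything here is hypothetical: FlakyBackend stands in for the real back-end system, and queuing failed writes in memory is just one simple way to keep accepting changes during an outage.

```python
from collections import deque


class BackendDown(Exception):
    """Raised by the hypothetical back end when it cannot accept writes."""


class FlakyBackend:
    """Stand-in for the real back-end system; can be switched off for the demo."""

    def __init__(self) -> None:
        self.rows = {}
        self.up = True

    def write(self, key, value) -> None:
        if not self.up:
            raise BackendDown(key)
        self.rows[key] = value


class WriteThroughCache:
    def __init__(self, backend: FlakyBackend) -> None:
        self._backend = backend
        self._entries = {}
        self._pending = deque()  # writes not yet accepted by the back end

    def get(self, key):
        return self._entries.get(key)

    def put(self, key, value) -> None:
        self._entries[key] = value  # readers see the new value immediately
        self._pending.append((key, value))
        self.flush()

    def flush(self) -> bool:
        # Replay queued writes in order; stop at the first failure and keep the rest.
        while self._pending:
            key, value = self._pending[0]
            try:
                self._backend.write(key, value)
            except BackendDown:
                return False
            self._pending.popleft()
        return True

    def pending_count(self) -> int:
        return len(self._pending)


backend = FlakyBackend()
wt_cache = WriteThroughCache(backend)
wt_cache.put("balance:42", 100)  # back end is up, so this goes straight through
backend.up = False
wt_cache.put("balance:42", 80)   # back end is down, so this write is queued
```

While the back end is down, the value 80 exists only in the cache. At that moment the cache, not the back end, is the system of record, which is the point at which it starts behaving like a database.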
4. Your cache is now mission-critical, so you make it highly available
The next ‘ask’ is for high availability, because what you’re doing is now too important to rely on a single server. The obvious solution is to create an active-passive pair of servers, generally with mixed results. You can read from the passive node, but you can’t write to it, which brings you to your next big change: scaling writes. Now you really need to rethink things.
5. Your cache is so busy with writes it can’t fit on a single server
At some point, you need to provide a cluster of ‘cache’ servers to support your ever-increasing write workload. This unleashes a tide of complexity under the covers, as who is allowed to modify what and where becomes a major issue. This is a complex problem that has taxed major database vendors and is one of the reasons Volt was invented.
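To make the "who is allowed to modify what and where" question concrete, here is a minimal sketch of one common answer: give every key exactly one owning node, chosen by hashing the key. The Cluster class and the owner_of function are hypothetical, the nodes are plain in-process dictionaries rather than servers, and this is not a description of how Volt or any JSR107 product partitions data.

```python
import hashlib


def owner_of(key: str, node_count: int) -> int:
    """Map a key to the index of the single node allowed to modify it."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


class Cluster:
    """Toy cluster: each 'node' is a dict, and every key has exactly one owner."""

    def __init__(self, node_count: int) -> None:
        self.nodes = [dict() for _ in range(node_count)]

    def put(self, key: str, value) -> None:
        self.nodes[owner_of(key, len(self.nodes))][key] = value

    def get(self, key: str):
        return self.nodes[owner_of(key, len(self.nodes))].get(key)


cluster = Cluster(node_count=3)
for i in range(100):
    cluster.put(f"session:{i}", i)
```

Even this toy version hints at the complexity: with simple modulo placement, changing the number of nodes moves most keys to a different owner, and nothing here handles a node failing or an operation that touches keys owned by different nodes.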
6. You add “change data capture” to your cache
Eventually, as you add use cases, a requirement for a change listener appears. Change listeners allow you to listen in to changes made by other people. They are the kind of thing that looks simple on a whiteboard but can be an implementation and scaling nightmare. At this point, you should be questioning the value of rolling your own solution.
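The whiteboard version of a change listener really is simple, as this sketch shows. The ObservableCache class and its callback signature are assumptions made for the example and are not the JSR107 listener interfaces.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# A listener receives the key, the previous value (None if absent), and the new value.
Listener = Callable[[str, Optional[Any], Any], None]


class ObservableCache:
    def __init__(self) -> None:
        self._entries: Dict[str, Any] = {}
        self._listeners: List[Listener] = []

    def add_listener(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def put(self, key: str, value: Any) -> None:
        old_value = self._entries.get(key)
        self._entries[key] = value
        # Listeners run synchronously, so one slow listener delays every writer.
        for listener in self._listeners:
            listener(key, old_value, value)


events: List[Tuple[str, Optional[Any], Any]] = []
observed = ObservableCache()
observed.add_listener(lambda key, old, new: events.append((key, old, new)))
observed.put("price:XYZ", 10)
observed.put("price:XYZ", 12)
```

The hard parts are everything this sketch leaves out: delivering events in order across a cluster, deciding what happens when a listener is slow or offline, and avoiding duplicate or missed notifications after a failover.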
7. Your cache has problems with write contention
In the real world, issues appear when multiple people try to update the same thing at the same time and overwrite each other’s work. The JSR107 spec addresses this with a ‘replace’ method where you pass the old record in as well as the new one, and only proceed if your old one is still in place. This is a poor man’s row-level lock, in that it prevents people from corrupting the data at the expense of forcing them to retry multiple times. This shows up as long-tail latency in performance-sensitive applications, and there is also the cost of sending multiple copies of a potentially large object across the wire and doing an expensive byte-by-byte comparison at the other end.
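The compare-then-replace pattern described above can be sketched as follows. This is a Python illustration of the idea rather than the JSR107 method itself; the CasCache class, the add_credit helper, and the thread counts are assumptions chosen for the example.

```python
import threading


class CasCache:
    """Cache with a compare-and-replace operation guarded by a single lock."""

    def __init__(self) -> None:
        self._entries = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._entries.get(key)

    def put(self, key, value) -> None:
        with self._lock:
            self._entries[key] = value

    def replace(self, key, old_value, new_value) -> bool:
        # Succeeds only if the entry still holds the value the caller read.
        with self._lock:
            if key in self._entries and self._entries[key] == old_value:
                self._entries[key] = new_value
                return True
            return False


def add_credit(cache: CasCache, key: str, amount: int) -> int:
    """Read-modify-write loop; returns how many attempts were needed."""
    attempts = 0
    while True:
        attempts += 1
        current = cache.get(key)
        if cache.replace(key, current, current + amount):
            return attempts


def worker(cache: CasCache, key: str, repeats: int) -> None:
    for _ in range(repeats):
        add_credit(cache, key, 1)


cas_cache = CasCache()
cas_cache.put("credit:7", 0)
threads = [threading.Thread(target=worker, args=(cas_cache, "credit:7", 500))
           for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

No update is lost, but every failed replace is a wasted round trip, and under contention the number of attempts per update is unbounded. In a networked cache each attempt also ships the old and new values across the wire.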
The last step of this evolution attempts to solve this problem by introducing the concept of ‘invoking’ chunks of code on the server itself, instead of reading the data, changing it and sending it back. This is done using an implementation of EntryProcessor. This avoids the retry and data-transfer problems described above, but at the expense of significant complexity.
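The idea of sending the code to the data can be sketched like this. It is a simplified Python analogue of the concept, not the JSR107 EntryProcessor interface; the InvokingCache class, its invoke method, and the debit function are hypothetical, and a single lock stands in for whatever per-entry locking a real server would use.

```python
import threading


class InvokingCache:
    """Cache that runs caller-supplied logic next to the data, under a lock."""

    def __init__(self) -> None:
        self._entries = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._entries.get(key)

    def put(self, key, value) -> None:
        with self._lock:
            self._entries[key] = value

    def invoke(self, key, processor, *args):
        """Apply processor(current_value, *args) and store the result.

        If the processor raises, the entry is left unchanged.
        """
        with self._lock:
            new_value = processor(self._entries.get(key), *args)
            self._entries[key] = new_value
            return new_value


def debit(balance, amount):
    """Example processor: refuse to take an account below zero."""
    if balance is None or balance < amount:
        raise ValueError("insufficient funds")
    return balance - amount


accounts = InvokingCache()
accounts.put("acct:1", 100)
remaining = accounts.invoke("acct:1", debit, 30)
```

The caller sends only the amount, there is no retry loop, and the check and the update happen as one step. The cost is that business logic now lives inside the cache server, which has to be deployed, versioned, and secured there, and that looks a lot like a database running stored procedures.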
8. The end state: your cache is broken in many ways…
At this point, you have something that has the API of a cache but is very clearly not a cache. In fact, the requirements closely match what you’d need for an enterprise data platform. The pain of making such a transition is real, but so is the pain of making an over-extended and over-architected cache work.
9. What now?
If you recognize yourself on this caching journey, you should probably stop and figure out your desired end state before you do anything else. If you want to try us, we have a running implementation of a JSR107-compliant cache you can evaluate.
Note that we wouldn’t suggest this is the ideal way to deploy and use Volt, but it provides a way for people to rapidly evaluate Volt Active Data against workloads that are being run on JSR107-related ‘variations on the theme of cache’. Contact us for further information.
……………………………………….
David Rolfe brings 20+ years of experience managing data in the telecom industry. David helps telecom software vendors meet the scale and latency requirements imposed by 5G data utilizing Volt Active Data.
He helps companies take the steps they need to deploy mass-scale, ultra-low latency transactional applications in cloud-native environments. He has over 25 years of experience with high-performance databases and telco systems and demonstrated expertise with charging and policy systems. He has authored multiple patents relating to geo-replicated conflict resolution.
Sponsored by Volt Active Data