VoltDB 6.4 Passes Official Jepsen Testing
VoltDB hired Kyle Kingsbury, creator of the Jepsen tests, to build a new, stronger Jepsen test specifically for VoltDB. We promise strong serializability in a distributed database, a stronger promise than almost any other system makes, and we’ve been working with Kingsbury to validate that promise.
What is Jepsen Testing?
Jepsen is an effort to improve the safety of distributed databases, queues, consensus systems, etc. It encompasses a software library for systems testing, as well as blog posts, and conference talks exploring particular systems’ failure modes. In each post we explore whether the system lives up to its documentation’s claims, file new bugs, and suggest recommendations for operators.
- VoltDB 6.4 has passed official Jepsen testing performed by Kyle Kingsbury, Jepsen’s creator.
- VoltDB 6.4 has passed more stringent testing than any other system Jepsen has tested.
- Jepsen found several issues in VoltDB 6.3 and we fixed every one. Our tolerance for these bugs is zero.
- We have integrated Jepsen testing into our automated testing for each VoltDB build. We have made, and will continue to make, our Jepsen and non-Jepsen tests stronger and better.
- We stake our reputation on correctness, consistency, and safety.
You can read Kingsbury’s detailed post on his experiences here: https://aphyr.com/posts/331-jepsen-voltdb-6-3
The most important quote:
VoltDB’s pre-6.4 development builds have now passed all the original Jepsen tests, as well as more aggressive elaborations on their themes. Version 6.4 appears to provide strong serializability: the strongest safety invariant of any system we’ve tested thus far.
Jepsen has proven its value as a tool in any distributed system tester’s arsenal, and multiple people have asked us about Jepsen and VoltDB specifically. Jepsen is both famous and notorious in the database industry for finding undiscovered problems with distributed systems. There are few tests of mettle as recognizable as Jepsen in our community.
While we had been planning to do the testing ourselves, we understood that nothing we did would have the same credibility as a test run by Kyle Kingsbury himself, creator of Jepsen and embarrasser of databases. When Kingsbury started his Jepsen-for-hire business last fall, we immediately got in line, and over the past two months, we’ve been working closely with him as he tested VoltDB.
The Most Stringent Jepsen Tests So Far
The VoltDB default consistency setting is Strong Serializability. This combines the ACID properties of serializable transactions (every transaction appears to happen in some global order) with CP-in-CAP-style linearizability (operations all happen essentially in the order the client sends them). Peter Bailis, notable database researcher and professor at Stanford University, has a blog post on the difference for those who want technical detail: Linearizability versus Serializability.
And, we’d like to point out, conventional wisdom is that this kind of consistency is too expensive, and you have to accept less (often much less) in order to scale. VoltDB manages to be fast and consistent by leveraging smart design and by specifically making some tradeoffs about what applications can do. You can read more about this on the VoltDB website: Reasons Behind the VoltDB Architecture.
As Kingsbury discusses in his post, VoltDB was run not only against Jepsen tests that look for linearizability faults, but also was run against Jepsen’s new multi-key linearizability tests. This tests VoltDB’s multi-statement, multi-key transactions for strong serializability. These tests don’t even apply to other systems with multi-key transactions because they require linearizability on top of serializability (strong serializability). Since this is a thing few other distributed databases promise (none?), it’s a test that only really applies to VoltDB.
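The gap between plain serializability and strong serializability can be made concrete with a toy checker. The sketch below is not Jepsen's checker or anything from VoltDB; it is a brute-force illustration, under the assumption of a single integer register and a handful of operations, and the names `Op`, `explainable`, `stale`, and `overlapping` are hypothetical. It searches for an ordering of the operations that explains every read, optionally requiring that the ordering also respects real time (an operation that completed before another began must come first).

```python
from itertools import permutations
from typing import NamedTuple


class Op(NamedTuple):
    kind: str     # "write" or "read"
    value: int    # value written, or value the read returned
    start: float  # time the client sent the request
    end: float    # time the client received the response


def is_legal_register_order(order, initial=0):
    """True if replaying the ops in this order on one register explains every read."""
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def respects_real_time(order):
    """True if no op is placed ahead of an op that completed before it began."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            # 'later' finished before 'earlier' even started, yet is ordered after it.
            if later.end < earlier.start:
                return False
    return True


def explainable(history, real_time):
    """Brute force: does any ordering explain the history?

    real_time=False asks only for some serial order (serializability-style).
    real_time=True also demands the real-time constraint (linearizability-style).
    """
    for order in permutations(history):
        if real_time and not respects_real_time(order):
            continue
        if is_legal_register_order(order):
            return True
    return False


# A stale read: the write of 1 completed at t=1, but a read issued at t=2 saw 0.
stale = [Op("write", 1, 0.0, 1.0), Op("read", 0, 2.0, 3.0)]

# The same read overlapping the write in time: either order is acceptable.
overlapping = [Op("write", 1, 0.0, 3.0), Op("read", 0, 1.0, 2.0)]

print(explainable(stale, real_time=False))        # True
print(explainable(stale, real_time=True))         # False
print(explainable(overlapping, real_time=True))   # True
```

The stale-read history passes when only some serial order is required (the read can be placed before the write) but fails once real-time order is enforced, which is the extra guarantee strong serializability adds. A real checker has to handle multi-key transactions and far larger histories, so exhaustive permutation search like this is for illustration only.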
Are the Issues Found Serious?
Jepsen found an issue in versions of VoltDB prior to 6.4 that could lead to stale reads or even dirty reads of uncommitted data under certain network partition scenarios. A user’s likelihood of encountering this issue is hard to predict and varies by application. If encountered, the seriousness of its effects varies depending on the application as well.
Jepsen also found two issues where writes could be lost under certain partition scenarios. These issues are more serious, but also easier to avoid because they are only possible to hit on uncommon deployment configurations. We have identified one production deployment out of hundreds we know about that is susceptible to these issues.
Of course, we consider all correctness and data loss bugs to be drop-everything-and-fix serious. If we start getting grey, weighing the likelihood of this and the impact of that, we start down a slippery slope. To our engineering team, it must be black and white.
To our users, it is less straightforward. Many users and apps will be unaffected by these issues, but their impact on others is less clear. We’ve already reached out to our customers and other users known to be in production with VoltDB.
We have additional in-depth detail available on a technical companion page focused on these issues:
- VoltDB single-partition read-only transactions can read stale or uncommitted data under certain network partition scenarios.
- In some uncommon cluster configurations, VoltDB can lose committed writes after a network partition.
If you have questions about how likely you are to hit these issues in your deployment and/or you are unable to update to 6.4, please reach out to VoltDB support at email@example.com.
Reproducible and Open
We believe having Kingsbury run these tests himself adds credibility to the results, but everything Kingsbury has done is reproducible and open. You can find the Jepsen driver for VoltDB at https://github.com/jepsen-io/voltdb, which allows you to run the full Jepsen tests described in Kingsbury’s blog post against a 30-day trial of VoltDB, or your own licensed copy.
We’d like to point out that this kind of reproducible testing is only possible because VoltDB is standalone software that our users fully control. People often cite lock-in as a major tradeoff of popular Database-as-a-Service offerings, but it’s also important to note that this kind of fault-injection-based testing just isn’t possible when you don’t control the environment.
Data Safety and Correct Answers Are Our Highest Priority
Among the things that set VoltDB apart is our combination of a strongly ACID relational SQL database on a natively clustered platform. We’re database people, but we’re also distributed systems people.
Selling a data product on its strong consistency and robust fault tolerance can be challenging, and is based on credibility and trust. Our marketing material can tell you your data is safe, but anyone can do that. We show you we take this seriously through our actions.
Take today for example. We hired Jepsen as soon as we could. We held the release of VoltDB 6.4 until every bug was fixed. In one case, we made a minor performance sacrifice to make sure default consistency settings were as strong as we promised.
VoltDB is, first and foremost, an operational database for the 21st century. To us, that means it has to check a few boxes:
- You can trust VoltDB with your data. Keeping your data safe is job #1 at VoltDB. Any data corruption or loss issues are release-blocking bugs that are prioritized over all other work.
- The reads, computations and writes you do in a VoltDB transaction are 100% correct as of the time the transaction executes. We believe it is our job to worry about as much of the complexity of distributed systems and database consistency models as we can, leaving the developer to focus primarily on his or her business logic.
- It needs to be as easy as we can make it to keep VoltDB running for years on end. We don’t require an external ZooKeeper. We don’t have different kinds of nodes in a cluster. Installation is as simple as unpacking a tarball. Replacing failed nodes or recovering a full cluster are single command line operations.
- Our testing has to be outstanding and as transparent as possible.
There are certainly other things we care about. We spend lots of time making VoltDB easier to use and also expanding our use cases. For example, in 6.0, we added support for geospatial types, queries and indexes.
But being the best at something means focus and prioritizing. At VoltDB, building an operational database means data safety, high availability, and management and administrative simplicity. This is why VoltDB is trusted by many of the world’s largest telecommunications networks in their critical infrastructures. These customers are not easy to please, but it’s exactly these high operational standards that allow us to stand apart from other systems.
The immediate next step after Jepsen for us involved getting the Jepsen testing harness into our continuous integration process. This allows us to test all nightly builds and upcoming releases automatically. It also allows us to run Jepsen on specific branches as we develop new features.
We’re also in the middle of a post-mortem on our other tests. Some of our tests overlap with the kinds of issues Jepsen is designed to find. These tests have found many issues over the years and have been invaluable in making VoltDB as robust as it is. Still, Jepsen found a few issues that weren’t covered by our existing tests. We are working to understand why these issues weren’t found, and also what kinds of things we can change to find these issues in the future. In the meantime, as mentioned above, we are internally running Jepsen regularly alongside our existing tests.
As we push forward on the 6.x releases, and ultimately to 7.0 and beyond, we plan to continue to expand our existing tests and create new ones that come at problems from new directions.
We’ve created a Transaction and Consistency FAQ with some additional background. We hope this helps readers understand how VoltDB works and what Jepsen found. Kingsbury’s post also mentions some future areas he’d like to explore with VoltDB and we cover those topics too.
- What is K-Safety?
- What is a read-only transaction and what is a write or read-write transaction?
- What is a committed transaction?
- What does ‘all live replicas’ mean?
- How can VoltDB fail from two to just one while guarding against split brain?
- When is a transaction committed?
- How fast are cross-partition transactions?
- How does VoltDB handle partial network partitions?
- How does VoltDB handle bit flip errors on disk?
- How does VoltDB handle clock skew in its agreement algorithms?
It’s our goal at VoltDB to make each release stronger than the previous ones. When customers ask me what the best release of VoltDB is, I always say the latest one. I owe my confidence to an engineering team that cares about quality, and a continuous integration effort that makes it easy to build on past releases and make software better, all while minimizing regressions.
Finally, we’d especially like to thank Kyle Kingsbury for his work on this project. It’s been a pleasure working with him, and it’s always very helpful whenever we have a third-party expert evaluate our software and give us feedback.
So give VoltDB 6.4 a try, and feel free to reach out to us if you have questions.
Sponsored by VoltDB.