Q&A with Data Engineers: John Hugg
John Hugg has spent his entire career working with databases and information management at a number of startups including Vertica Systems and now VoltDB. As the founding engineer at VoltDB, he was involved in all of the key early design decisions and worked collaboratively with the new VoltDB team as well as academic researchers exploring similar problems. In addition to his engineering role, John has become a primary evangelist for the technology, speaking at conferences worldwide and authoring blog posts on VoltDB, as well as on the state of OLTP and stream processing broadly.
Q1. What are in your opinion the differences between capacity, scale, and performance?
I think of “capacity” as the ability to add more data at rest, more throughput, or more complexity to a service without impacting concrete or implied SLAs.
That same system can “scale” if I can add more storage, networking, or compute to increase my “capacity”. Note that these words are multidimensional, and which dimensions are important is largely determined by the application.
Performance is also nuanced. The two key steady-state metrics are “throughput” and “latency”, and the application will determine acceptable values here. But often just as critical are metrics like MTTR (mean time to recovery) or reconfiguration times. If three different systems all meet throughput SLAs, it may not matter which is faster in the steady state; I’m going to choose based on other factors.
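The two steady-state metrics above can be derived from raw per-request measurements. A minimal sketch (a hypothetical helper, not VoltDB tooling) of computing throughput and latency percentiles over one measurement window:

```python
# Hypothetical helper: summarize steady-state performance for one window.
import statistics

def summarize(latencies_ms, window_seconds):
    """Compute throughput and latency percentiles from per-request latencies."""
    throughput = len(latencies_ms) / window_seconds      # requests per second
    p50 = statistics.median(latencies_ms)                # typical latency
    p99 = statistics.quantiles(latencies_ms, n=100)[98]  # tail latency (99th pct)
    return {"throughput_rps": throughput, "p50_ms": p50, "p99_ms": p99}

# Example: 5 requests observed over a 1-second window, one with a slow tail.
print(summarize([2.1, 2.3, 2.2, 9.8, 2.4], window_seconds=1.0))
```

Note the tail (p99) can look very different from the median, which is why an SLA stated only as average throughput can hide exactly the operational behavior discussed next.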
This is one reason our team at VoltDB has recently invested more effort in performance during operational events than in steady-state performance. Our latest version has slightly higher throughput than our software from three years ago, but it recovers from failure faster, it reconfigures faster, it balances workloads more fairly, and it’s less susceptible to garbage collection pauses.
Q2. ACID, BASE, CAP. Are these concepts still important?
One of the most interesting shifts of the past few years is the repudiation of the practice of expecting developers to correctly manage mutable data on top of eventually consistent databases. One of the clearest articulations of this comes from the Google F1 paper:
“The system must provide ACID transactions, and must always present applications with consistent and correct data. Designing applications to cope with concurrency anomalies in their data is very error-prone, time-consuming, and ultimately not worth the performance gains.”
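The classic concurrency anomaly the F1 authors have in mind is the “lost update”. A deterministic toy interleaving (no real database involved) of two clients doing an unguarded read-modify-write:

```python
# Two clients each read a shared balance, modify it locally, and write it back.
# Without transactional isolation, the interleaving below loses client A's update.
balance = 100

a_read = balance        # client A reads 100
b_read = balance        # client B reads 100
balance = a_read + 20   # client A writes back 120
balance = b_read + 30   # client B writes back 130 -- A's +20 is silently lost

print(balance)          # 130; a serializable (ACID) system would yield 150
```

Under serializable transactions the two read-modify-write sequences cannot interleave this way, which is precisely the burden the F1 quote argues should live in the database rather than in every application.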
Data management is bifurcating into strongly consistent systems and immutable data management, two paradigms that non-wizard developers can use successfully.
For mutable data, delivering more consistency means more ACID, more CP (in CAP terms), or often both. NoSQL stalwarts like MongoDB and Cassandra are offering CP options. Newer systems like Google Spanner and VoltDB are CP/ACID and proud of it.
Meanwhile there is a new push for immutable data where possible, sidestepping many consistency issues. This has always been favored for data warehouses, but new distributed log systems like Kafka can bring immutable semantics to operations as well.
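The core of those immutable semantics can be sketched in a few lines. This is a hypothetical in-memory stand-in, not Kafka’s actual client API: records are only ever appended, reads at an offset are repeatable, and current state is derived by folding over the history rather than by mutating it.

```python
# Minimal Kafka-style append-only log (toy in-memory version).
class Log:
    def __init__(self):
        self._records = []

    def append(self, record):
        """Append-only: existing records are never updated or deleted."""
        self._records.append(record)
        return len(self._records) - 1   # offset of the new record

    def read(self, offset):
        """Repeatable reads: the same offset always returns the same record."""
        return self._records[offset]

log = Log()
log.append({"account": "a1", "delta": +20})
log.append({"account": "a1", "delta": -5})

# State is a fold over the immutable history, sidestepping update anomalies.
balance = sum(r["delta"] for r in (log.read(0), log.read(1)))
print(balance)   # 15
```

Because nothing is overwritten, there is no update to lose: consumers at different offsets simply see more or less of the same history, which is what makes this model tractable without full ACID machinery.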