On Database Resilience. Interview with Seth Proctor

by Roberto V. Zicari on March 17, 2015

“In normal English usage the word resilience is taken to mean the power to resume original shape after compression; in the context of data base management the term data base resilience is defined as the ability to return to a previous state after the occurrence of some event or action which may have changed that state.
from “P. A Dearnley, School of Computing Studies, University of East Anglia, Norwich NR4 7TJ , 1975 “

On the topic database resilience, I have interviewed Seth Proctor, Chief Technology Officer at NuoDB.

RVZ

Q1. When is a database truly resilient?

Seth Proctor: That is a great question, and the quotation above is a good place to start. In general, resiliency is about flexibility. It’s kind of the view that you should bend but not break. Individual failures (server crashes, disk wear, tripping over power cables) are inevitable but don’t have to result in systemic outages.
In some cases that means reacting to failure in a way that’s non-disruptive to the overall system.
The redundant networks in modern airplanes are a great example of this model. Other systems take a deeper view, watching global state to proactively re-route activity or replace components that may be failing. This is the model that keeps modern telecom networks running reliably. There are many views applied in the database world, but to me a resilient database is one that can react automatically to likely or active failures so that applications continue operating with full access to their data even as failures occur.

Q2. Is database resilience the same as disaster recovery?

Seth Proctor: I don’t believe it is. In traditional systems there is a primary site where the database is “active” and updates are replicated from there to other sites. In the case of failure to the primary site, one of the replicas can take over. Maintaining that replica (or replicas) is usually the key part of Disaster Recovery.
Sometimes that replica is missing the latest changes, and usually the act of “failing over” to a replica involves some window where the database is unavailable. This leads to operational terms like “hot stand-by” where failing over is faster but still delayed, complicated and failure-prone.

True resiliency, in my opinion, comes from systems that are designed to always be available even as some components fail. Reacting to failure efficiently is a key requirement, as is survival in case of complete site loss, so replicating data to multiple locations is critical to resiliency. At a minimum, however, a resilient data management solution cannot lose data (sort of “primum non nocere” for servers) and must be able to provide access to all data even as servers fail. Typical Disaster Recovery solutions on their own are not sufficient. A resilient solution should also be able to continue operations in the face of expected failures: hardware and software upgrades, network updates and service migration.
This is especially true as we push out to hybrid cloud deployments.

Q3. What are the database resilience requirements and challenges, especially in this era of Big Data?

Seth Proctor: There is no one set of requirements since each application has different goals with different resiliency needs. Big Data is often more about speeds and volume while in the operational space correctness, latency and availability are key. For instance, if you’re handling high-value bank transactions you have different needs than something doing weekly trend-analysis on Internet memes. The great thing about “the cloud” is the democratization of features and the new systems that have evolved around scale-out architectures. Things like transactional consistency were originally designed to make failures simpler and systems more resilient; as consistent data solutions scale out in cloud models it’s simpler to make any application resilient without sacrificing performance or increasing complexity.

That said, I look for a couple of key criteria when designing with resiliency in mind. The first is a distributed architecture, the foundation for any system to survive individual failure but remain globally available.
Ideally this provides a model where an application can continue operating even as arbitrary components fail. Second is the need for simple provisioning & monitoring. Without this, it’s hard to react to failures in an automatic or efficient fashion, and it’s almost impossible to orchestrate normal upgrade processes without down-time. Finally, a database needs to have a clear model for how the same data is kept in multiple locations and what the failure modes are that could result in any loss. These requirements also highlight a key challenge: what I’ve just described are what we expect from cloud infrastructure, but are pushing the limits of what most shared-nothing, sharded or vertically-scaled data architectures offer.

Q4. What is the real risk if the database goes offline?

Seth Proctor: Obviously one risk is the ripple effect it has to other services or applications.
When a database fails it can take with it core services, applications or even customers. That can mean lost revenue or opportunity and it almost certainly means disruption across an organization. Depending on how a database goes offline, the risk may also extend to data loss, corruption, or both. Most databases have to trade-off certain elements of latency against guaranteed durability, and it’s on failure that you pay for that choice. Sometimes you can’t even sort out what information was lost. Perhaps most dangerous, modern deployments typically create the illusion of a data management service by using multiple databases for DR, scale-out etc. When a single database goes offline you’re left with a global service in an unknown state with gaps in its capabilities. Orchestrating recovery is often expensive, time-consuming and disruptive to applications.

Q5. How are customers solving the continuous availability problem today?

Seth Proctor: Broadly, database availability is tackled in one of two fashions. The first is by running with many redundant, individual, replicated servers so that any given server can fail or be taken offline for maintenance as needed. Putting aside the complexity of operating so many independent services and the high infrastructure costs, there is no single view of the system. Data is constantly moving between services that weren’t designed with this kind of coordination in mind so you have to pay close attention to latencies, backup strategies and visibility rules for your applications. The other approach is to use a database that has forgone consistency, making a lot of these pieces appear simpler but placing the burden that might be handled by the database on the application instead. In this model each application needs to be written to understand the specifics of the availability model and in exchange has a service designed with redundancy.

Q6. Now that we are in the Cloud era, is there a better way?

Seth Proctor: For many pieces of the stack cloud architectures result in much easier availability models. For the database specifically, however, there are still some challenges. That said, I think there are a few great things we get from the cloud design mentality that are rapidly improving database availability models. The first is an assumption about on-demand resources and simplicity of spinning up servers or storage as needed. That makes reacting to failure so much easier, and much more cost-effective, as long as the database can take advantage of it. Next is the move towards commodity infrastructure. The economics certainly make it easier to run redundantly, but commodity components are likely to fail more frequently. This is pushing systems design, making failure tests critical and generally putting more people into the defensive mind-set that’s needed to build for availability. Finally, of course, cloud architectures have forced all of us to step back and re-think how we build core services, and that’s leading to new tools designed from the start with this point of view. Obviously that’s one of the most basic elements that drives us at NuoDB towards building a new kind of database architecture.

Q7. Can you share methodologies for avoiding single points of failure?

Seth Proctor: For sure! The first thing I’d say is to focus on layering & abstraction.
Failures will happen all the time, at every level, and in ways you never expect. Assume that you won’t test all of them ahead of time and focus on making each little piece of your system clear about how it can fail and what it needs from its peers to be successful. Maybe it’s obvious, but to avoid single points of failure you need components that are independent and able to stand-in for each other. Often that means replicating data at lower-levels and using DNS or load-balancers at a higher-level to have one name or endpoint map to those independent components. Oh, also, decouple your application logic as much as possible from your operational model. I know that goes against some trends, but really, if your application has deep knowledge of how some service is deployed and running it makes it really hard to roll with failures or changes to that service.

Q8. What’s new at NuoDB?

Seth Proctor: There are too many new things to capture it all here!
For anyone who hasn’t looked at us, NuoDB is a relational database build against a fundamentally new, distributed architecture. The result is ACID semantics, support for standard SQL (joins, indexes, etc.) and a logical view of a single database (no sharding or active/passive models) designed for resiliency from the start.
Rather than talk about the details here I’d point people at a white paper (Note of the Editor: registration required) we’ve just published on the topic.
Right now we’re heavily focused on a few key challenges that our enterprise customers need to solve: migrating from vertical scale to cloud architectures, retaining consistency and availability and designing for on-demand scale and hybrid deployments. Really important is the need for global scale, where a database scales to multiple data centers and multiple geographies. That brings with it all kinds of important requirements around latencies, failure, throughput, security and residency. It’s really neat stuff.

Q9- How does it differentiate with respect to other NoSQL and NewSQL databases?

Seth Proctor: The obvious difference to most NoSQL solutions is that NuoDB supports standard SQL, transactional consistency and all the other things you’d associate with an RDBMS.
Also, given our focus on enterprise use-cases, another key difference is the strong baseline with security, backup, analysis etc. In the NewSQL space there are several databases that run in-memory, scale-out and provide some kind of SQL support. Running in-memory often means placing all data in-memory, however, which is expensive and can lead to single points of failure and delays on recovery. Also, there are few that really support the arbitrary SQL that enterprises need. For instance, we have customers running 12-way joins or transactions that last hours and run thousands of statements.
These kinds of general-purpose capabilities are very hard to scale on-demand but they are the requirement for getting client-server architectures into the cloud, which is why we’ve spent so long focused on a new architectural view.
One other key difference is our focus on global operations. There are very few people trying to take a single, logical database and distribute it to multiple geographies without impacting consistency, latency or security.

Qx Anything else you wish to add?

Seth Proctor: Only that this was a great set of questions, and exactly the direction I encourage everyone to think about right now. We’re in a really exciting time between public clouds, new software and amazing capacity from commodity infrastructure. The hard part is stepping back and sorting out all the ways that systems can fail.
Architecting with resiliency as a goal is going to get more commonplace as the right foundational services mature.
Asking yourself what that means, what failures you can tolerate and whether you’re building systems that can grow alongside those core services is the right place to be today. What I love about working in this space today is that concepts like resilient design, until recently a rarefied approach, are accessible to everyone.
Anyone trying to build even the simplest application today should be asking these questions and designing from the start with concepts like resiliency front and center.

———–
Seth Proctor, Chief Technology Officer, NuoDB

Seth has 15+ years of experience in the research, design and implementation of scalable systems. That experience includes work on distributed computing, networks, security, languages, operating systems and databases all of which are integral to NuoDB. His particular focus is on how to make technology scale and how to make users scale effectively with their systems.

Prior to NuoDB Seth worked at Nokia on their private cloud architecture. Before that he was at Sun Microsystems Laboratories and collaborated with several product groups and universities. His previous work includes contributions to the Java security framework, the Solaris operating system and several open source projects, in addition to the design of new distributed security and resource management systems. Seth developed new ways of looking at distributed transactions, caching, resource management and profiling in his contributions to Project Darkstar. Darkstar was a key effort at Sun which provided greater insights into how to distribute databases.

Seth holds eight patents for his cutting edge work across several technical disciplines. He has several additional patents awaiting approval related to achieving greater database efficiency and end-user agility.

Resources

– Hybrid Transaction and Analytical Processing with NuoDB

– NuoDB Larks Release 2.2

– Various Resources on NuoDB.

Related Posts

On Big Data and the Internet of Things. Interview with Bill Franks
Source: ODBMS.org | Published on 2015-03-09
On MarkLogic 8. Interview with Stephen Buxton
Source: ODBMS.org | Published on 2015-02-13
On Data Curation. Interview with Andy Palmer
Source: ODBMS.org | Published on 2015-01-14
On Solr and Mahout. Interview with Grant Ingersoll
Source: ODBMS.org | Published on 2015-01-06

Follow ODBMS.org on Twitter: @odbmsorg

From → Uncategorized

No comments yet

Leave a Reply Cancel reply

About the author

Archives

Meta

About

Flickr

Search