On Distributed SQL Databases for Disaster Management. Q&A with Andrew C. Oliver

Q1. New public outage trends suggest there will be at least 20 serious, high-profile IT outages worldwide this year (*). What are the possible consequences for companies? 

Ultimately, these outages are not acceptable and, depending on the industry, one could be a death knell. For consumer-facing apps, if a ride-hailing app is out for a day, you will figure out how to sign up for another one. After that, the consumer either stays with the competitor out of momentum or at least starts comparing prices. These outages can even potentially become “goodwill impairment disclosures” with the SEC, which is probably not going to do wonders for the company’s stock price.

Q2. Surviving a disaster is about preparation. How can companies safeguard themselves by working with a distributed SQL database? Why a distributed SQL database?

Before we even get to disaster recovery or availability, distributed SQL databases handle both read and write scale. If you’re running a modern web-based service, this is pretty critical. Other topologies have a single write node and a longer recovery time if a node goes down. Distributed SQL databases can survive multiple zone failures and even regional failures while maintaining data integrity and transactional consistency. If you’re moving a system of record to the cloud, you should seriously consider a distributed SQL database. If you’re running a very high-volume site, you should consider one with parallel replication between global regions (because otherwise replication can fall behind the volume of incoming writes).

Q3.  What if you have cloud applications? 

It’s twice as critical to use a distributed SQL database for cloud applications, as you don’t control the infrastructure. While cloud infrastructure is very good and more reliable on average than self-hosted or colocated infrastructure, it isn’t under your control, and other people’s decisions about risk or upgrades can affect you. Distributed SQL databases are architected for this kind of environment, where something is probably going to go wrong.

Q4. Can you please give us an example of how to use multiple availability zones and cross-region replication in practice?

For distributed SQL databases, while different implementations use different names, they all work roughly like this: tables are split into slices. Rows are hashed and the hash decides which slice (and thus node) the row is stored in. Each slice has a replica. Replicas are placed in different zones. Reads and writes happen using the ranking replica for that slice. This ensures an even distribution of data, reads and writes, and that at least one replica of each slice sits in another zone. When a node is lost, the system recreates replicas and rebalances. When nodes are added, slices are moved to rebalance the data. If workloads become unbalanced, then the system rebalances them automatically.
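
The slice mechanism described above can be illustrated with a small sketch. This is not how Xpand or any particular product implements it; the slice count, node names, zone names and the helper functions (slice_for, place_replicas, rebalance_after_loss) are all hypothetical, and SHA-256 is used only as a convenient stand-in for whatever hash a real system applies.

```python
import hashlib
from typing import Dict, List

NUM_SLICES = 8  # hypothetical slice count for one table

# Hypothetical cluster layout: node name -> availability zone.
NODE_ZONES: Dict[str, str] = {
    "node1": "zone-a",
    "node2": "zone-b",
    "node3": "zone-c",
}


def slice_for(row_key: str) -> int:
    """Hash the row key; the hash decides which slice the row belongs to."""
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SLICES


def place_replicas(slice_id: int, nodes: Dict[str, str]) -> List[str]:
    """Pick two nodes in different zones for a slice.

    The first node returned plays the role of the replica that serves
    reads and writes; the second is the copy kept in another zone.
    """
    names = sorted(nodes)
    first = names[slice_id % len(names)]
    for offset in range(1, len(names)):
        candidate = names[(slice_id + offset) % len(names)]
        if nodes[candidate] != nodes[first]:
            return [first, candidate]
    raise ValueError("need nodes in at least two zones")


def rebalance_after_loss(lost: str, nodes: Dict[str, str]) -> Dict[int, List[str]]:
    """Recompute placement for every slice using only the surviving nodes."""
    survivors = {n: z for n, z in nodes.items() if n != lost}
    return {s: place_replicas(s, survivors) for s in range(NUM_SLICES)}


if __name__ == "__main__":
    s = slice_for("order:42")
    print("slice", s, "lives on", place_replicas(s, NODE_ZONES))
    print("after losing node2:", rebalance_after_loss("node2", NODE_ZONES)[s])
```

A real system moves only the affected slices rather than recomputing every placement, but the sketch shows why losing one node or one zone never removes the only copy of a slice.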

Cross-region replication happens either synchronously (an edge case, as it is murderous on latency) or asynchronously (in the case of MariaDB Xpand, this is done in parallel with multiple nodes participating). In either case, both regions can optionally handle reads and writes. If a region is lost or taken down, then the other region takes over its workload.
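
As a rough illustration of the asynchronous case, the toy model below has each region accept writes locally, queue them, and ship them to its peer later. It is a sketch under simplifying assumptions, not Xpand's replication protocol: the Region class, ship_pending and route are hypothetical names, there is a single queue rather than parallel streams, and conflict handling is ignored.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class Region:
    """A toy region holding a key-value copy of the data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True
        self.data: Dict[str, str] = {}
        # Writes accepted locally but not yet shipped to the peer region.
        self.outbox: Deque[Tuple[str, str]] = deque()

    def write(self, key: str, value: str) -> None:
        """Accept a write locally and queue it for later replication."""
        self.data[key] = value
        self.outbox.append((key, value))


def ship_pending(source: Region, target: Region) -> int:
    """Apply queued writes from source to target; return how many were shipped."""
    shipped = 0
    while source.up and target.up and source.outbox:
        key, value = source.outbox.popleft()
        target.data[key] = value  # applied directly, not re-queued
        shipped += 1
    return shipped


def route(preferred: Region, fallback: Region) -> Region:
    """Send traffic to the preferred region, or to the other one if it is down."""
    if preferred.up:
        return preferred
    if fallback.up:
        return fallback
    raise RuntimeError("no region available")


if __name__ == "__main__":
    east, west = Region("east"), Region("west")
    east.write("cart:7", "3 items")
    ship_pending(east, west)
    east.up = False
    print("traffic now goes to", route(east, west).name)
```

The queue also shows the trade-off with the synchronous option: the local write returns without waiting on the other region, so anything still in the queue when a region is lost has not reached the peer yet.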

Q5. Talking about human errors, what fault tolerance levels are built into a distributed SQL database that can help recover from such errors?

Some kinds of network, storage and other infrastructure configuration issues can be handled like faults. However, when someone overwrites data, that change is dutifully replicated by the database, so everywhere is as wrong as everywhere else. This is where transactionally safe backups come into play. In the case of Xpand, backups are done in parallel and online.

Q6. Anything else you wish to add?

If you’d like to try it, Xpand is MariaDB’s distributed SQL database, and you can try it for free on either AWS or GCP ($500 credit). Xpand is compatible with MySQL and MariaDB.


Andrew C. Oliver is a columnist and software developer with a long history in open source, databases, and cloud computing. He founded Apache POI and served on the board of the Open Source Initiative. Oliver also helped with marketing in startups including JBoss, Lucidworks, and Couchbase. He is currently the Senior Director of Product Marketing for MariaDB Corporation.

Sponsored by MariaDB


(*) Uptime Institute’s 2022 Outage Analysis Finds Downtime Costs and Consequences Worsening as Industry Efforts to Curb Outage Frequency Fall Short
