On Distributed SQL Databases. Q&A with Matt White

Q1. What is a distributed SQL database?

Any SQL database that can recruit computing resources from more than one network node to satisfy a query. That’s a pretty broad definition and of course there are solutions with a wide range of distributed-ness.

Q2.What are the main benefits of replicating data across multiple servers?

Scalability and fault tolerance are of course the big ones. Once you start digging a little deeper, there are aspects of scalability that can really shine in cloud environments, such as the ability to scale up database performance during peak seasons (e.g. the winter holidays for retailers) and then scale back to a much smaller footprint the rest of the year.

Q3. Raft and Paxos are well known distributed consensus algorithms to achieve consensus across multiple nodes , which one would you suggest works best and why?

MariaDB Xpand uses Paxos because at the time it was developed Paxos was the leading distributed consensus algorithm. 

In terms of functionality, they’re roughly equivalent. Raft has the advantage of being easier to understand, which makes it easier to build a correct implementation. 

There are some performance optimizations available to Paxos that may not work for Raft.

Q4. One of the known challenges that SQL servers face is scaling out. What are the best solutions available?

MariaDB Xpand!

This is a big question and there’s a lot of nuance to it. For today, I’m going to assume that by “scaling out”, we mean “using clustered computers to increase transactional throughput while maintaining some reasonable latency budget”, because I tend to think about OLTP systems more than OLAP systems.

For this definition of scale-out, there are three major barriers:

  1. Making sure that all the compute resources are as fully utilized as possible.
  2. Minimizing the total amount of work done per transaction.
  3. Doing all this while maintaining transparency to the application.

The big barrier to (1) tends to be the extra network latency introduced by clustered computing solutions. We addressed this in Xpand by engineering from the ground up for massive concurrency so that we can hide this network latency by working on many transactions simultaneously. Some examples of things we’ve done here are heavy reliance on lockless data structures, lightweight multitasking models and extremely aggressive buffer management.

What systems do to minimize the amount of work they do per transaction is going to be specific to the system in question. The big innovation that Xpand brought to the table here is that we leverage the interrogative nature of SQL to ship queries to where the data is instead of pulling the data to where the query is being processed. This also is the founding principle of the system so the query compiler can reason relatively deeply about the location of data, breaking the query up into query fragments that stream partial results horizontally through the cluster without relying on any node that could become a bottleneck.

And, of course, we do this while maintaining the facade of a single instance database so that applications can make use of the clustered database without needing to be re-engineered.

Q5. It has been reported that a system claims the ACID properties of a database does not necessarily mean it is truly a transactional OLTP system. What is your take on this?

ACID implies strong transactional consistency, so if a database claims ACID but can’t guarantee transactional consistency, I would claim that it is not truly ACID.

Q6. Kubernetes together with distributed SQL. What is your take on this?

It’s certainly not necessary for a successful distributed SQL deployment. Kubernetes has some interesting qualities when it comes to cloud database management. While Xpand’s distributed architecture is designed to tolerate infrastructure failures, the self-healing capabilities of Kubernetes enhance the availability and resiliency of Xpand on SkySQL. In the latest release of SkySQL, Kubernetes also helps us deliver autonomous scale. For Xpand in SkySQL, that means that users can determine the parameters for when more or less scale is needed. SkySQL automatically adds or removes Xpand nodes when those parameters are met.  

Q7. MariaDB Xpand is a distributed SQL database. How does it compare with respect to other Distributed SQL systems?

It’s faster.

MariaDB Xpand is many things, but chief amongst them has to be that it is a compiler technology. Application queries are compiled into program fragments that communicate in a peer-to-peer fashion to implement the query. These program fragments are shipped around the cluster to where the data is located, limiting the amount of user data that has to be shipped between nodes. This allows us to efficiently implement things like parallel join, correlated subqueries and multilevel aggregation that can be expensive in other systems.

Our compiler technology is also capable of maintaining replicated data within the cluster without running a consensus algorithm on our rowstores which, along with our close-to-the-metal optimizations, gives us a performance advantage for OLTP workloads (https://mariadb.com/resources/blog/mariadb-xpand-crunches-cockroach-with-sysbench/).

Q8. How does Xpand tolerate infrastructure failures and maintain availability?

Xpand tolerates failures up to a MAX_FAILURES setting, which defaults to a single simultaneous failure. Xpand nodes can be configured into zones, which count as a single failure should the zone fail. While the node(s) are out of quorum, writes to the surviving replicas are logged up until a configurable timeout is reached. If the failed nodes return to quorum before this timeout, the writer logs are replayed against the data in the returning nodes to bring them back up to date. Should the nodes be absent for longer than their timeout, data is replicated to the surviving nodes automatically to maintain the MAX_FAILURES policy.

Network partitions are handled by requiring that a majority of nodes be available to form a quorum.

Q9. Xpand can “store copies of data in multiple availability zones within a region when running in the cloud, or in multiple racks when running on-premises”. What does it mean in practice?

When an Xpand deployment has defined availability zones, the loss of an entire zone is considered a single failure. With the default configuration, this means that a zone can go down without loss of availability. Our literature specifies “multiple racks” because we wanted to call out that there are other kinds of correlated failures that may be specific to your infrastructure. Even in a single data center, it can be useful to separate a cluster into zones to avoid these kinds of failures.

Q10. Xpand uses “parallel streaming replication between regions/data centers to support both active/active and active/passive global deployments”. What does it mean in practice?

Parallel replication is our asynchronous replication mechanism for streaming database updates from one Xpand cluster to another. It works in both primary/secondary and primary/primary mode like MariaDB server’s replication. We developed this because single stream replication can’t ship data fast enough to keep up with the work the primary cluster is doing. In practice, this means that customers can maintain hot DR clusters in case of catastrophic failure of the primary cluster (e.g. when required by corporate governance).

In a test for an active-active financial application using parallel replication between US-WEST and US-EAST, we saw 16.5k writes per second with sub-second replication lag.

Q11. When writing a SQL query and/or a SQL update do I need to know the network and servers distribution?

No. That’s all hidden from the client. The idea is that the Xpand cluster appears to the client as a single node database. 

Q12. What resources are available for Xpand?

To learn more about how Xpand works, check out our blog where we examine distributed continuation, distributed joins and distributed aggregates. When you’re ready to give it a try, we have two easy ways to get started. Our single node docker image is a great way for developers to test application compatibility and learn about the Xpand features. For production environments, performance testing and to experience its full scale out powers, try Xpand in MariaDB SkySQL, our fully managed cloud database service. Instructions to get started are here.


Matt White has been working on distributed databases for over 12 years. He is currently Senior Director of Engineering at MariaDB where he leads the engineering team for Xpand, MariaDB’s distributed SQL database. Before working at MariaDB, he worked at CloudShield Technologies, IP Fabrics, Inc and Ericsson IP Infrastructure.

Sponsored by MariaDB

You may also like...