Scaling MySQL and MariaDB to TBs: Interview with Martín Farach-Colton.
“While I believe that one size fits most, claims that RDBMS can no longer keep up with modern workloads come in from all directions. When people talk about performance of databases on large systems, the root cause of their concerns is often the performance of the underlying B-tree index”— Martín Farach-Colton.
Scaling MySQL and MariaDB to TBs and still be performant. Possible? I did interview on this topic Martín Farach-Colton, Tokutek Co-founder & Chief Technology Officer.
Q1. What is TokuDB?
Farach-Colton: TokuDB® is a storage engine for MySQL and MariaDB that uses Fractal Tree Indexes. It enables MySQL and MariaDB to scale from GBs to TBs and provides improved query performance, excellent compression, and online schema flexibility. TokuDB requires no changes to MySQL applications or code and is fully ACID and MVCC complaint.
While I believe that one size fits most, claims that RDBMS can no longer keep up with modern workloads come in from all directions. When people talk about performance of databases on large systems, the root cause of their concerns is often the performance of the underlying B-tree index. This makes sense, because almost every system out there that indexes big data does so with a B-tree. The B-tree was invented more than 40 years ago and has been fighting hardware trends ever since.
For example, although some people really do have OLAP workloads and others really do have OLTP workloads, I believe that most users are forced to choose between analytics and live updates by the performance profile of the B-tree, and that their actual use case requires the best of both worlds. The B-tree forces compromises, which the database world has internalized.
Fractal Tree Indexes® replace B-tree indexes and are based on my academic research on so-called Cache-Oblivious Analysis, which I’ve been working on with my Tokutek co-founders, Michael Bender and Bradley Kuszmaul. Fractal Tree Indexes speed up indexing — that is, they are write optimized — without giving up query performance. The best way to get fast queries is to define the right set of indexes, and Fractal Tree Indexes let you do that. Legacy technologies have staked out claims on some part of the B-tree query/indexing tradeoff curve. Fractal Tree Indexes put you on a higher-performing tradeoff curve. Query-optimal write-optimized indexing is all about making general-purpose databases faster. For some of our customers’ workloads, it’s as much as two orders of magnitude faster.
Q2. You claim to be able to “scale MySQL from GBs to TBs while improving insert and query speed, compression, replication performance, and online schema flexibility.” How do you do that?
Farach-Colton: In a Fractal Tree Index, all changes — insertions, deletions, updates, schema changes — are messages that get injected into the tree. Even though these messages get injected into the root and might get moved several times before getting to a leaf, all queries will see all relevant messages. Thus, injecting a message is fast, and queries reflect the effects of a message, e.g., changing a field or adding a column, immediately.
In order to make indexing fast, we make sure that each I/O to disk does a lot of work. I/Os are slow. For example, if accessing the fastest part of memory is like walking across a room, getting something from disk is like walking from New York to St Louis. The analogy in Germany would be to walk from Berlin to Rome — Germany itself isn’t big enough to hold a disk I/O on this scale!
To see how Fractal Tree Indexes work, first consider a B-tree. A B-tree delivers each item to its final destination as it arrives. It’s as if Amazon were to put a single book in a truck and drive it to your home. A Fractal Tree Index manages a set of buffers — imagine regional warehouses — so that whenever you move a truck, it is full of useful information.
The result is that insertion speed is dramatically higher. Query speeds are higher because you keep more indexes. And compression is higher because we can afford to read and write data in larger chunks. Compression on bigger chunks of data is typically more effective, and that rule of thumb has panned out for our customers.
Q3 Do you have any benchmark results to support your claims? If yes, could you please give us some details?
Q4. You developed your own benchmark iiBench, why not use TPC benchmarks?
Farach-Colton: Our benchmarking page also shows performance on TPC and Sysbench workloads. iiBench is meant to expand the range of comparison for storage engines, not to replace existing benchmarks.
Benchmarks are developed to measure performance where vendors differ. iiBench measures how a database performs on a mixed insertion/query workload. B-trees are notoriously slow at updating indexes on the fly, so no one has bothered to produce a benchmark that measures performance on such workloads. We believe (and our customers confirm for us) that this Real-time OLAP (aka Analytical OLTP) workload is critical. Furthermore, iiBench seems to be filling a need in that there has been some third-party adoption, most notably by Mark Callaghan at Facebook.
Q5. How is iiBench defined? What kind of queries did you benchmark and with which data size?
At best, all I could do is summarize it here and risk omitting details important to some of your readers. Instead, let me refer you to our website where you can read it in detail here , and the Facebook modifications can be found here.
Q6. How do you handle MySQL slave lag (delays)?
Farach-Colton: The bottleneck that causes slave lag is the single-threaded nature of master-slave replication. InnoDB has single-threaded indexing that is much slower than its multi-threaded indexing. Thus, the master can support a multi-threaded workload that is substantially higher than that achievable on the single-threaded slave. (There are plans to improve this in upcoming releases of InnoDB.)
TokuDB has much higher insertion rates, and in particular, very high single-threaded insertion rates. Thus, the slave can keep up with a much more demanding workload. (Please see our benchmark page for details.)
Q7. You have developed a technique called “Fractal Tree indexes.” What is it? How does it compare with standard Tree indexes?
Farach-Colton: Much of this question was answered in question 2. Here, I’ll add that Fractal Tree Indexes are the first write-optimized data structure used as an index in databases. The Log-Structured Merge Tree (LSM) also achieves very high indexing rates, but at the cost of so-called read amplification, which means that indexing is much faster than for B-trees, but queries end up being much slower.
Fractal Tree Indexes are as fast as LSMs for indexing, and as fast as B-trees for queries. Thus, they dominate both.
A final note on SSDs: B-trees are I/O bound, and SSDs have much higher IOPS than rotational disks. Thus, it would seem that B-trees and SSDs are a marriage made in heaven. However, consider the fact that when a single row is modified in a B-tree leaf, the entire leaf must be re-written to disk. So, in addition to the problems of write amplification caused by wear leveling and garbage collection on SSDs, B-trees induce a much greater level of write amplification, because small changes induce large writes. Write-optimized indexes largely eliminate both types of write amplification, as well as potentially extending the life of SSDs.
Q8. What is special about your data compression technique?
Farach-Colton: Compression is like a freight train, slow to get started but effective once it’s had a chance to get up to speed. Thus, one of the biggest factors for how much compression is achieved is how much you compress at a time. Give a compression algorithm some data 4KB at a time, and the compression will be poor. Give it 1MB or more at a time, and the same compression algorithm will do a much better job. Because of the structure of our indexes, we can compress larger blocks at a time.
I should also point out that one of the problems that dogs databases is aging or fragmentation, in which the leaves of the B-tree index get scattered on disk, and range queries become much slower. This is because leaf-to-leaf seeks become slower, when the leaves are scattered. When leaves are clustered, they can be read quickly by exploiting the prefetching mechanisms of disks. The standard solution is to periodically rebuild the index, which can involve substantial down time. Fractal Tree indexes do not age, under any workload, for the same reason that they compress so well: by handling much larger blocks of data than a B-tree can, we are able to keep the data much more clustered, and thus it is never necessary to rebuild an index.
Q9. How is TokuDB different with respect to SchoonerSQL, NuoDB and VoltDB?
Farach-Colton: VoltDB is an in-memory database. The point is therefore to get the highest OLTP performance for data that is small enough to fit in RAM.
By comparison, TokuDB focuses on very large data sets — too big for RAM — for clients who are interested in moving away from the OLAP/OLTP dichotomy, that is, for clients who want to keep data up to date while still querying it through rich indexes.
SchoonerSQL has a combination of technologies involving optimization for SSDs as well as scale-out. TokuDB is a scale-up solution that improves performance on a per-node basis.
NuoDB is also a scale-out solution, in this case cloud-based, though I am less familiar with the details of their innovation at a more technical level.
Martín Farach-Colton is a co-founder and chief technology officer at Tokutek. He was an early employee at Google, where he worked on improving performance of the Web Crawl and where he developed the prototype for the AdSense system.
An expert in algorithmic and information retrieval, Prof. Farach-Colton is also a Professor of Computer Science at Rutgers University.