Scaling MySQL and MariaDB to TBs: Interview with Martín Farach-Colton.
“While I believe that one size fits most, claims that RDBMS can no longer keep up with modern workloads come in from all directions. When people talk about performance of databases on large systems, the root cause of their concerns is often the performance of the underlying B-tree index”— Martín Farach-Colton.
Scaling MySQL and MariaDB to TBs and still be performant. Possible? I did interview on this topic Martín Farach-Colton, Tokutek Co-founder & Chief Technology Officer.
RVZ
Q1. What is TokuDB?
Farach-Colton: TokuDB® is a storage engine for MySQL and MariaDB that uses Fractal Tree Indexes. It enables MySQL and MariaDB to scale from GBs to TBs and provides improved query performance, excellent compression, and online schema flexibility. TokuDB requires no changes to MySQL applications or code and is fully ACID and MVCC complaint.
While I believe that one size fits most, claims that RDBMS can no longer keep up with modern workloads come in from all directions. When people talk about performance of databases on large systems, the root cause of their concerns is often the performance of the underlying B-tree index. This makes sense, because almost every system out there that indexes big data does so with a B-tree. The B-tree was invented more than 40 years ago and has been fighting hardware trends ever since.
For example, although some people really do have OLAP workloads and others really do have OLTP workloads, I believe that most users are forced to choose between analytics and live updates by the performance profile of the B-tree, and that their actual use case requires the best of both worlds. The B-tree forces compromises, which the database world has internalized.
Fractal Tree Indexes® replace B-tree indexes and are based on my academic research on so-called Cache-Oblivious Analysis, which I’ve been working on with my Tokutek co-founders, Michael Bender and Bradley Kuszmaul. Fractal Tree Indexes speed up indexing — that is, they are write optimized — without giving up query performance. The best way to get fast queries is to define the right set of indexes, and Fractal Tree Indexes let you do that. Legacy technologies have staked out claims on some part of the B-tree query/indexing tradeoff curve. Fractal Tree Indexes put you on a higher-performing tradeoff curve. Query-optimal write-optimized indexing is all about making general-purpose databases faster. For some of our customers’ workloads, it’s as much as two orders of magnitude faster.
Q2. You claim to be able to “scale MySQL from GBs to TBs while improving insert and query speed, compression, replication performance, and online schema flexibility.” How do you do that?
Farach-Colton: In a Fractal Tree Index, all changes — insertions, deletions, updates, schema changes — are messages that get injected into the tree. Even though these messages get injected into the root and might get moved several times before getting to a leaf, all queries will see all relevant messages. Thus, injecting a message is fast, and queries reflect the effects of a message, e.g., changing a field or adding a column, immediately.
In order to make indexing fast, we make sure that each I/O to disk does a lot of work. I/Os are slow. For example, if accessing the fastest part of memory is like walking across a room, getting something from disk is like walking from New York to St Louis. The analogy in Germany would be to walk from Berlin to Rome — Germany itself isn’t big enough to hold a disk I/O on this scale!
To see how Fractal Tree Indexes work, first consider a B-tree. A B-tree delivers each item to its final destination as it arrives. It’s as if Amazon were to put a single book in a truck and drive it to your home. A Fractal Tree Index manages a set of buffers — imagine regional warehouses — so that whenever you move a truck, it is full of useful information.
The result is that insertion speed is dramatically higher. Query speeds are higher because you keep more indexes. And compression is higher because we can afford to read and write data in larger chunks. Compression on bigger chunks of data is typically more effective, and that rule of thumb has panned out for our customers.
Q3 Do you have any benchmark results to support your claims? If yes, could you please give us some details?
Farach-Colton: We have a benchmarking page. Our solutions page gives details of performance in the field from out customers.
Some of the highlights include up to 80x insertion performance, up to 25x compression, schema changes that take seconds instead of hours, and dramatic savings in overall disk I/O utilization.
Q4. You developed your own benchmark iiBench, why not use TPC benchmarks?
Farach-Colton: Our benchmarking page also shows performance on TPC and Sysbench workloads. iiBench is meant to expand the range of comparison for storage engines, not to replace existing benchmarks.
Benchmarks are developed to measure performance where vendors differ. iiBench measures how a database performs on a mixed insertion/query workload. B-trees are notoriously slow at updating indexes on the fly, so no one has bothered to produce a benchmark that measures performance on such workloads. We believe (and our customers confirm for us) that this Real-time OLAP (aka Analytical OLTP) workload is critical. Furthermore, iiBench seems to be filling a need in that there has been some third-party adoption, most notably by Mark Callaghan at Facebook.
Q5. How is iiBench defined? What kind of queries did you benchmark and with which data size?
At best, all I could do is summarize it here and risk omitting details important to some of your readers. Instead, let me refer you to our website where you can read it in detail here , and the Facebook modifications can be found here.
Q6. How do you handle MySQL slave lag (delays)?
Farach-Colton: The bottleneck that causes slave lag is the single-threaded nature of master-slave replication. InnoDB has single-threaded indexing that is much slower than its multi-threaded indexing. Thus, the master can support a multi-threaded workload that is substantially higher than that achievable on the single-threaded slave. (There are plans to improve this in upcoming releases of InnoDB.)
TokuDB has much higher insertion rates, and in particular, very high single-threaded insertion rates. Thus, the slave can keep up with a much more demanding workload. (Please see our benchmark page for details.)
Q7. You have developed a technique called “Fractal Tree indexes.” What is it? How does it compare with standard Tree indexes?
Farach-Colton: Much of this question was answered in question 2. Here, I’ll add that Fractal Tree Indexes are the first write-optimized data structure used as an index in databases. The Log-Structured Merge Tree (LSM) also achieves very high indexing rates, but at the cost of so-called read amplification, which means that indexing is much faster than for B-trees, but queries end up being much slower.
Fractal Tree Indexes are as fast as LSMs for indexing, and as fast as B-trees for queries. Thus, they dominate both.
A final note on SSDs: B-trees are I/O bound, and SSDs have much higher IOPS than rotational disks. Thus, it would seem that B-trees and SSDs are a marriage made in heaven. However, consider the fact that when a single row is modified in a B-tree leaf, the entire leaf must be re-written to disk. So, in addition to the problems of write amplification caused by wear leveling and garbage collection on SSDs, B-trees induce a much greater level of write amplification, because small changes induce large writes. Write-optimized indexes largely eliminate both types of write amplification, as well as potentially extending the life of SSDs.
Q8. What is special about your data compression technique?
Farach-Colton: Compression is like a freight train, slow to get started but effective once it’s had a chance to get up to speed. Thus, one of the biggest factors for how much compression is achieved is how much you compress at a time. Give a compression algorithm some data 4KB at a time, and the compression will be poor. Give it 1MB or more at a time, and the same compression algorithm will do a much better job. Because of the structure of our indexes, we can compress larger blocks at a time.
I should also point out that one of the problems that dogs databases is aging or fragmentation, in which the leaves of the B-tree index get scattered on disk, and range queries become much slower. This is because leaf-to-leaf seeks become slower, when the leaves are scattered. When leaves are clustered, they can be read quickly by exploiting the prefetching mechanisms of disks. The standard solution is to periodically rebuild the index, which can involve substantial down time. Fractal Tree indexes do not age, under any workload, for the same reason that they compress so well: by handling much larger blocks of data than a B-tree can, we are able to keep the data much more clustered, and thus it is never necessary to rebuild an index.
Q9. How is TokuDB different with respect to SchoonerSQL, NuoDB and VoltDB?
Farach-Colton: VoltDB is an in-memory database. The point is therefore to get the highest OLTP performance for data that is small enough to fit in RAM.
By comparison, TokuDB focuses on very large data sets — too big for RAM — for clients who are interested in moving away from the OLAP/OLTP dichotomy, that is, for clients who want to keep data up to date while still querying it through rich indexes.
SchoonerSQL has a combination of technologies involving optimization for SSDs as well as scale-out. TokuDB is a scale-up solution that improves performance on a per-node basis.
NuoDB is also a scale-out solution, in this case cloud-based, though I am less familiar with the details of their innovation at a more technical level.
——–
Martín Farach-Colton is a co-founder and chief technology officer at Tokutek. He was an early employee at Google, where he worked on improving performance of the Web Crawl and where he developed the prototype for the AdSense system.
An expert in algorithmic and information retrieval, Prof. Farach-Colton is also a Professor of Computer Science at Rutgers University.
Related Posts
– Interview with Mike Stonebraker. by Roberto V. Zicari on May 2, 2012
##
“In a Fractal Tree Index, all changes — insertions, deletions, updates, schema changes — are messages that get injected into the tree.”
Is there a research or white paper on which Fractal Tree Index is based on? Personally, it is too demanding to understand how immediate changes due to message injections differ from immediate changes due mutating contents of a memory location.
@Vilho,
In a traditional OLTP approach, if you need to modify part of a B-tree, but the part you need to modify is on disk, then you end up doing a full round-trip to disk for a single modification. On the other hand, approaches like OLAP batch up changes and apply them in a group. This makes the updates much more memory efficient, but the changes aren’t available for a while.
The advantage of fractal trees is that the changes to tree are scheduled so that they are in bunches, and therefore fast (like OLAP) but the changes are on the path of queries, so they affect the queries immediately (like OLTP).
So you get fast updates but also immediate updates. High throughput and lower latency, if you will.
The most dramatic illustration is for messages that change a schema. For us, it’s a broadcast message that gets injected into the root and makes its way down every path to every leaf, eventually. We can schedule these message movements when we have lots of updates to make, and so the cost to do update is small, yes, since it’s just a message injection into the root to notify all queries of the change, the effect is immediate. Using traditional approaches, every leaf would be fetched into memory and changed, even leaves that are never read again. This is why InnoDB can have hours or even days of down time to add a column, whereas TokuDB has no down time and no noticeable performance hit for adding a column
There are several videos of varying lengths available at http://www.tokutek.com/resources/technology/.
The quick version is at http://www.youtube.com/watch?v=DFT3DyUEVJU (8 minutes) and a longer presentation can be found at http://vimeo.com/26471692.
Hope these help.
Martin
I don’t know why you are asking my opinion. I’m not familiar with databases. My focus is on digital image. But any way I can give you some of my thoughts.
Normally seems to be so that digital image databases are quite locked and they work only in the archive house. Often is so that if you take picture out of database the image there is only the image as it self and nothing more. No information over anything. Digital database program has wiped out all information. The picture file contains only minor data such as:
name plus extender=format
dimensions
resolution
file size
Customer might want to know at least:
color profile
deep of compressing if any
description
photographer name
date and time of creation
The other way is to use metadata features of an modern digital image. It is not so common knowledge that there is a possibleties to store information in side the file. This is ASCII so its doesn’t take room. The best part is that biggest amount of this kind of information can be automatically saved. There is several parts of metadata:
Exif – technical information, which contains basic image data, names and version of used programs, thumbnale picture and so on (automatically)
IPTC – archive data, which contains photographers name, descriptions, copyright notes, owner and storing place (must be write down)
XMP – partly both, but also you can store image handling process (Adobe’s own data format)
this is constantly growing area – geodata (now adays also shooting direction), face recognition and so on (automatically partly=must be proved by human)
In this new system metadata will ride with image all over the world. You don’t need heavy database system. Information is possible to read from images. But if archive authorities will they might want keep storing database. Also original images needs place to keep.
Now a days I would keep raw-dng store. It is possible save metadata already in raw file. You need only develop pictures out from digital negatives. Tiff store is not necessarily needed. In that way you spare a lot of storing room.
@Martin, thanks. I found good material about it.