Measuring the scalability of SQL and NoSQL systems.
“Our experience from PNUTS also tells that these systems are hard to build: performance, but also scaleout, elasticity, failure handling, replication.
You can’t afford to take any of these for granted when choosing a system.
We wanted to find a way to call these out.”
— Adam Silberstein and Raghu Ramakrishnan, Yahoo! Research.
A team of researchers composed of Adam Silberstein, Brian F. Cooper, Raghu Ramakrishnan, Russell Sears, and Erwin Tam, all at Yahoo! Research Silicon Valley, developed a new benchmark for Cloud Serving Systems, called YCSB.
They published their results in the paper “Benchmarking Cloud Serving Systems with YCSB” at the 1st ACM Symposium on Cloud Computing, ACM, Indianapolis, IN, USA (June 10-11, 2010).
They open-sourced the benchmark about a year ago.
The YCSB benchmark appears to be the best to date for measuring the scalability of SQL and NoSQL systems.
Together with my colleague and odbms.org expert, Dr. Rick Cattell, we interviewed Adam Silberstein and Raghu Ramakrishnan.
Hope you`ll enjoy the interview.
Rick Cattell and Roberto V. Zicari
Q1. What motivated you to write a new database benchmark?
Silberstein, Ramakrishnan: Over the last few years, we observed an explosive rise in the number of new large-scale distributed data management systems: BigTable, Dynamo, HBase, Cassandra, MongoDB, Voldemort, etc. And of course our own PNUTS system .
This field expanded so quickly that there is no agreement on what to call these systems: NoSQL systems, cloud databases, key-value stores, etc. (Some of these terms (e.g., cloud databases) are even applied to Map-Reduce systems such as Hadoop, which are qualitatively different, leading to terminological confusion.)
This trend created a lot of excitement throughout the community of web application developers and among data management developers and researchers. But it also created a lot of debate. Which systems are most stable and mature? Which have the best performance? Which is best for my use case? We saw these questions being asked in the community, but also within Yahoo!.
Our experience with PNUTS tells us there are many design decisions to make when building one of these systems, and those decisions have a huge impact on how the system performs for different workloads (e.g., read-heavy workloads vs. write-heavy workloads), how it scales, how it handles failures, ease of operation and tuning, etc. We wanted to build a benchmark that lets us expose the nuances of each these decisions and their implementations.
Our initial experiences with systems in this space made it very clear that tuning the systems is a challenging problem requiring expert advice from the systems’ developers. Out-of-the-box, systems might be tuned for small-scale deployment, or running batch jobs rather than serving. By creating a serving benchmark, we could create a common “language” for describing the type of workloads web application developers care about, which in turn might bring about tuning best practices for these workloads.
-The space of cloud databases/nosql systems/key-value stores exploded very quickly: HBase, Cassandra, MongoDB, Voldemort…even the space of names describing such systems grew quickly (nosql, cloud db, etc.).
This created a lot of excitement and confusion, both in Yahoo and externally (what do they do, are they stable/mature, what is performance like, etc).
-Each system has very nice things to say about itself. Wanted to make sense of the space so we could learn for ourselves, give advice, etc. Wanted a way for these systems to speak a common language when describing their capabilities.
Our experience from PNUTS also tells that these systems are hard to build: performance, but also scaleout, elasticity, failure handling, replication. You can’t take afford to take any of these for granted when choosing a system. We wanted to find a way to call these out.
We also quickly learned that tuning these systems is difficult. Out-of-box, they might be tuned for small clusters, rather than large (or vice-versa), for different hardware, etc.
And the systems aren’t at the point where they have detailed tuning guides/best practices. While we planned to open-source all along, this motivated us further: put the tuning problem in the hands of the experts by giving them the workload.
Q2. In your paper  you write “The main motivation for developing new cloud serving systems is the difficulty in providing scale out and elasticity using traditional database systems”. Can you please explain this statement? What about databases such as object oriented and XML-based?
Silberstein, Ramakrishnan: Database systems have addressed scale-out, and commercial systems (e.g., high-end systems from the major vendors) do scale out well. However, they typically do not scale to web-scale workloads using commodity servers, and are not designed to be operated in configurations of 1000s of servers, and to support high-availability and geo-replication in these settings.
For example, how does ACID carry over when you scale the traditional systems across data centers? Further, they do not support elastic expansion of a running system—When adding servers, how do you offload data to them? How do you replicate for fault tolerance? The new systems we see are making entirely new design tradeoffs. For example, they may sacrifice much of ACID (e.g., it’s strong consistency model), but make it easy and cost-effective to scale to a large number of servers with high-availability and elastic expandability.
These considerations are largely orthogonal to the underlying data model and programming paradigm (i.e., XML or object-oriented systems), though some of the newer systems have also innovated in the areas of flexible schema evolution and nested columns.
Q3. What is the difference between your YCSB benchmark and well known benchmarks such as TPC-C?
Silberstein, Ramakrishnan: At a high level, we have a lot in common with TPC-C and other OLTP benchmarks.
We care about query latency and overall system throughput. When take a closer look, however, the queries are very different. TPC-C contains several diverse types of queries meant to mimic a company warehouse environment. Some queries execute transactions over multiple tables; some are more heavyweight than others.
In contrast, the web applications we are benchmarking tend to run a huge number of extremely simple queries. Consider a table where each record holds a user’s profile information. Every query touches only a single record, likely either reading it, or reading+writing it. We do include support for skewed workloads; some tables may have active sets accessed much more than others. But we have focused on simple queries, since that is what we see in practice.
Another important difference is that we have emphasized the ease of creating a new suite of benchmarks using the YCSB framework, rather than the particular workload that we have defined, because we feel it is more important to have a uniform approach to defining new benchmarks in this space, given its nascent stage.
Q4. In your paper (1) you focus your work on two properties of “Cloud” systems: “elasticity” and “performance”, or as you write “simplified application development and deployment”. Can you please explain what do you mean with these two properties?
Silberstein, Ramakrishnan: – Performance refers to the usual metrics of latency and throughput, with the ability to scale out by adding capacity. Elasticity refers to the ability to add capacity to a running deployment “on-demand”, without manual intervention (e.g., to re-shard existing data across new servers).
Q5. Cloud systems differ for example in the data models they support (e.g., column-group oriented, simple hashtable based, document based). How do you compare systems with such different data models?
Silberstein, Ramakrishnan: In our initial work we focused on the commonalities among the systems we benchmarked (PNUTS, Cassandra, HBase). We ran the systems as row stores holding ordered data.
The YCSB “core workload” accesses entire records at a time, and executes range queries over small portions of the table.
Though Cassandra and HBase both support column-groups, we did not exercise this feature.
It is of course possible to design workloads that test features like column groups, and we encourage users to do so—as we noted earlier, one of our goals was to make it easy to add benchmarks using the YCSB framework. One way to compare systems with different feature sets is to disqualify systems lacking the desired features.
Rather than disqualify a system because it lacks column groups, for example, it may make sense to compare the system as-is against others with column groups.
It is then possible to measure the cost of reading all columns, even when only one is needed. On the other hand, if a workload must execute range scans over a continuous set of of keys, there is no reasonable way to run that workload against a hash-based store.
Q6. You write that the focus of your work is on “serving” systems, rather than “batch” or “analytical”. Are there any particular reasons for doing that? What are the challenges in defining your benchmark?
Silberstein, Ramakrishnan: This is analogous to having TPC-C and TPC-H benchmarks in the conventional database setting.
Many prominent systems in cloud computing, either target serving workloads or target analytical/batch workloads. Hadoop is the most well-known example of a batch system. Some systems, such as HBase, support both. We believe serving and batch are extremely important, but very different. Serving needs its own treatment and we wanted to very clearly call that out in our benchmark. It is easy to confuse the two when looking at benchmark results. In both settings we talk a lot about throughput, e.g., number of records read or written per second.
But in batch, where the workload reads or writes an entire table, those records are likely sequentially on disk. In serving, the workload does many random reads and writes, so those records are likely scattered on disk. Disk I/O patterns have a huge impact on throughput, and so we must understand whether a workload is batch or serving before we can appreciate the benchmark results.
-These are really different spaces, although some applications combine both. We believe both are important in the cloud space.
-One reason this space gets confusing is because most systems we’ve looked at have some batch functionality (often via Hadoop integration). (We know of cases where YCSB is used to drive both a serving and batch workload). And there are even use cases that require both. But there are many important use cases that do only serving, and in this work we wanted to benchmark that specifically.
-Another reason we started with just serving is because it is very easy to confuse the two when discussing performance.
For example, in both settings we can talk about throughput numbers (ops/second).
For serving you are probably talking about doing a huge number of operations on a random set of records. For batch you are talking about doing 1 operation on a huge number of records, and they are probably physically clustered together. So we’re talking about, for example, random reads vs. sequential scans. The numbers will be very different and its important to look at the results for the case you care about.
Q7. Your YCSB benchmark consists of two tiers: Tier-1 Performance and Tier-2 Scaling. Can you please explain what these two Tiers are and what do you expect to measure with them?
Silberstein, Ramakrishnan: The performance tier is the bread-and-butter of our initial YCSB work; for a fixed system size, we want to see how much load we can put on each system while still getting low latency.
This is one of the key questions, if not the key question, an application developer asks when choosing a data system: how much of my expected workload does the system support per-server, and so how many servers am I going to have to buy? No matter what system we benchmark, the performance results have a similar feel. We plot throughput on the x-axis and (average or 95th percentile) latency on the y-axis. As we push throughput higher, latency grows gradually. Eventually, as the system the system reaches saturation, latency jumps dramatically, and then we can push throughput no higher. It is easy to compare these graphs across systems and get a feel for what kind of load each can support. It is also worthwhile to verify that while systems slow down at saturation point, they nonetheless remain stable.
The big selling point of cloud serving systems is their ability to scale upward by adding more servers. If the system offers low latency at small scale, as we proportionally increase workload size and the number of servers, latency should remain constant. If this is not the case, this is a hint the system might have bottlenecks that surface at scale.
In our scaling tier we also make the point of distinguishing scalability and elasticity. While good scalability means the system runs well over large workloads if pre-configured with the appropriate number of servers, elasticity means the system can actually grow from small to large scale by adding servers while remaining online.
In our initial work we observed systems doing well on scalability, but having erratic behavior when we added capacity online and increased workloads (i.e., they were often weak on elasticity).
Q8. How are the workloads defined in your YCSB benchmark? How do they differ from traditional TCP-C workloads?
Silberstein, Ramakrishnan: Our workloads consist of much simpler queries than TPC-C. There are just a handful of knobs for the user to adjust. The user may specify size settings such as existing database size and record size, and distribution settings, such as the relative proportions of insert, reads, updates, delete, and scan operations and distribution of accesses over the key space. These simple settings let us characterize many important workloads we find at Yahoo!, and we provide a few simple ones as part of our Core Workload.
Of course, users can develop much more complex workloads that look more like TPC-C, and our code is designed to make that easy.
Q9. You used your benchmark to measure four different systems: Cassandra, HBase, your cloud system PNUTS and your implementation of a sharded MySQL. Why did you choose these four systems? What other systems would you have liked to include, or would you like to see someone run YCSB on?
Silberstein, Ramakrishnan: When we started our work there were certainly more than four systems to choose from, and there are even more now. We limited ourselves to just a handful to avoid spreading ourselves too thin. This was a wise decision, since figuring out how to run and tune each system for serving was a full-time job (providing even more motivation to produce the benchmark). Ultimately, we made our choice of systems based on interest at Yahoo! and our desire to compare a collection of systems that made fundamentally different design decisions.
We built PNUTS here, so that is an obvious choice, and many Yahoo! developers are curious about the features and performance of Cassandra and HBase. PNUTS uses a buffer page based architecture, while Cassandra and HBase (itself a clone of BigTable) are based on differential-file architectures.
We are happy with our initial choice of systems. We got a lot of interest in our work and have gained a lot of users and contributors. Those contributors have done a variety of work, including improvements to clients of the systems we initially benchmarked and adding new clients. As of this writing, we have clients for Voldemort, MongoDB, and JDBC.
Q10. In your paper you present the main results of comparing these four systems. Can you please summarize the main results you obtained and the lessons learned?
Silberstein, Ramakrishnan: Our impact comes not from the actual numbers we collected; this was just a snapshot in time for each system. Our key contribution was creating an apples-to-apples environment that made ongoing comparisons feasible. We encouraged some competition between the systems and produced a tool for the systems to compare themselves against in the future.
That said, we had some interesting results. We know that the systems we tested made different design decisions to optimize reads or writes, and we saw these decisions reflected in the results. For a 50% read 50% write workload, Cassandra achieved the highest throughput. For a 95% read workload, PNUTS tied Cassandra on throughput while having better latency.
All systems we tested advertised scalability and elasticity. The systems all performed well at scale, but we noticed hiccups when growing elastically.
Cassandra took an extremely long time to integrate a new server, and had erratic latencies during that time. HBase is extremely lazy about integrating new servers, requiring background compactions to move partitions to them.
This was over a year ago, and these systems may well have overcome these issues by now.
-Actual numbers not the important thing here…this was just a snapshot in time of these systems. The key thing is that we created an apples-to-apples setup that made comparisons possible, created a bit of competition, and created something for the systems to evaluate themselves against.
-Result #1. We knew the systems made fundamental decisions to optimize writes or optimize reads.
It was nice to see these decisions show up in the results.
Example: in a 50/50 workload, Cassandra was best on throughput. In a 95% read workload, PNUTS caught up and had the best latencies.
-Result #2. The systems may advertise scalability and elasticity, but this is clearly a place where the implementations needed more work. Ref. elasticity experiment. Ref. HBase with only 1-2 nodes.
-Lesson. We are in the early stages. The systems are moving fast enough that there is no clear guidance on how to tune each system for particular workloads.
So figuring out how to do justice to each system is tricky.
Q11. Please briefly explain the experiment set up.
Silberstein, Ramakrishnan: We simply performed a bake-off. We had a collection of server class machines, and installed and ran each cloud serving system on them. We spent a huge amount of time configuring each system to perform their best on our hardware. We are in the early days of cloud serving systems, and there are no clear best practices for how to tune these systems to our workloads. We also took care to allocate memory fairly across each system.
For example, HBase runs on top of HDFS, and both HBase and HDFS must be given memory, and the sum must equal what we grant Cassandra.
Finally we made sure to load each system with enough data to avoid fitting all records in memory; all of the systems we benchmarked perform dramatically differently when running solely in memory vs. not.
We were more interested in performance in the latter case, so ran in that mode.
Q12. You plan to extend the benchmark to include two additional properties: availability and replication. Can you please explain how you intend to do so?
Silberstein, Ramakrishnan: These are tricky, but important, tiers to build. We want to measure a variety of things. What is the added cost of replication? What happens to availability under different failure scenarios? Can the records be read and/or written, and is there a cost penalty when doing so? What is the record consistency model during failures, and can we quantify the differences between different failure models?
These tiers expose areas where system designers can make very different design decisions. They might ensure write durability by replicating writes on-disk or in-memory only. They might replicate intra-data center or cross-data center. During network partitions, they might prioritize consistency and make data read-only, or might prioritize availability, and use eventual consistency to make record replicas converge later.
Q13. Your benchmark appears to be the best to date for measuring the scalability of SQL and NoSQL systems; it would be great if others would run it on their systems. Do you have ideas to encourage that to happen?
Silberstein, Ramakrishnan: We think we have been fairly successful here already. We open-sourced the benchmark about a year ago. It is available here.
Before we released the benchmark, we shared drafts of our results with the developer/user lists of each benchmarked system a few times prior to publication, and each time got a lot of interest and questions: what do our machines look like, what tuning parameters did we use, etc. This interest is a great sign. First, as you point out, we’re filling an unmet need.
Second, our audience recognizes our Yahoo! workloads are important and that we as Yahoo! researchers are doing a rigorous comparison.
By the time we released the benchmark we had many people waiting to use it.
Today, we have many users and contributors. We see comments on system mailing lists like “I am running YCSB Workload A and seeing XXX ops/sec and that seems too low. How should I tune my setup?”
Q14. A simpler benchmark might be easier for others to reproduce, getting more results. Is there a simple subset of your benchmark you’d suggest that captures most of the important elements?
Silberstein, Ramakrishnan: The Core Benchmark is already very simple. It uses synthetic data and provides just a few knobs for users to adjust. We and others users have done more complex testing like feed in actual production logs in YCSB, but that is not part of Core. Within Core, we should mention there are 3 very simple workloads that can execute on the simplest hash-based key value store. Workloads A, B, and C execute only inserts, reads, and updates. They do not execute range queries.
Q15. Open source NoSQL projects may not have half a dozen dedicated servers available to reproduce your results on their systems. Do you have suggestions there? Is it possible to run an accurate benchmark on a leased cloud platform?
Silberstein, Ramakrishnan: Funny you ask this because we felt the half dozen servers we ran on was not enough. We went to great lengths to borrow 50 homogeneous servers for a short time to run 1 large-scale experiment that showed YCSB itself scales.
Running YCSB on a leased platform like Amazon EC2 is straightforward.
The big question is determining if the results are accurate and repeatable. Certainly if the machines are VMs on non-dedicated servers, that might not be the case. We have heard if you request the largest size VMs, EC2 might allocate dedicated servers. One of the benefits of working at a large Internet company is that we have not had to try this ourselves!
Q16. Do you think that flash memory may “tip the balance” in the best data platforms? It cannot simply be treated as a disk, nor as RAM, because it has fundamentally different read/write costs.
Silberstein, Ramakrishnan: As SSD hardware has matured, we have noticed two trends. First, it is quite easy to build a machine with enough SSDs to run into CPU, bus and controller bottlenecks. This has led to rewrites of most storage stacks, cloud based or not.
Second, SSDs use extremely sophisticated log structured techniques to mask the cost of writes. Some of these techniques, such as data deduplication and compression only help certain workloads. The big question in this space is how higher-level database indexes will interact with the lower-level log structured system.
It could be that future hardware devices will mask the cost of random writes so well that higher-level log structured techniques will become redundant. On the other hand, higher-level log structured approaches have more computational resources at their disposal, and also have more information about the application. These advantages could mean that they will always be able to significantly improve upon hardware-based approaches.
Adam Silberstein, Yahoo! Research.
Research Area: Web Information Management.
My current interests are in large distributed data systems. My research interests are in the general area of large scale data management. Specifically, this includes both online transaction processing, and analytics, and bridging the gap between them,
as well as techniques for generating user feeds in social networks. I joined Yahoo! in August 2007, after finishing my Ph.D. at Duke University in February 2007.
Raghu Ramakrishnan, Yahoo! Research.
Raghu Ramakrishnan is Chief Scientist for Search and Cloud Platforms at Yahoo!, and is a Yahoo! Fellow, heading the Web Information Management research group. His work in database systems, with a focus on data mining, query optimization, and web-scale data management, has influenced query optimization in commercial database systems and the design of window functions in SQL:1999. His paper on the Birch clustering algorithm received the SIGMOD 10-Year Test-of-Time award, and he
has written the widely-used text “Database Management Systems” (with Johannes Gehrke).
His current research interests are in cloud computing, content optimization, and the development of a “web of concepts” that indexes all information on the web in semantically rich terms. Ramakrishnan has received several awards, including the ACM SIGKDD Innovations Award, the ACM SIGMOD Contributions Award, a Distinguished Alumnus Award from IIT Madras, a Packard Foundation Fellowship in Science and Engineering, and an NSF Presidential Young Investigator Award. He is a Fellow of the ACM and IEEE.
Ramakrishnan is on the Board of Directors of ACM SIGKDD, and is a past Chair of ACM SIGMOD and member of the Board of Trustees of the VLDB Endowment. He was Professor of Computer Sciences at the University of Wisconsin-Madison, and was founder and CTO of QUIQ, a company that pioneered crowd-sourcing, specifically question-answering communities, powering Ask Jeeves’ AnswerPoint as well as customer-support for companies such as Compaq.
Dr. R. G. G. “Rick” Cattell is an independent consultant in database systems and engineering management. He previously worked as a Distinguished Engineer at Sun Microsystems, mostly recently on open source database systems and horizontal database scaling. Dr. Cattell served for 20+ years at Sun Microsystems in management and senior technical roles,
and for 10+ years in research at Xerox PARC and at Carnegie-Mellon University.
Dr. Cattell is best known for his contributions to middleware and database systems, including database scalability, enterprise Java, object/relational mapping, object-oriented databases, and database interfaces. He is the author of several dozen papers and six books. He instigated Java DB and Java 2 Enterprise Edition, and was a contributor to a number of the Enterprise Java APIs and products. He previously led development of the Cedar DBMS at Xerox PARC, the Sun Simplify database GUI, and SunSoft’s ORB-database integration. He was a founder of SQL Access (a predecessor to ODBC), the founder and chair of the Object Data Management Group (ODMG), the co-creator of JDBC, the author of the world’s first monograph on object/relational and object databases, and a recipient of the ACM Outstanding PhD Dissertation Award.
 Benchmarking Cloud Serving Systems with YCSB. (.pdf)
Adam Silberstein, Brian F. Cooper, Raghu Ramakrishnan, Russell Sears, Erwin Tam, Yahoo! Research.
1st ACM Symposium on Cloud Computing, ACM, Indianapolis, IN, USA (June 10-11, 2010)
 PNUTS: Yahoo!’s Hosted Data Serving Platform (.pdf)
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni, Yahoo! Research.
The paper describes PNUTS/Sherpa, Yahoo’s record-oriented cloud database.
Open Source Code
For further readings