Hadoop and NoSQL: Interview with J. Chris Anderson
“The missing piece of the Hadoop puzzle is accounting for real time changes. Hadoop can give powerful analysis, but it is fundamentally a batch-oriented paradigm.” — J. Chris Anderson.
How is Hadoop related to NoSQL databases? What are the main performance bottlenecks of NoSQL data stores? On these topics I did interview, J. Chris Anderson co-founder of Couchbase.
RVZ
Q1. In order to analyze Big Data, the current state of the art is a parallel database or NoSQL data store, with a Hadoop connector.
What about performance issues arising with the transfer of large amounts of data between the two systems? Can the use of connectors introduce delays, data silos, increase TCO?
Chris Anderson : The missing piece of the Hadoop puzzle is accounting for real time changes. Hadoop can give powerful analysis, but it is fundamentally a batch-oriented paradigm. Couchbase is designed for real time applications (with all the different trade-offs that implies) yet also provides query-ability, so you can see inside the data as it changes.
We are seeing interesting applications where Couchbase is used to enhance the batch-based Hadoop analysis with real time information, giving the effect of a continuous process.
So hot data lives in Couchbase, in RAM (even replicas in RAM for HA fast-failover). You wouldn’t want to keep 3 copies of your Hadoop data in RAM, that’d be crazy.
But it makes sense for your working set.
And this solves the data transfer costs issue you mention, because you essentially move the data out of Couchbase into Hadoop when it cools off.
That’s much easier than maintaining parallel stores, because you only have to copy data from Couchbase to Hadoop as it passes out of the working set.
For folks working on problems like this, we have a Sqoop connector and we’ll be talking about it with Cloudera at our CouchConf in San Francisco on September 21.
Q2. Wouldn’t a united/integrated platform (data store + Hadoop) be a better solution instead?
Chris Anderson : It would be nice to have a unified query language and developer experience (not to mention goodies like automatically pulling data back
out of Hadoop into Couchbase when it comes back into the working set). I think everyone recognizes this.
We’ll get there, but in my opinion the primary interface will be via the real time store, and the Hadoop layer will become a commodity. That is why there is so much competition for the NoSQL brass ring right now.
Q3. Could you please explain in your opinion the tradeoff between scaling out and scaling up? What does it mean in practice for an application
domain?
Chris Anderson : Scaling up is easier from a software perspective. It’s essentially the Moore’s Law approach to scaling — buy a bigger box. Well, eventually you run out of bigger boxes to buy, and then you’ve run off the edge of a cliff. You’ve got to pray Moore keeps up.
Scaling out means being able to add independent nodes to a system. This is the real business case for NoSQL. Instead of being hostage to Moore’s Law, you can grow as fast as your data. Another advantage to adding independent nodes is you have more options when it comes to matching your workload. You have more flexibility when you are running on commodity hardware — you can run on SSDs or high-compute instances, in the cloud, or inside your firewall.
Q4. James Phillips a year ago said that “it is possible we will see standards begin to emerge, both in on-the-wire protocols and perhaps in query languages, allowing interoperability between NoSQL database technologies similar to the kind of interoperability we’ve seen with SQL and relational database technology.” What is your take now?
Chris Anderson : That hasn’t changed but the industry is still young and everyone is heads-down on things like reliability and operability. Once these products become more mature there will be time to think about standardization.
Q5. There is a scarcity of benchmarks to substantiate the many claims made of scalabilty of NoSQL vendors. NoSQL data stores do not qualify for the TPC-C benchmark, since they relax ACID transaction properties. How can you then measure and compare the performance of the various NoSQL data stores instead?
Chris Anderson : I agree. Vendors are making a lot of claims about latency, throughput and scalability without much proof. There are a few benchmarks starting
to trickle out from various third parties. Cisco and SolarFlare published one on Couchbase (See here ) and are putting other vendors through the same tests. I know there will be other third party benchmarks comparing Couchbase, MongoDB, and Cassandra that will be coming out soon. I think the Yahoo YCSB benchmarks will be another source of good comparisons.
There are bigger differences between vendors than people are aware of and we think many people will be surprised by the results.
Q6. What are in your opinion the main performance bottlenecks for NoSQL data stores?
Chris Anderson : The three classes of bottleneck correspond to the major areas of hardware: network, disk, and memory. Couchbase has historically been very
fast at the network layer – it’s based on Memcached which has had a ton of optimizations for interacting with network hardware and protocols.
We’re essentially as fast as one can get in the network layer, and I believe most NoSQL databases that use persistent socket connections are also free of significant network bottlenecks. So the network is only a bottleneck for REST or other stateless connection models.
Disk is always going to be the slowest component as far as the inherent latencies of non-volatile storage, but any high-performance database will paper over this by using some form of memory caching or memory-based storage. Couchbase has been designed specifically to decouple the disk from the rest of the system. In the extreme, we’ve seen customers survive prolonged disk outages while maintaining availability, as our memory layer keeps on trucking, even when disks become unresponsive. (This was during the big Amazon EBS outage that left a lot of high-profile sites down due to database issues.)
Memory may be the most interesting bottleneck, because it is the source of non-determinism in performance. So if you are choosing a database for performance reasons you’ll want to take a look at how it’s memory layer is architected.
Is it decoupled from the disk? Is it free of large locks that can pause unrelated queries as the engine modifies in-memory data structures? Over time does it continue to perform, or does the memory layout become fragmented? These are all problems we’ve been working on for a long time at Couchbase, and we are pretty happy with where we stand.
Q7. Couchbase is the result of the merger (more then one year ago) of CouchOne(document store) and Membase (key-value store).How has your product offering changed since the merge happened?
Chris Anderson : Our product offering hasn’t changed a bit since the merger. The current GA product, Couchbase Server 1.8.1, is essentially a continuation of the Membase Server 1.7 product line. It is a key value database using the
binary-memcached interface. It’s in use all around the web, like Orbitz, LinkedIn, AOL, Zynga, and lots of other companies.
With our upcoming 2.0 release, we are expanding from a key value database to a document database. This means adding query and index support. We’re even including an Elastic Search adapter and experimental geographic indices. In addition, 2.0 adds cross-datacenter replication support so you can provide high-performance access to the data at multiple points-of-presence.
Q8. How do you implement concurrency in Couchbase?
Chris Anderson : Each node is inherently independent, and there are no special nodes, proxies, or gatekeepers. The client drivers running inside your application server connect directly to the data-node for a particular item, which gives low-latency but also greater concurrency, a given application server will be talking to multiple backend database nodes at any given time.
For the memory layer, we are based on memcached, which has a long history of concurrency optimizations. We support the full memcached feature set, so operations like CAS write, INCR and DECR are available, which is great for concurrent workloads. We’ve also added some extensions for locking, which facilitates reading and updating an object-graph that is spread across multiple keys.
At the disk layer, for Couchbase Server 2.0, we are moving away from SQLite, toward our own highly concurrent storage format. We’re append-only, so once data is written to disk, there’s no chance of corruption. The other advantage of tail-append writes is that you can do all kinds of concurrent reads of the file, even while writes are happening. For instance a backup can be as easy as `cp` or `rsync` (although we provide tools to manage backing up an entire cluster).
Q9. Couchbase does not support SQL queries, how do you query your database instead? What are the lessons learned so far from your users?
Chris Anderson : Our incremental indexing system is designed to be native to our JSON storage format. So the user writes JavaScript code to inspect the document, and pick out which data to use as the index key. It’s essentially putting the developer directly in touch with the index data structure, so while it sounds primitive, there is a ton of flexibility to be had there. We’ve got a killer optimization for aggregation operations, so if you’ve ever been burned by a slow GROUP BY query, you might want to take a look.
Despite all the power, we know users are also looking for more traditional query approaches. We’re working on a few things in this area.
The first one you will see is our Elastic Search integration, which will simplify querying your Couchbase cluster. Elastic Search provides a JSON-style query API, and we’ve already seen many of our users integrate it with Couchbase, so we are building an official adapter to better support this use case.
Q10 How do you handle both structured and unstructured data at scale?
Chris Anderson : At scale, all data is messy. I’ve seen databases in the 4th normal form accumulate messy errors, so a schema isn’t always a protection. At scale, it’s all about discovering the schema, not imposing it.
If you fill a Couchbase cluster with enough tweets, wall posts, and Instagram photos, you’ll start to see that even though these APIs all have different JSON formats, it’s not hard to pick out some things in common. With our flexible indexing system, we see users normalize data after the fact, so they can query heterogeneous data, and have it show up as a result set that is easy for the application to work with.
This fits with the overall model of a document database: rather than try to “model the world” with a relational schema, your aim becomes to “capture the user’s intent” and make sense of it later. When your goal is to scale up to tens of millions of users (in maybe only a few days), the priority becomes to capture the data and provide a compelling high-performance user experience.
Q11 Couchbase is sponsor of the Couchbase Server NoSQL database open source project. How do you ensure participation of the open source developers community and avoid incompatible versions of the system? Are you able to with an open source project to produce a highly performant system?
Chris Anderson : All our code is available under the Apache 2.0 license, and I’d wager that we’ve never heard of the majority of our users, much less asked them
for money. When someone’s business depends on Couchbase, that’s when they come to us, so I’m comfortable with the open source business model. It has some distinct advantages over the Oracle style license model.
The engineering side of me admits that we haven’t always been the best at engendering participation in the Couchbase open source project. A few months ago, if you tried to compile your own copy of Couchbase, e.g. to provide a patch or bug-fix, you’d be on a wild goose chase for days before you got to the fun part. We’ve fixed that now, but it’s worth noting that open-source friction hurts in more ways than one, as smoothing the path for new external contributors also means new engineering hires can get productive on our tool-chain faster.
So I’ve taken a personal interest in our approach to external contributions. The first step is cleaning up the on-ramps, as we already have decent docs for contributing, we just need to make them more prominent. The goal is to have a world-class open-source contributor experience, and we don’t take it lightly.
We do have a history of *very open development* which I am proud of. Not only can you see all the patches that make it into the code base, you can see all the patches that don’t make it through code review, and the comments on them as they are happening. Check out here for an example of how to do open development right.
Q12 How do you compare NewSQL databases, claming to offer both ACID-transaction properties and scalability, with Couchbase?
Chris Anderson : The CAP theorem is well-known these days, so I don’t need to go into the
reasons why NewSQL is an uphill battle. I see NewSQL technologies as primarily aimed at developers who don’t want to learn the lessons about data that the first decade of the web taught us. So there will likely be a significant revenue stream based on applications that need to be scaled up, without being re-written.
But we’ve asked our users what they like about NoSQL, and one of the biggest answers, as important to most of them as scalability and performance, was that they felt more productive with a JSON document model, than with rigid schemas. Requirements change quickly and being able to change your application without working through a DBA is a real
advantage. So I think NewSQL is really for legacy applications, and the future is with NoSQL document databases.
Q13 No single data store is best for all uses. What are the applications that are best suited for Couchbase, and which ones are not?
Chris Anderson : I got into Couch in the first place, because I see document databases as the 80% solution. It might take a few years, but I fully expect the
schema-free document model to be ascendant over the relational model, for most applications, most of the time. Of course there will still be uses for relational databases, but the majority of applications being written these days could work just fine on a document database, and as developer preferences change, we’ll see more and more serious applications running on NoSQL.
So from that angle, you can see that I think Couchbase is a broad solution.
We’ve seen great success in everything from content management systems to ad-targeting platforms, as well as simpler things like providing a high-performance data store for CRUD operations, or more enterprise-focused use cases like offloading query volume from mainframe applications.
A geekier way to answer your question, is to talk about the use cases where you pretty much don’t have a choice, it’s Couchbase or bust. So those use cases are anything where you need consistent sub-millisecond access latency, for instance maybe you have a time-based service level agreement.
If you need consistent high-performance while you are scaling your database cluster, that’s when you need Couchbase.
So for instance social-gaming has been a great vertical for us, since the hallmark of a successful game is that it may go from obscurity to a household name in just a few weeks. Too many of those games crash and burn when they are running on previous-generation technology.
It’s always been possible to build a backend that can handle internet-scale applications, what’s new with Couchbase is the ability to scale from a relatively small backend, to a backend handling hundreds of thousands of operations per second, at the drop of a hat.
Again, the use-cases where it’s Couchbase or bust are a subset of what our users are doing, but they are a great way to illustrate our priorities. We have plenty of users who don’t have high-performance requirements, but they still enjoy the flexibility of a document database.
If you need transactions across multiple data items, with atomic rollback, then you should be using a relational database. If you need foreign-key constraints, you should be using a relational database.
However, before you make that decision, you may want to ask if the performance tradeoffs are worth it for your application. Often there is a way to redesign something to make it less dependent on schemas, and the business benefits from the increased scale and performance you can get from NoSQL may make it a worthwhile tradeoff.
_________________________
Chris Anderson is a co-founder of Couchbase, focussed on the overall developer experience. In the early days of the company he served as CFO and President, but never strayed too far from writing code. His background includes open-source contributions to the Apache Software Foundation as well as many other projects. Before he wrote database engine code, he cut his teeth building applications and spidering the web. These days he gets a kick out of Node.js, playing bass guitar, and enjoying family time.
======
Related Posts
– Measuring the scalability of SQL and NoSQL systems. by Roberto V. Zicari on May 30, 2011
– Next generation Hadoop — interview with John Schroeder. by Roberto V. Zicari on September 7, 2012
Resources
ODBMS.org resources on:
– Big Data and Analytical Data Platforms: Blog Posts | Free Software | Articles | PhD and Master Thesis|