Interview with Jonathan Ellis, project chair of Apache Cassandra.
“You’re going to see these databases attempting to make things easy that today are possible but difficult.” –Jonathan Ellis.
This interview is part of my series of interviews on the evolving market for Data Management Platforms. This time, I had the pleasure of interviewing Jonathan Ellis, project chair of Apache Cassandra.
RVZ: In my understanding of how the market of Data Management Platforms is evolving, I have identified three phases:
Phase I– New Proprietary data platforms developed: Amazon (Dynamo), Google (BigTable). Both systems remained proprietary and are in use by Amazon and Google.
Phase II- The advent of Open Source Developments: Apache projects such as Cassandra, Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. A multitude of new data platforms emerged.
Phase III– Evolving Analytical Data Platforms. Hadoop for analytics. Companies such as Cloudera, but also IBM's BigInsights, are in this space.
Q1. Why did Amazon and Google have to develop their own database systems? Why didn’t they use or “adjust” existing database systems?
Jonathan Ellis: Google and Amazon were really breaking new ground when they were working on Bigtable and Dynamo. The thing you have to remember is that the problem they were trying to solve was high-volume transactional systems: while companies like Teradata have been building large-scale databases for some time, these were analytical databases and not designed for high query volumes with low latency.
The state-of-the-art at the time for transactional systems was horizontal and vertical partitioning customized for a given application, built on a traditional database like Oracle. These systems were not application-transparent, meaning repartitioning was a major undertaking, nor were they reusable from one application to another.
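To illustrate why repartitioning such hand-built systems was a major undertaking, here is a minimal sketch of application-level horizontal partitioning. It assumes the simplest possible scheme, a hash of the key modulo the number of shards; the function names and the `user:` key format are hypothetical and not taken from any system mentioned in this interview.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Pick a shard with hash-modulo routing (assumed scheme, for illustration)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def fraction_moved(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(keys)


# Hypothetical key set; real applications would route on their own keys.
keys = [f"user:{i}" for i in range(10_000)]
print(f"{fraction_moved(keys, 4, 5):.0%} of keys change shards going from 4 to 5")
```

With this kind of routing, adding a single shard relocates most of the keys, and every application that embeds the routing logic has to be changed and migrated together.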
Q2. I defined Phase II as the advent of Open Source Developments with Apache projects such as Cassandra, Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. A multitude of new data platforms emerged. Any comment on this?
Jonathan Ellis: Note that Hadoop is a different kind of animal here: Hadoop *is* an analytical system, not a real-time or transaction-oriented one.
Q3. How was it possible that Amazon's and Google's proprietary systems were used as input for Open Source Projects?
Jonathan Ellis: Google and Amazon both published whitepapers on Bigtable and Dynamo which were very influential for the open source systems that started appearing soon afterward. Since then, the open-source systems have of course continued to evolve; today Cassandra—which began as a fusion of Bigtable and Dynamo concepts—includes features described by neither of its ancestors, such as distributed counters and column indexes.
Q4. Why did Facebook and LinkedIn develop Open Source data platforms and not proprietary software?
Jonathan Ellis: Probably a combination of two reasons:
– They recognized that with the kind of head start that Google and Amazon had, it would be difficult to achieve technical parity without leveraging the efforts of a larger development community.
– They don’t see infrastructure in and of itself as the place where they gain their competitive advantage. You can see more evidence of this in Facebook’s recent announcement that they were opening up the plans for their newest data center. So it goes beyond just source code.
Q5. Not everyone has data and scalability requirements such as Amazon and Google. Who currently needs such new data management platforms, and why?
Jonathan Ellis: Again, I think the piece that’s new here is the emphasis on high-volume transaction processing. Ten years ago you didn’t see this kind of urgency around transaction processing–some large web sites like eBay were concerned already, but it feels like there’s been a kind of Moore’s law of data growth that’s been catching more and more companies, both on the web and off. Today DataStax has customers like startups Backupify and Inkling, as well as companies you might expect to see like Netflix.
Q6. Is there a common denominator between the business models built around such open source projects?
Jonathan Ellis: At the most basic level, there’s only so many options to choose from.
You have services and support, and you have proprietary products built on top or around the open source core. Everyone is doing some combination of those.
Phase III– Evolving Analytical Data Platforms
Q7. Is Business Intelligence becoming more like Science for profit?
Jonathan Ellis: If you mean that a lot of teams are now trying to commercialize technologies that were originally developed without that kind of focus, then yes, we’ve definitely seen a lot of that in the last couple of years.
Q8. Who are the main actors in the Platform Ecosystem?
Jonathan Ellis: On the real-time side, Cassandra’s strongest competitors are probably Riak and HBase. Riak is backed by Basho, and I believe Cloudera supports HBase although it’s not their focus.
For analytics, everyone is standardizing on Hadoop, and there are a number of companies pushing that.
DataStax is unique here in that our just-released Brisk project gives you the best of both worlds: a Hadoop powered by Cassandra so you never have to do an ETL process before running an analytical query against your real-time data, while at the same time keeping those workloads separate so that they don’t interfere with each other.
Q9. What role will RDBMS play in the future? What about Object Databases, do they have a role to play?
Jonathan Ellis: Relational databases will continue to be the main choice when you need ACID semantics and you have a relatively small data or query volume that you care about. Many of our customers continue to use a relational database in conjunction with Cassandra for things like user registration.
To be honest, I don’t see object databases being able to ride the NoSQL wave out of their niche. The popularity of NoSQL options isn’t from their rejection of the SQL language per se, but because that was part of what they left behind when they added features that are starting to matter even more than query language, primarily scalability.
Q10. Looking at three elements: Data, Platform, Analysis, what are the main research challenges ahead? And what are the main business challenges ahead?
Jonathan Ellis: I see the technical side as more engineering than R&D. You’re going to see these databases attempting to make things easy that today are possible but difficult. Cassandra’s column indexes are an example of this–you could use Cassandra to look up rows by column values in the past, but you had to maintain those indexes manually, in application code. Today Cassandra can automate that for you.
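The difference between maintaining an index in application code and having the store maintain it can be sketched with a toy in-memory model. This is not Cassandra's API: the classes `ManualIndexStore` and `AutoIndexStore`, the `state` column, and the row keys are all hypothetical, chosen only to show where the bookkeeping lives.

```python
from collections import defaultdict


class ManualIndexStore:
    """Toy row store; the application keeps its own index on 'state'."""

    def __init__(self):
        self.rows = {}                     # row key -> {column: value}
        self.by_state = defaultdict(set)   # hand-built index: state -> row keys

    def put(self, key, columns):
        old = self.rows.get(key)
        # The application must remember to drop the stale index entry itself.
        if old is not None and "state" in old:
            self.by_state[old["state"]].discard(key)
        self.rows[key] = dict(columns)
        if "state" in columns:
            self.by_state[columns["state"]].add(key)

    def find_by_state(self, state):
        return sorted(self.by_state.get(state, ()))


class AutoIndexStore:
    """Toy row store that maintains indexes for declared columns itself."""

    def __init__(self, indexed_columns):
        self.rows = {}
        self.indexes = {col: defaultdict(set) for col in indexed_columns}

    def put(self, key, columns):
        old = self.rows.get(key, {})
        for col, index in self.indexes.items():
            if col in old:
                index[old[col]].discard(key)    # remove stale entry
            if col in columns:
                index[columns[col]].add(key)    # add current entry
        self.rows[key] = dict(columns)

    def find(self, column, value):
        return sorted(self.indexes[column].get(value, ()))


store = AutoIndexStore(indexed_columns=["state"])
store.put("alice", {"state": "TX"})
store.put("bob", {"state": "CA"})
store.put("alice", {"state": "CA"})   # update; the store fixes the index
print(store.find("state", "CA"))
```

In the manual version every write path in the application has to repeat the index bookkeeping correctly; in the automated version the caller only declares which columns are indexed.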
This ties into the business side as well: the challenge for everyone is to move beyond the early adopter market and go mainstream. Ease of use will be a big part of that.
Q11. What are the main future developments? Anything you wish to add?
Jonathan Ellis: This is an exciting space to work in right now because the more we build, the more we can see that we’ve barely scratched the surface so far. The feature we’re working on right now that I’m personally most excited about is predicate push-down for Brisk: allowing Hive, a data warehouse system for Hadoop, to take advantage of Cassandra column indexes.
Readers curious about Brisk–which is fully open-source–can learn more here.
Thanks for the questions!
Jonathan Ellis is CTO and co-founder of DataStax (formerly Riptano), the commercial leader in products and support for Apache Cassandra. Prior to DataStax, Jonathan built a multi-petabyte, scalable storage system based on Reed-Solomon encoding for backup provider Mozy. Jonathan is project chair of Apache Cassandra.