AsterixDB: Better than Hadoop? Interview with Mike Carey
“To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”)”–Mike Carey.
The AsterixDB Big Data Management System (BDMS) is the result of approximately four years of R&D involving researchers at UC Irvine, UC Riverside, and Oracle Labs. The AsterixDB code base currently consists of over 250K lines of Java code that has been co-developed by project staff and students at UCI and UCR.
The AsterixDB project has been supported by the U.S. National Science Foundation as well as by several generous industrial gifts.
Q1. Why build a new Big Data Management System?
Mike Carey: When we started this project in 2009, we were looking at a “split universe” – there were your traditional parallel data warehouses, based on expensive proprietary relational DBMSs, and then there was the emerging Hadoop platform, which was free but low-function in comparison and wasn’t based on the many lessons known to the database community about how to build platforms to efficiently query large volumes of data. We wanted to bridge those worlds, and handle “modern data” while we were at it, by taking into account the key lessons from both sides.
To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”).
We felt that the Big Data world, once the initial Hadoop furor started to fade a little, would benefit from having a platform that could offer things like:
- a flexible data model that could handle data scenarios ranging from “schema first” to “schema never”;
- a full query language with at least the expressive power of SQL;
- support for data storage, data management, and automatic indexing;
- support for a wide range of query sizes, with query processing cost being proportional to the given query;
- support for continuous data ingestion, hence the accumulation of Big Data;
- the ability to scale up gracefully to manage and query very large volumes of data using commodity clusters; and,
- built-in support for today’s common “Big Data data types”, such as textual, temporal, and simple spatial data.
So that’s what we set out to do.
Q2. What was wrong with the current Open Source Big Data Stack?
Mike Carey: First, we should mention that some reviewers back in 2009 thought we were crazy or stupid (or both) to not just be jumping on the Hadoop bandwagon – but we felt it was important, as academic researchers, to look beyond Hadoop and be asking the question “okay, but after Hadoop, then what?”
We recognized that MapReduce was great for enabling developers to write massively parallel jobs against large volumes of data without having to “think parallel” – just focusing on one piece of data (map) or one key-sharing group of data (reduce) at a time. As a platform for “parallel programming for dummies”, it was (and still is) very enabling! It also made sense, for expedience, that people were starting to offer declarative languages like Pig and Hive, compiling them down into Hadoop MapReduce jobs to improve programmer productivity – raising the level much like what the database community did in moving to the relational model and query languages like SQL in the 70’s and 80’s.
One thing that we felt was wrong for sure in 2009 was that higher-level languages were being compiled into an assembly language with just two instructions, map and reduce. We knew from Ted Codd and relational history that more instructions – like the relational algebra’s operators – were important – and recognized that the data sorting that Hadoop always does between map and reduce wasn’t always needed.
Trying to simulate everything with just map and reduce on Hadoop made “get something better working fast” sense, but not longer-term technical sense. As for HDFS, what seemed “wrong” about it under Pig and Hive was its being based on giant byte stream files and not on “data objects”, which basically meant file scans for all queries and lack of indexing. We decided to ask “okay, suppose we’d known that Big Data analysts were going to mostly want higher-level languages – what would a Big Data platform look like if it were built ‘on purpose’ for such use, instead of having incrementally evolved from HDFS and Hadoop?”
Again, our idea was to try and bring together the best ideas from both the database world and the distributed systems world. (I guess you could say that we wanted to build a Big Data Reese’s Cup… :-) )
Q3. AsterixDB has been designed to manage vast quantities of semi-structured data. How do you define semi-structured data?
Mike Carey: In the late 90’s and early 2000’s there was a bunch of work on that – on relaxing both the rigid/flat nature of the relational model as well as the requirement to have a separate, a priori specification of the schema (structure) of your data. We felt that this flexibility was one of the things – aside from its “free” price point – drawing people to the Hadoop ecosystem (and the key-value world) instead of the parallel data warehouse ecosystem.
In the Hadoop world you can start using your data right away, without spending 3 months in committee meetings to decide on your schema and indexes and getting DBA buy-in. To us, semi-structured means schema flexibility, so in AsterixDB, we let you decide how much of your schema you have to know and/or choose to reveal up front, and how much you want to leave to be self-describing and thus allow it to vary later. And it also means not requiring the world to be flat – so we allow nesting of records, sets, and lists. And it also means dealing with textual data “out of the box”, because there’s so much of that now in the Big Data world.
Q4. The motto of your project is “One Size Fits a Bunch”. You claim that AsterixDB can offer better functionality, manageability, and performance than gluing together multiple point solutions (e.g., Hadoop + Hive + MongoDB). Could you please elaborate on this?
Mike Carey: Sure. If you look at current Big Data IT infrastructures, you’ll see a lot of different tools and systems being tied together to meet an organization’s end-to-end data processing requirements. In between systems and steps you have the glue – scripts, workflows, and ETL-like data transformations – and if some of the data needs to be accessible faster than a file scan, it’s stored not just in HDFS, but also in a document store or a key-value store.
This just seems like too many moving parts. We felt we could build a system that could meet more (not all!) of today’s requirements, like the ones I listed in my answer to the first question.
If your data is in fewer places or can take a flight with fewer hops to get the answers, that’s going to be more manageable – you’ll have fewer copies to keep track of and fewer processes that might have hiccups to watch over. If you can get more done in one system, obviously that’s more functional. And in terms of performance, we’re not trying to out-perform the specialty systems – we’re just trying to match them on what each does well. If we can do that, you can use our new system without needing as many puzzle pieces and can do so without making a performance sacrifice.
We’ve recently finished up a first comparison of how we perform on tasks that systems like parallel relational systems, MongoDB, and Hive can do – and things look pretty good so far for AsterixDB in that regard.
Q5. AsterixDB has been combining ideas from three distinct areas — semi-structured data management, parallel databases, and data-intensive computing. Could you please elaborate on that?
Mike Carey: Our feeling was that each of these areas has some ideas that are really important for Big Data. Borrowing from semi-structured data ideas, but also more traditional databases, leads you to a place where you have flexibility that parallel databases by themselves do not. Borrowing from parallel databases leads to scale-out that semi-structured data work didn’t provide (since scaling is orthogonal to data model) and with query processing efficiencies that parallel databases offer through techniques like hash joins and indexing – which MapReduce-based data-intensive computing platforms like Hadoop and its language layers don’t give you. Borrowing from the MapReduce world leads to the open-source “pricing” and flexibility of Hadoop-based tools, and argues for the ability to process some of your queries directly over HDFS data (which we call “external data” in AsterixDB, and do also support in addition to managed data).
Q6. How does the AsterixDB Data Model compare with the data models of NoSQL data stores, such as document databases like MongoDB and Couchbase, simple key/value stores like Riak and Redis, and column-based stores like HBase and Cassandra?
Mike Carey: AsterixDB’s data model is flexible – we have a notion of “open” versus “closed” data types – it’s a simple idea but it’s unique as far as we know. When you define a data type for records to be stored in an AsterixDB dataset, you can choose to pre-define any or all of the fields and types that objects to be stored in it will have – and if you mark a given type as being “open” (or let the system default it to “open”), you can store objects there that have those fields (and types) as well as any/all other fields that your data instances happen to have at insertion time.
Or, if you prefer, you can mark a type used by a dataset as “closed”, in which case AsterixDB will make sure that all inserted objects will have exactly the structure that your type definition specifies – nothing more and nothing less.
(We do allow fields to be marked as optional, i.e., nullable, if you want to say something about their type without mandating their presence.)
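The open/closed distinction described above can be illustrated with a small Python sketch. This is not AsterixDB's API or its type-definition syntax; `RecordType`, `conforms`, and the sample field names are hypothetical, and the check only mimics the rule that open types admit undeclared fields while closed types reject them.

```python
from dataclasses import dataclass, field

@dataclass
class RecordType:
    """Hypothetical stand-in for a record type declaration (not AsterixDB's API)."""
    fields: dict                                  # declared field name -> Python type
    optional: set = field(default_factory=set)    # declared fields that may be absent
    is_open: bool = True                          # open types admit undeclared fields

def conforms(record: dict, rtype: RecordType) -> bool:
    # Declared fields must be present (unless optional) and correctly typed.
    for name, ftype in rtype.fields.items():
        value = record.get(name)
        if value is None:
            if name not in rtype.optional:
                return False
        elif not isinstance(value, ftype):
            return False
    # A closed type rejects any field that was not declared.
    extra = set(record) - set(rtype.fields)
    return rtype.is_open or not extra

user_fields = {"id": int, "name": str, "email": str}
open_user = RecordType(user_fields, optional={"email"}, is_open=True)
closed_user = RecordType(user_fields, optional={"email"}, is_open=False)

rec = {"id": 1, "name": "Ann", "hobbies": ["chess"]}
print(conforms(rec, open_user))    # True: the extra "hobbies" field is allowed
print(conforms(rec, closed_user))  # False: "hobbies" was not declared
```

The same record is accepted by the open type and rejected by the closed one, which is the choice the answer above describes.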
What this gives you is a choice! If you want to have the total, last-minute flexibility of MongoDB or Couchbase, with your data being self-describing, we support that – you don’t have to predefine your schema if you use data types that are totally open. (The only thing we insist on, at the moment, is that every type must have a key field or fields – we use keys when sharding datasets across a cluster.)
Structurally, our data model was JSON-inspired – it’s essentially a schema language for a JSON superset – so we’re very synergistic with MongoDB or Couchbase data in that regard.
On the other end of the spectrum, if you’re still a relational bigot, you’re welcome to make all of your data types be flat – don’t use features like nested records, lists, or bags in your record definitions – and mark them all as “closed” so that your data matches your schema. With AsterixDB, we can go all the way from traditional relational to “don’t ask, don’t tell”. As for systems with BigTable-like “data models” – I’d personally shy away from calling those “data models”.
Q7. How do you handle horizontal scaling? And vertical scaling?
Mike Carey: We scale out horizontally using the same sort of divide-and-conquer techniques that have been used in commercial parallel relational DBMSs for years now, and more recently in Hadoop as well. That is, we horizontally partition both data (for storage) and queries (when processed) across the nodes of commodity clusters. Basically, our innards look very like those of systems such as Teradata or Parallel DB2 or PDW from Microsoft – we use join methods like parallel hybrid hash joins, and we pay attention to how data is currently partitioned to avoid unnecessary repartitioning – but have a data model that’s way more flexible. And we’re open source and free….
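As a rough illustration of the partitioning idea, the Python sketch below hash-partitions two inputs on the same key and then joins each pair of partitions locally, so no records need to move between partitions. It is a simplified in-memory hash join, not the parallel hybrid hash join mentioned above, and `NUM_PARTITIONS`, `partition_of`, and the sample records are assumptions made for the example.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed number of partitions, for illustration only

def partition_of(key) -> int:
    # Stable hash so the same key always maps to the same partition.
    return zlib.crc32(repr(key).encode("utf-8")) % NUM_PARTITIONS

def hash_partition(records, key_field):
    parts = defaultdict(list)
    for rec in records:
        parts[partition_of(rec[key_field])].append(rec)
    return parts

def local_hash_join(left, right, key_field):
    # Build a hash table on one input, then probe it with the other.
    table = defaultdict(list)
    for rec in left:
        table[rec[key_field]].append(rec)
    return [{**l, **r} for r in right for l in table.get(r[key_field], [])]

def partitioned_join(left_parts, right_parts, key_field):
    # Both inputs are already partitioned on the join key, so each
    # partition is joined locally and nothing is repartitioned.
    out = []
    for p in range(NUM_PARTITIONS):
        out.extend(local_hash_join(left_parts.get(p, []),
                                   right_parts.get(p, []), key_field))
    return out

users = [{"uid": i, "name": f"user{i}"} for i in range(6)]
orders = [{"uid": i % 3, "total": 10 * i} for i in range(5)]
joined = partitioned_join(hash_partition(users, "uid"),
                          hash_partition(orders, "uid"), "uid")
print(len(joined))  # 5: every order matches exactly one user
```

If the two inputs were partitioned on different fields, one of them would first have to be repartitioned on the join key, which is the cost a partition-aware optimizer tries to avoid.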
We scale vertically (within one node) in two ways. First of all, we aren’t memory-dependent in the way that many of the current Big Data Analytics solutions are; it’s not the case that you have to buy a big enough cluster so that your data, or at least your intermediate results, can be memory-resident.
Instead, our physical operators (for joins, sorting, aggregation, etc.) all spill to disk if needed – so you can operate on Big Data partitions without getting “out of memory” errors. The other way is that we allow nodes to hold multiple partitions of data; that way, one can also use multi-core nodes effectively.
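To make "spill to disk" concrete, here is a minimal external merge sort in Python. It is a generic textbook sketch rather than AsterixDB's sort operator; `external_sort`, `memory_budget`, and the helper names are hypothetical.

```python
import heapq
import pickle
import tempfile

def _write_run(sorted_items):
    # Spill one sorted run to a temporary file and rewind it for reading.
    f = tempfile.TemporaryFile()
    for item in sorted_items:
        pickle.dump(item, f)
    f.seek(0)
    return f

def _read_run(f):
    # Stream items back from a spilled run one at a time.
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            f.close()
            return

def external_sort(items, memory_budget=1000):
    """Sort an iterable, buffering at most memory_budget items before spilling."""
    runs, buffer = [], []
    for item in items:
        buffer.append(item)
        if len(buffer) >= memory_budget:
            runs.append(_write_run(sorted(buffer)))
            buffer = []
    # Merge the spilled runs with whatever is still in memory.
    streams = [_read_run(f) for f in runs] + [iter(sorted(buffer))]
    return heapq.merge(*streams)

data = [(i * 7919) % 1009 for i in range(5000)]
result = list(external_sort(data, memory_budget=512))
print(result == sorted(data))  # True
```

The input never has to fit in memory at once: only one buffer of `memory_budget` items is held while runs are built, and the merge phase reads one item per run at a time.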
Q8. What performance figures do you have for AsterixDB?
Mike Carey: As I mentioned earlier, we’ve completed a set of initial performance tests on a small cluster at UCI with 40 cores and 40 disks, and the results of those tests can be found in a recently published AsterixDB overview paper that’s hanging on our project web site’s publication page (http://asterixdb.ics.uci.edu/publications.html).
We have a couple of other performance studies in flight now as well, and we’ll be hanging more information about those studies in the same place on our web site when they’re ready for human consumption. There’s also a deeper dive paper on the AsterixDB storage manager that has some performance results regarding the details of scaling, indexing, and so on; that’s available on our web site too. The quick answer to “how does AsterixDB perform” is that we’re already quite competitive with other systems that have narrower feature sets – which we’re pretty proud of.
Q9. You mentioned support for continuous data ingestion. How does that work?
Mike Carey: We have a special feature for that in AsterixDB – we have a built-in notion of Data Feeds that are designed to simplify the lives of users who want to use our system for warehousing of continuously arriving data.
We provide Data Feed adaptors to enable outside data sources to be defined and plugged in to AsterixDB, and then one can “connect” a Data Feed to an AsterixDB data set and the data will start to flow in. As the data comes in, we can optionally dispatch a user-defined function on each item to do any initial information extraction/annotation that you want. Internally, this creates a long-running job that our system monitors – if data starts coming too fast, we offer various policies to cope with it, ranging from discarding data to sampling data to adding more UDF computation tasks (if that’s the bottleneck). More information about this is available in the Data Feeds tech report on our web site, and we’ll soon be documenting this feature in the downloadable version of AsterixDB. (Right now it’s there but “hidden”, as we have been testing it first on a set of willing UCI student guinea pigs.)
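The discard and sampling policies can be pictured with a toy Python buffer that sits between a feed source and a dataset. This is only an assumption-laden sketch: `FeedBuffer`, its `policy` values, and `sample_every` are hypothetical names, and the real Data Feeds mechanism is described in the tech report mentioned above.

```python
from collections import deque

class FeedBuffer:
    """Hypothetical buffer between a feed adaptor and a dataset (not AsterixDB's API)."""

    def __init__(self, capacity, policy="discard", sample_every=2):
        self.capacity = capacity          # size at which the buffer counts as congested
        self.policy = policy              # "discard" or "sample"
        self.sample_every = sample_every  # under "sample", keep 1 of every N congested items
        self.items = deque()
        self.dropped = 0
        self._overflow_seen = 0

    def offer(self, item, udf=None):
        if len(self.items) >= self.capacity:      # congested
            if self.policy == "discard":
                self.dropped += 1
                return False
            if self.policy == "sample":
                self._overflow_seen += 1
                if self._overflow_seen % self.sample_every != 0:
                    self.dropped += 1
                    return False
        # Optionally apply a user-defined function to each accepted item.
        self.items.append(udf(item) if udf else item)
        return True

discarding = FeedBuffer(capacity=2, policy="discard")
sampling = FeedBuffer(capacity=2, policy="sample", sample_every=2)
for i in range(6):
    discarding.offer(i)
    sampling.offer(i)
print(list(discarding.items), discarding.dropped)  # [0, 1] 4
print(list(sampling.items), sampling.dropped)      # [0, 1, 3, 5] 2
```

The third option mentioned above, adding more UDF computation tasks, is a scaling decision rather than a drop policy and is not modeled here.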
Q10. What is special about the AsterixDB Query Language? Why not use SQL?
Mike Carey: When we set out to define the query language for AsterixDB, we decided to define our own new language – since it seemed like everybody else was doing that at the time (witness Pig, Jaql, HiveQL, etc.) – one aimed at our data model.
SQL doesn’t handle nested or open data very well, so extending ANSI/ISO SQL seemed like a non-starter – that was also based on some experience working on SQL3 in the late 90’s. (Take a look at Oracle’s nested tables, for example.) Based on our team’s backgrounds in XML querying, we actually started with XQuery, which was developed by a team of really smart people from the SQL world (including Don Chamberlin, father of SQL) as well as from the XML world and the functional programming world. We took XQuery and then started throwing the stuff overboard that wasn’t needed for JSON or that seemed like a poor feature that had been added for XPath compatibility.
What remained was AQL, and we think it’s a pretty nice language for semistructured data handling. We periodically do toy with the notion of adding a SQL-like re-skinning of AQL to make SQL users feel more at home – and we may well do that in the future – but that would be different than “real SQL”. (The N1QL effort at Couchbase is doing something along those lines, language-wise, as an example. The SQL++ design from UCSD is another good example there.)
Q11. What level of concurrency and recovery guarantees does AsterixDB offer?
Mike Carey: We offer transaction support that’s akin to that of current NoSQL stores. That is, we promise record-level ACIDity – so inserting or deleting a given record will happen as an atomic, durable action. However, we don’t offer general-purpose distributed transactions. We support an arbitrary number of secondary indexes on data sets, and we’ll keep all the indexes on a data set transactionally consistent – that we can do because secondary index entries for a given record live in the same data partition as the record itself, so those transactions are purely local.
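The reason co-location keeps index maintenance local can be sketched in Python as follows. The `Partition` class is a hypothetical in-memory model, not AsterixDB's storage manager, and it illustrates only atomicity within one partition; durability (logging and recovery) is not modeled.

```python
import threading
from collections import defaultdict

class Partition:
    """Hypothetical data partition: primary records plus their local secondary indexes."""

    def __init__(self, indexed_fields):
        self.primary = {}
        self.secondary = {f: defaultdict(set) for f in indexed_fields}
        self._lock = threading.Lock()

    def insert(self, key, record):
        with self._lock:
            if key in self.primary:
                raise KeyError(f"duplicate key {key!r}")
            # Collect index entries before mutating, so a failure
            # (e.g. a missing indexed field) leaves nothing half-applied.
            entries = [(f, record[f]) for f in self.secondary]
            self.primary[key] = record
            for f, v in entries:
                self.secondary[f][v].add(key)

    def delete(self, key):
        with self._lock:
            record = self.primary.pop(key)
            for f, idx in self.secondary.items():
                idx[record[f]].discard(key)

    def lookup(self, field, value):
        with self._lock:
            keys = self.secondary[field].get(value, ())
            return [self.primary[k] for k in sorted(keys)]

p = Partition(["city"])
p.insert(1, {"name": "Ann", "city": "Irvine"})
p.insert(2, {"name": "Bob", "city": "Irvine"})
p.delete(1)
print(p.lookup("city", "Irvine"))  # [{'name': 'Bob', 'city': 'Irvine'}]
```

Because the record and all of its index entries live in the same partition, a single local lock (or local log) covers the whole update, and no cross-node coordination is needed.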
Q12. How does AsterixDB compare with Hadoop? What about Hadoop Map/Reduce compatibility?
Mike Carey: I think we’ve already covered most of that – Hadoop MapReduce is an answer to low-level “parallel programming for dummies”, and it’s great for that – and languages on top like Pig Latin and HiveQL are better programming abstractions for “data tasks” but have runtimes that could be much better. We started over, much as the recent flurry of Big Data analytics platforms are now doing (e.g., Impala, Spark, and friends), but with a focus on scaling to memory-challenging data sizes. We do have a MapReduce compatibility layer that goes along with our Hyracks runtime layer – Hyracks is the name of our internal dataflow runtime layer – but our MapReduce compatibility layer is not related to (or connected to) the AsterixDB system.
Q13. How does AsterixDB relate to Hadapt?
Mike Carey: I’m not familiar with Hadapt, per se, but I read the HadoopDB work that fed into it.
We’re architecturally very different – we’re not Hadoop-based at all – I’d say that HadoopDB was more of an expedient hybrid coupling of Hadoop and databases, to get some of the indexing and local query efficiency of an existing database engine quickly in the Hadoop world. We were thinking longer term, starting from first principles, about what a next-generation BDMS might look like. AsterixDB is what we came up with.
Q14. How does AsterixDB relate to Spark?
Mike Carey: Spark is aimed at fast Big Data analytics – its data is coming from HDFS, and the task at hand is to scan and slice and dice and process that data really fast. Things like Shark and SparkSQL give users SQL query power over the scanned data, but Spark in general is really catching fire, it appears, due to its applicability to Big Machine Learning tasks. In contrast, we’re doing Big Data Management – we store and index and query Big Data. It would be a very interesting/useful exercise for us to explore how to make AsterixDB another source where Spark computations can get input data from and send their results to, as we’re not targeting the more complex, in-memory computations that Spark aims to support.
Q15. How can others contribute to the project?
Mike Carey: We would love to see this start happening – and we’re finally feeling more ready for that, and even have some NSF funding to make AsterixDB something that others in the Big Data community can utilize and share.
(Note that our system is Apache-style open source licensed, so there are no “gotchas” lurking there.)
Some possibilities are:
(1) Others can start to use AsterixDB to do real exploratory Big Data projects, or to teach about Big Data (or even just semistructured data) management. Each time we’ve worked with trial users we’ve gained some insights into our feature set, our query optimizations, and so on – so this would help contribute by driving us to become better and better over time.
(2) Folks who are studying specific techniques for dealing with modern data – e.g., new structures for indexing spatiotemporaltextual (:-)) data – might consider using AsterixDB as a place to try out their new ideas.
(This is not for the meek, of course, as right now effective contributors need to be good at reading and understanding open source software without the benefit of a plethora of internal design documents or other hints.) We also have some internal wish lists of features we wish we had time to work on – some of which are even doable from “outside”, e.g., we’d like to have a much nicer browser-based workbench for users to use when interacting with and managing an AsterixDB cluster.
(3) Students or other open source software enthusiasts who download and try our software and get excited about it – who then might want to become an extension of our team – should contact us and ask about doing so. (Try it first, though!) We would love to have more skilled hands helping with fixing bugs, polishing features, and making the system better – it’s tough to build robust software in a university setting, and we would especially welcome contributors from companies.
Thanks very much for this opportunity to share what we’ve been doing!
Michael J. Carey is a Bren Professor of Information and Computer Sciences at UC Irvine.
Before joining UCI in 2008, Carey worked at BEA Systems for seven years and led the development of BEA’s AquaLogic Data Services Platform product for virtual data integration. He also spent a dozen years teaching at the University of Wisconsin-Madison, five years at the IBM Almaden Research Center working on object-relational databases, and a year and a half at e-commerce platform startup Propel Software during the infamous 2000-2001 Internet bubble. Carey is an ACM Fellow, a member of the National Academy of Engineering, and a recipient of the ACM SIGMOD E.F. Codd Innovations Award. His current interests all center around data-intensive computing and scalable data management (a.k.a. Big Data).
– AsterixDB Big Data Management System (BDMS): Downloads, Documentation, Asterix Publications.