{"id":3501,"date":"2014-10-22T18:18:28","date_gmt":"2014-10-22T18:18:28","guid":{"rendered":"http:\/\/www.odbms.org\/blog\/?p=3501"},"modified":"2014-10-22T18:32:05","modified_gmt":"2014-10-22T18:32:05","slug":"asterixdb-hadoop-interview-mike-carey","status":"publish","type":"post","link":"https:\/\/www.odbms.org\/blog\/2014\/10\/asterixdb-hadoop-interview-mike-carey\/","title":{"rendered":"AsterixDB: Better than Hadoop? Interview with Mike Carey"},"content":{"rendered":"<blockquote><p><strong>&#8220;To distinguish AsterixDB from current Big Data analytics platforms \u2013 which query but don&#8217;t store or manage Big Data \u2013 we like to classify AsterixDB as being a \u201cBig Data Management System\u201d (BDMS, with an emphasis on the \u201cM\u201d)&#8221;&#8211;Mike Carey.<\/strong><\/p><\/blockquote>\n<p><a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/2014\/03\/michael-j-carey-uc-irvine\/');\"  href=\"http:\/\/www.odbms.org\/2014\/03\/michael-j-carey-uc-irvine\/\" target=\"_blank\"><strong>Mike Carey<\/strong><\/a> and his colleagues have been working on a new data management system for Big Data called <strong><a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/2014\/05\/asterixdb-big-data-management-system-bdms\/');\"  href=\"http:\/\/www.odbms.org\/2014\/05\/asterixdb-big-data-management-system-bdms\/\" target=\"_blank\">AsterixDB<\/a><\/strong>.<\/p>\n<p>The AsterixDB Big Data Management System (BDMS) is the result of approximately four years of R&amp;D involving researchers at UC Irvine, UC Riverside, and Oracle Labs. The AsterixDB code base currently consists of over 250K lines of Java code that has been co-developed by project staff and students at UCI and UCR.<\/p>\n<p>The AsterixDB project has been supported by the U.S. National Science Foundation as well as by several generous industrial gifts.<\/p>\n<p>RVZ<\/p>\n<p><strong>Q1. 
Why build a new Big Data Management System?<\/strong><\/p>\n<p><strong>Mike Carey: <\/strong>When we started this project in 2009, we were looking at a \u201csplit universe\u201d \u2013 there were your traditional parallel data warehouses, based on expensive proprietary relational DBMSs, and then there was the emerging Hadoop platform, which was free but low-function in comparison and wasn\u2019t based on the many lessons known to the database community about how to build platforms to efficiently query large volumes of data.\u00a0We wanted to bridge those worlds, and handle \u201cmodern data\u201d while we were at it, by taking into account the key lessons from both sides.<\/p>\n<p>To distinguish AsterixDB from current Big Data analytics platforms \u2013 which query but don&#8217;t store or manage Big Data \u2013 we like to classify AsterixDB as being a \u201cBig Data Management System\u201d (BDMS, with an emphasis on the \u201cM\u201d).\u00a0<br \/>\nWe felt that the Big Data world, once the initial Hadoop furor started to fade a little, would benefit from having a platform that could offer things like:<\/p>\n<ul>\n<li>a flexible data model that could handle data scenarios ranging from &#8220;schema first&#8221; to &#8220;schema never&#8221;;<\/li>\n<li>a full query language with at least the expressive power of SQL;<\/li>\n<li>support for data storage, data management, and automatic indexing;<\/li>\n<li>support for a wide range of query sizes, with query processing cost being proportional to the given query;<\/li>\n<li>support for continuous data ingestion, hence the accumulation of Big Data;<\/li>\n<li>the ability to scale up gracefully to manage and query very large volumes of data using commodity clusters; and,<\/li>\n<li>built-in support for today&#8217;s common &#8220;Big Data data types&#8221;, such as textual, temporal, and simple spatial data.<\/li>\n<\/ul>\n<p>So that\u2019s what we set out to do.<\/p>\n<p><strong>Q2. 
What was wrong with the current Open Source Big Data Stack?<\/strong><\/p>\n<p><strong>Mike Carey:\u00a0<\/strong>First, we should mention that some reviewers back in 2009 thought we were crazy or stupid (or both) to not just be jumping on the Hadoop bandwagon \u2013 but we felt it was important, as academic researchers, to look beyond Hadoop and be asking the question \u201cokay, but after Hadoop, then what?\u201d\u00a0<br \/>\nWe recognized that <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/MapReduce');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/MapReduce\" target=\"_blank\">MapReduce<\/a> was great for enabling developers to write massively parallel jobs against large volumes of data without having to \u201cthink parallel\u201d \u2013 just focusing on one piece of data (map) or one key-sharing group of data (reduce) at a time. As a platform for \u201cparallel programming for dummies\u201d, it was (and still is) very enabling!\u00a0It also made sense, for expedience, that people were starting to offer declarative languages like <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/pig.apache.org');\"  href=\"http:\/\/pig.apache.org\" target=\"_blank\">Pig<\/a> and <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/hive.apache.org');\"  href=\"https:\/\/hive.apache.org\" target=\"_blank\">Hive<\/a>, compiling them down into Hadoop MapReduce jobs to improve programmer productivity \u2013 raising the level much like what the database community did in moving to the relational model and query languages like SQL in the 70\u2019s and 80\u2019s.<\/p>\n<p>One thing that we felt was wrong for sure in 2009 was that higher-level languages were being compiled into an assembly language with just two instructions, map and reduce. 
We knew from <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Edgar_F._Codd');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Edgar_F._Codd\" target=\"_blank\">Ted Codd<\/a> and relational history that more instructions \u2013 like the relational algebra\u2019s operators \u2013 were important \u2013 and recognized that the data sorting that Hadoop always does between map and reduce wasn\u2019t always needed.\u00a0<br \/>\nTrying to simulate everything with just map and reduce on Hadoop made \u201cget something better working fast\u201d sense, but not longer-term technical sense.\u00a0As for <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/hadoop.apache.org\/docs\/r1.2.1\/hdfs_design.html');\"  href=\"http:\/\/hadoop.apache.org\/docs\/r1.2.1\/hdfs_design.html\" target=\"_blank\">HDFS<\/a>, what seemed \u201cwrong\u201d about it under Pig and Hive was its being based on giant byte stream files and not on \u201cdata objects\u201d, which basically meant file scans for all queries and lack of indexing.\u00a0We decided to ask \u201cokay, suppose we\u2019d known that Big Data analysts were going to mostly want higher-level languages \u2013 what would a Big Data platform look like if it were built \u2018on purpose\u2019 for such use, instead of having incrementally evolved from HDFS and Hadoop?\u201d<\/p>\n<p>Again, our idea was to try and bring together the best ideas from both the database world and the distributed systems world.\u00a0(I guess you could say that we wanted to build a Big Data Reese\u2019s Cup&#8230; \u263a)<\/p>\n<p><strong>Q3. 
AsterixDB has been designed to manage vast quantities of semi-structured data.\u00a0How do you define semi-structured data?<\/strong><\/p>\n<p><strong>Mike Carey:\u00a0<\/strong>In the late 90\u2019s and early 2000\u2019s there was a bunch of work on that \u2013 on relaxing both the rigid\/flat nature of the relational model as well as the requirement to have a separate, a priori specification of the schema (structure) of your data.\u00a0We felt that this flexibility was one of the things \u2013 aside from its \u201cfree\u201d price point \u2013 drawing people to the <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/hadoop.apache.org');\"  href=\"http:\/\/hadoop.apache.org\" target=\"_blank\">Hadoop ecosystem<\/a> (and the key-value world) instead of the parallel data warehouse ecosystem.<br \/>\nIn the Hadoop world you can start using your data right away, without spending 3 months in committee meetings to decide on your schema and indexes and getting DBA buy-in.\u00a0To us, semi-structured means schema flexibility, so in AsterixDB, we let you decide how much of your schema you have to know and\/or choose to reveal up front, and how much you want to leave to be self-describing and thus allow it to vary later.\u00a0And it also means not requiring the world to be flat \u2013 so we allow nesting of records, sets, and lists.\u00a0And it also means dealing with textual data \u201cout of the box\u201d, because there\u2019s so much of that now in the Big Data world.<\/p>\n<p><strong>Q4. 
The motto of your project is <em>&#8220;One Size Fits a Bunch&#8221;<\/em>.\u00a0You claim that AsterixDB can offer better functionality, manageability, and performance than gluing together multiple point solutions (e.g., Hadoop + Hive + MongoDB).\u00a0 Could you please elaborate on this?<\/strong><\/p>\n<p><strong>Mike Carey:\u00a0<\/strong>Sure.\u00a0If you look at current Big Data IT infrastructures, you\u2019ll see a lot of different tools and systems being tied together to meet an organization\u2019s end-to-end data processing requirements.\u00a0In between systems and steps you have the glue \u2013 scripts, workflows, and ETL-like data transformations \u2013 and if some of the data needs to be accessible faster than a file scan, it\u2019s stored not just in HDFS, but also in a document store or a key-value store.<br \/>\nThis just seems like too many moving parts.\u00a0We felt we could build a system that could meet more (not all!) of today\u2019s requirements, like the ones I listed in my answer to the first question.<br \/>\nIf your data is in fewer places or can take a flight with fewer hops to get the answers, that\u2019s going to be more manageable \u2013 you\u2019ll have fewer copies to keep track of and fewer processes that might have hiccups to watch over.\u00a0If you can get more done in one system, obviously that\u2019s more functional.\u00a0And in terms of performance, we\u2019re not trying to out-perform the specialty systems \u2013 we\u2019re just trying to match them on what each does well. 
If we can do that, you can use our new system without needing as many puzzle pieces and can do so without making a performance sacrifice.<br \/>\nWe\u2019ve recently finished up a first comparison of how we perform on tasks that systems like parallel relational systems, <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.mongodb.org');\"  href=\"http:\/\/www.mongodb.org\" target=\"_blank\">MongoDB<\/a>, and <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/hive.apache.org');\"  href=\"https:\/\/hive.apache.org\" target=\"_blank\">Hive<\/a> can do \u2013 and things look pretty good so far for AsterixDB in that regard.<\/p>\n<p><strong><i>Q5. AsterixDB has been combining ideas from three distinct areas &#8212; semi-structured data management, parallel databases, and data-intensive computing. Could you please elaborate on that?<\/i><\/strong><\/p>\n<p><strong>Mike Carey:\u00a0<\/strong>Our feeling was that each of these areas has some ideas that are really important for Big Data.\u00a0Borrowing from semi-structured data ideas, but also more traditional databases, leads you to a place where you have flexibility that parallel databases by themselves do not.\u00a0Borrowing from parallel databases leads to scale-out that semi-structured data work didn\u2019t provide (since scaling is orthogonal to data model) and with query processing efficiencies that parallel databases offer through techniques like hash joins and indexing &#8211; which <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/MapReduce');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/MapReduce\" target=\"_blank\">MapReduce<\/a>-based data-intensive computing platforms like Hadoop and its language layers don&#8217;t give you. 
Borrowing from the MapReduce world leads to the open-source \u201cpricing\u201d and flexibility of Hadoop-based tools, and argues for the ability to process some of your queries directly over <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/hadoop.apache.org\/docs\/r1.2.1\/hdfs_design.html');\"  href=\"http:\/\/hadoop.apache.org\/docs\/r1.2.1\/hdfs_design.html\" target=\"_blank\">HDFS<\/a> data (which we call \u201cexternal data\u201d in AsterixDB, and do also support in addition to managed data).<\/p>\n<p><strong>Q6. How does the AsterixDB Data Model compare with the data models of NoSQL data stores, such as document databases like MongoDB and CouchBase, simple key\/value stores like Riak and Redis, and column-based stores like HBase and Cassandra?<\/strong><\/p>\n<p><strong>Mike Carey:\u00a0<\/strong>AsterixDB\u2019s data model is flexible \u2013 we have a notion of \u201copen\u201d versus \u201cclosed\u201d data types \u2013 it\u2019s a simple idea but it\u2019s unique as far as we know.\u00a0When you define a data type for records to be stored in an AsterixDB dataset, you can choose to pre-define any or all of the fields and types that objects to be stored in it will have \u2013 and if you mark a given type as being \u201copen\u201d (or let the system default it to \u201copen\u201d), you can store objects there that have those fields (and types) as well as any\/all other fields that your data instances happen to have at insertion time.<br \/>\nOr, if you prefer, you can mark a type used by a dataset as \u201cclosed\u201d, in which case AsterixDB will make sure that all inserted objects will have exactly the structure that your type definition specifies \u2013 nothing more and nothing less.<br \/>\n(We do allow fields to be marked as optional, i.e., nullable, if you want to say something about their type without mandating their presence.)<\/p>\n<p>What this gives you is a choice!\u00a0 If you want to have the total, last-minute flexibility of 
MongoDB or Couchbase, with your data being self-describing, we support that \u2013 you don&#8217;t have to predefine your schema if you use data types that are totally open.\u00a0(The only thing we insist on, at the moment, is that every type must have a key field or fields \u2013 we use keys when sharding datasets across a cluster.)<\/p>\n<p>Structurally, our data model was <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/JSON');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/JSON\" target=\"_blank\">JSON<\/a>-inspired \u2013 it\u2019s essentially a schema language for a JSON superset \u2013 so we\u2019re very synergistic with <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.mongodb.com');\"  href=\"http:\/\/www.mongodb.com\" target=\"_blank\">MongoDB<\/a> or <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.couchbase.com');\"  href=\"http:\/\/www.couchbase.com\" target=\"_blank\">Couchbase<\/a> data in that regard.\u00a0<br \/>\nOn the other end of the spectrum, if you\u2019re still a relational bigot, you\u2019re welcome to make all of your data types be flat \u2013 don\u2019t use features like nested records, lists, or bags in your record definitions \u2013 and mark them all as \u201cclosed\u201d so that your data matches your schema.\u00a0With AsterixDB, we can go all the way from traditional relational to \u201cdon\u2019t ask, don\u2019t tell\u201d. As for systems with <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/BigTable');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/BigTable\" target=\"_blank\">BigTable<\/a>-like \u201cdata models\u201d \u2013 I\u2019d personally shy away from calling those \u201cdata models\u201d.<\/p>\n<p><strong>Q7. 
How do you handle horizontal scaling?\u00a0And vertical scaling?<\/strong><\/p>\n<p><strong>Mike Carey:\u00a0<\/strong>We scale out horizontally using the same sort of divide-and-conquer techniques that have been used in commercial parallel relational DBMSs for years now, and more recently in Hadoop as well.\u00a0That is, we horizontally partition both data (for storage) and queries (when processed) across the nodes of commodity clusters.\u00a0Basically, our innards look very much like those of systems such as <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.teradata.com');\"  href=\"http:\/\/www.teradata.com\" target=\"_blank\">Teradata<\/a> or <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.ibm.com\/developerworks\/data\/library\/techarticle\/0301milligan\/0301milligan.html');\"  href=\"http:\/\/www.ibm.com\/developerworks\/data\/library\/techarticle\/0301milligan\/0301milligan.html\" target=\"_blank\">Parallel DB2<\/a> or <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.microsoft.com\/de-de\/server-cloud\/products\/analytics-platform-system\/');\"  href=\"http:\/\/www.microsoft.com\/de-de\/server-cloud\/products\/analytics-platform-system\/\" target=\"_blank\">PDW from Microsoft<\/a> \u2013 we use join methods like <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Hash_join');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Hash_join\" target=\"_blank\">parallel hybrid hash joins<\/a>, and we pay attention to how data is currently partitioned to avoid unnecessary repartitioning \u2013 but have a data model that\u2019s way more flexible.\u00a0And we\u2019re open source and free&#8230;.<\/p>\n<p>We scale vertically (within one node) in two ways.\u00a0First of all, we aren\u2019t memory-dependent in the way that many of the current Big Data Analytics solutions are; it\u2019s not the case that you have to buy a big enough cluster so that your data, or at least your intermediate 
results, can be memory-resident.<br \/>\nInstead, our physical operators (for joins, sorting, aggregation, etc.) all spill to disk if needed \u2013 so you can operate on Big Data partitions without getting \u201cout of memory\u201d errors.\u00a0The other way is that we allow nodes to hold multiple partitions of data; that way, one can also use multi-core nodes effectively.<\/p>\n<p><strong>Q8. What performance figures do you have for AsterixDB?<\/strong><\/p>\n<p><strong>Mike Carey:\u00a0<\/strong>As I mentioned earlier, we\u2019ve completed a set of initial performance tests on a small cluster at UCI with 40 cores and 40 disks, and the results of those tests can be found in a recently published AsterixDB overview paper that\u2019s hanging on our project web site\u2019s publication page (<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/asterixdb.ics.uci.edu\/publications.html');\"  href=\"http:\/\/asterixdb.ics.uci.edu\/publications.html\">http:\/\/asterixdb.ics.uci.edu\/publications.html<\/a>).<br \/>\nWe have a couple of other performance studies in flight now as well, and we\u2019ll be hanging more information about those studies in the same place on our web site when they\u2019re ready for human consumption. There\u2019s also a deeper dive paper on the AsterixDB storage manager that has some performance results regarding the details of scaling, indexing, and so on; that\u2019s available on our web site too.\u00a0The quick answer to \u201chow does AsterixDB perform\u201d is that we\u2019re already quite competitive with other systems that have narrower feature sets \u2013 which we\u2019re pretty proud of.<\/p>\n<p><strong>Q9. You mentioned support for continuous data ingestion. 
How does that work?<\/strong><\/p>\n<p><strong>Mike Carey:<\/strong>\u00a0We have a special feature for that in AsterixDB \u2013 we have a built-in notion of <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Data_feed');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Data_feed\" target=\"_blank\">Data Feeds<\/a> that are designed to simplify the lives of users who want to use our system for warehousing of continuously arriving data.<br \/>\nWe provide Data Feed adaptors to enable outside data sources to be defined and plugged in to AsterixDB, and then one can \u201cconnect\u201d a Data Feed to an AsterixDB data set and the data will start to flow in. As the data comes in, we can optionally dispatch a user-defined function on each item to do any initial information extraction\/annotation that you want.\u00a0 Internally, this creates a long-running job that our system monitors \u2013 if data starts coming too fast, we offer various policies to cope with it, ranging from discarding data to sampling data to adding more UDF computation tasks (if that\u2019s the bottleneck).\u00a0More information about this is available in the Data Feeds tech report on our web site, and we\u2019ll soon be documenting this feature in the downloadable version of AsterixDB.\u00a0(Right now it\u2019s there but \u201chidden\u201d, as we have been testing it first on a set of willing UCI student guinea pigs.)<\/p>\n<p><strong>Q10. 
What is special about the AsterixDB Query Language?\u00a0Why not use SQL?<\/strong><\/p>\n<p><strong>Mike Carey:\u00a0<\/strong>When we set out to define the query language for AsterixDB, we decided to define our own new language \u2013 since it seemed like everybody else was doing that at the time (witness <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/pig.apache.org');\"  href=\"http:\/\/pig.apache.org\" target=\"_blank\">Pig<\/a>, <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Jaql');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Jaql\" target=\"_blank\">Jaql<\/a>, <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Apache_Hive');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Apache_Hive\" target=\"_blank\">HiveQL<\/a>, etc.) \u2013 one aimed at our data model.\u00a0<br \/>\n<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/SQL');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/SQL\" target=\"_blank\">SQL<\/a> doesn\u2019t handle nested or open data very well, so extending ANSI\/ISO SQL seemed like a non-starter \u2013 that was also based on some experience working on <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/SQL');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/SQL\" target=\"_blank\">SQL3<\/a> in the late 90\u2019s.\u00a0(Take a look at <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/docs.oracle.com\/cd\/B10501_01\/appdev.920\/a96624\/05_colls.htm');\"  href=\"http:\/\/docs.oracle.com\/cd\/B10501_01\/appdev.920\/a96624\/05_colls.htm\" target=\"_blank\">Oracle\u2019s nested tables<\/a>, for example.)\u00a0Based on our team\u2019s backgrounds in XML querying, we actually started there \u2013 <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/XQuery');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/XQuery\" target=\"_blank\">XQuery<\/a> was developed by 
a team of really smart people from the SQL world (including <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Donald_D._Chamberlin');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Donald_D._Chamberlin\" target=\"_blank\">Don Chamberlin<\/a>, father of SQL) as well as from the XML world and the functional programming world \u2013 so we started there.\u00a0We took XQuery and then started throwing the stuff overboard that wasn\u2019t needed for JSON or that seemed like a poor feature that had been added for XPath compatibility.<br \/>\nWhat remained was <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/asterixdb.ics.uci.edu\/documentation\/aql\/manual.html');\"  href=\"http:\/\/asterixdb.ics.uci.edu\/documentation\/aql\/manual.html\" target=\"_blank\">AQL<\/a>, and we think it\u2019s a pretty nice language for semistructured data handling.\u00a0We periodically do toy with the notion of adding a SQL-like re-skinning of AQL to make SQL users feel more at home \u2013 and we may well do that in the future \u2013 but that would be different than \u201creal SQL\u201d.\u00a0(The <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/docs.couchbase.com\/prebuilt\/n1ql\/n1ql-dp3\/');\"  href=\"http:\/\/docs.couchbase.com\/prebuilt\/n1ql\/n1ql-dp3\/\" target=\"_blank\">N1QL<\/a> effort at Couchbase is doing something along those lines, language-wise, as an example.\u00a0The <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/arxiv.org\/abs\/1405.3631');\"  href=\"http:\/\/arxiv.org\/abs\/1405.3631\" target=\"_blank\">SQL++<\/a> design from UCSD is another good example there.)<\/p>\n<p><strong>Q11. 
What level of concurrency and recovery guarantees does AsterixDB offer?<\/strong><\/p>\n<p><strong>Mike Carey:\u00a0<\/strong>We offer transaction support that\u2019s akin to that of current <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/NoSQL');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/NoSQL\" target=\"_blank\">NoSQL stores<\/a>.\u00a0That is, we promise record-level ACIDity \u2013 so inserting or deleting a given record will happen as an atomic, durable action.\u00a0However, we don\u2019t offer general-purpose distributed transactions.\u00a0We support an arbitrary number of secondary indexes on data sets, and we\u2019ll keep all the indexes on a data set transactionally consistent \u2013 that we can do because secondary index entries for a given record live in the same data partition as the record itself, so those transactions are purely local.<\/p>\n<p><strong>Q12. How does AsterixDB compare with Hadoop?\u00a0What about Hadoop Map\/Reduce compatibility?<\/strong><\/p>\n<p><strong>Mike Carey:\u00a0<\/strong>I think we\u2019ve already covered most of that \u2013 Hadoop MapReduce is an answer to low-level \u201cparallel programming for dummies\u201d, and it\u2019s great for that \u2013 and languages on top like <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Pig_(programming_tool)');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Pig_(programming_tool)\" target=\"_blank\">Pig Latin<\/a> and <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Apache_Hive');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Apache_Hive\" target=\"_blank\">HiveQL<\/a> are better programming abstractions for \u201cdata tasks\u201d but have runtimes that could be much better.\u00a0We started over, much as the recent flurry of Big Data analytics platforms are now doing (e.g., <a 
onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.cloudera.com\/content\/cloudera\/en\/products-and-services\/cdh\/impala.html');\"  href=\"http:\/\/www.cloudera.com\/content\/cloudera\/en\/products-and-services\/cdh\/impala.html\" target=\"_blank\">Impala<\/a>,<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/spark.apache.org');\"  href=\"https:\/\/spark.apache.org\" target=\"_blank\"> Spark<\/a>, and friends), but with a focus on scaling to memory-challenging data sizes.\u00a0We do have a MapReduce compatibility layer that goes along with our <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.ics.uci.edu\/~rares\/pub\/icde11-borkar.pdf');\"  href=\"http:\/\/www.ics.uci.edu\/~rares\/pub\/icde11-borkar.pdf\" target=\"_blank\">Hyracks<\/a> runtime layer \u2013 Hyracks is the name of our internal dataflow runtime layer \u2013 but our MapReduce compatibility layer is not related to (or connected to) the AsterixDB system.<\/p>\n<p><strong>Q13. How does AsterixDB relate to Hadapt?<\/strong><\/p>\n<p><strong>Mike Carey:\u00a0<\/strong>I\u2019m not familiar with <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/hadapt.com');\"  href=\"http:\/\/hadapt.com\" target=\"_blank\">Hadapt<\/a>, per se, but I read the <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/db.cs.yale.edu\/hadoopdb\/hadoopdb.html');\"  href=\"http:\/\/db.cs.yale.edu\/hadoopdb\/hadoopdb.html\" target=\"_blank\">HadoopDB<\/a> work that fed into it.\u00a0<br \/>\nWe\u2019re architecturally very different \u2013 we\u2019re not Hadoop-based at all \u2013 I\u2019d say that HadoopDB was more of an expedient hybrid coupling of Hadoop and databases, to get some of the indexing and local query efficiency of an existing database engine quickly in the Hadoop world.\u00a0We were thinking longer term, starting from first principles, about what a next-generation BDMS might look like.\u00a0AsterixDB is what we came up with.<\/p>\n<p><strong>Q14. 
How does AsterixDB relate to Spark?<\/strong><\/p>\n<p><strong>Mike Carey:\u00a0<\/strong><a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/spark.apache.org');\"  href=\"https:\/\/spark.apache.org\" target=\"_blank\">Spark<\/a> is aimed at fast Big Data analytics \u2013 its data is coming from HDFS, and the task at hand is to scan and slice and dice and process that data really fast.\u00a0Things like Shark and <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/spark.apache.org\/sql\/');\"  href=\"https:\/\/spark.apache.org\/sql\/\" target=\"_blank\">SparkSQL<\/a> give users SQL query power over the scanned data, but Spark in general is really catching fire, it appears, due to its applicability to Big Machine Learning tasks.\u00a0In contrast, we\u2019re doing Big Data Management \u2013 we store and index and query Big Data.\u00a0It would be a very interesting\/useful exercise for us to explore how to make AsterixDB another source where Spark computations can get input data from and send their results to, as we\u2019re not targeting the more complex, in-memory computations that Spark aims to support.<\/p>\n<p><strong>Q15. 
How can others contribute to the project?<\/strong><\/p>\n<p><strong>Mike Carey:\u00a0<\/strong>We would love to see this start happening \u2013 and we\u2019re finally feeling more ready for that, and even have some NSF funding to make AsterixDB something that others in the Big Data community can utilize and share.\u00a0<br \/>\n(Note that our system is <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Apache_License');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Apache_License\" target=\"_blank\">Apache-style open source licensed<\/a>, so there are no \u201cgotchas\u201d lurking there.)<br \/>\nSome possibilities are:<\/p>\n<p>(1) Others can start to use AsterixDB to do real exploratory Big Data projects, or to teach about Big Data (or even just semistructured data) management.\u00a0Each time we\u2019ve worked with trial users we\u2019ve gained some insights into our feature set, our query optimizations, and so on \u2013 so this would help contribute by driving us to become better and better over time.<\/p>\n<p>(2) Folks who are studying specific techniques for dealing with modern data \u2013 e.g., new structures for indexing spatiotemporaltextual (\u263a) data \u2013 might consider using AsterixDB as a place to try out their new ideas.<br \/>\n(This is not for the meek, of course, as right now effective contributors need to be good at reading and understanding open source software without the benefit of a plethora of internal design documents or other hints.)\u00a0We also have some internal wish lists of features we wish we had time to work on \u2013 some of which are even doable from \u201coutside\u201d, e.g., we\u2019d like to have a much nicer browser-based workbench for users to use when interacting with and managing an AsterixDB cluster.<\/p>\n<p>(3) Students or other open source software enthusiasts who download and try our software and get excited about it \u2013 who then might want to become an extension of our team \u2013 
should contact us and ask about doing so.\u00a0(Try it first, though!)\u00a0 We would love to have more skilled hands helping with fixing bugs, polishing features, and making the system better \u2013 it\u2019s tough to build robust software in a university setting, and we would especially welcome contributors from companies.<\/p>\n<p>Thanks very much for this opportunity to share what we\u2019ve been doing!<\/p>\n<p>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;<br \/>\n<strong>Michael J. Carey<\/strong> <em>is a Bren Professor of Information and Computer Sciences at UC Irvine.<br \/>\nBefore joining UCI in 2008, Carey worked at BEA Systems for seven years and led the development of BEA\u2019s AquaLogic Data Services Platform product for virtual data integration. He also spent a dozen years teaching at the University of Wisconsin-Madison, five years at the IBM Almaden Research Center working on object-relational databases, and a year and a half at e-commerce platform startup Propel Software during the infamous 2000-2001 Internet bubble. Carey is an ACM Fellow, a member of the National Academy of Engineering, and a recipient of the ACM SIGMOD E.F. Codd Innovations Award. His current interests all center around data-intensive computing and scalable data management (a.k.a. 
Big Data).<\/em><\/p>\n<p><strong>Resources<\/strong><\/p>\n<p>&#8211; AsterixDB Big Data Management System (BDMS): <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/2014\/05\/asterixdb-big-data-management-system-bdms\/');\"  href=\"http:\/\/www.odbms.org\/2014\/05\/asterixdb-big-data-management-system-bdms\/\" target=\"_blank\">Downloads, Documentation, Asterix Publications.<\/a><\/p>\n<p><strong>Related Posts<\/strong><\/p>\n<p>&#8211;<strong><a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/blog\/2014\/09\/interview-mithun-radhakrishnan\/');\"  href=\"http:\/\/www.odbms.org\/blog\/2014\/09\/interview-mithun-radhakrishnan\/\" target=\"_blank\">Hadoop at Yahoo. Interview with Mithun Radhakrishnan. ODBMS Industry Watch, September 21, 2014<\/a><\/strong><\/p>\n<p>&#8211;<strong><a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/blog\/2014\/06\/john-schroeder-ceo-cofounder-mapr-technologies\/');\"  href=\"http:\/\/www.odbms.org\/blog\/2014\/06\/john-schroeder-ceo-cofounder-mapr-technologies\/\" target=\"_blank\">On the Hadoop market. Interview with John Schroeder. ODBMS Industry Watch, June 30, 2014<\/a><\/strong><\/p>\n<p><strong>Follow ODBMS.org on Twitter: <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/twitter.com\/odbmsorg');\"  href=\"https:\/\/twitter.com\/odbmsorg\" target=\"_blank\">@odbmsorg<\/a><\/strong><br \/>\n##<\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;To distinguish AsterixDB from current Big Data analytics platforms \u2013 which query but don&#8217;t store or manage Big Data \u2013 we like to classify AsterixDB as being a \u201cBig Data Management System\u201d (BDMS, with an emphasis on the \u201cM\u201d)&#8221;&#8211;Mike Carey. 
Mike Carey and his colleagues have been working on a new data management system for [&hellip;]<!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt --><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[35,733,66,155,239,298,355,734,391,446,549],"_links":{"self":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts\/3501"}],"collection":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/comments?post=3501"}],"version-history":[{"count":13,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts\/3501\/revisions"}],"predecessor-version":[{"id":3515,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts\/3501\/revisions\/3515"}],"wp:attachment":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/media?parent=3501"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/categories?post=3501"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/tags?post=3501"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}