On SQL++ and Couchbase N1QL for Analytics. Interview with Mike Carey.
“N1QL for Analytics is the first commercial implementation of SQL++.” –Mike Carey
I have interviewed Michael Carey, Bren Professor of Information and Computer Sciences and Distinguished Professor of Computer Science at UC Irvine, where he leads the AsterixDB project, as well as a Consulting Architect at Couchbase. We talked about SQL++, the AsterixDB project, and Couchbase N1QL for Analytics.
Q1. You are Couchbase’s Consulting Chief Architect. What are your main tasks in such a role?
Mike Carey: This came about when Couchbase began working on the effort that led to the recently released Couchbase Analytics Service, a service that was born when Ravi Mayuram (Couchbase’s Senior VP of Engineering and CTO) and I realized that Couchbase and the AsterixDB project shared a common vision regarding what future data management systems ought to look like. Rather than making me quit my day job, I was given the opportunity to participate in a consulting role and build a team within Couchbase to make the Analytics Service happen — using AsterixDB as a starting point. I guess now I’m kind of a mini-CTO for database-related issues; I primarily focus on the Analytics Service, but I also pay attention to the Query Service and the Couchbase Data Platform as a whole, especially when it comes to things like its query capabilities. I spend one day a week up at Couchbase HQ, at least most weeks. It’s really fun, and this keeps me connected to what’s happening in the “real world” outside academia.
Q2. What is SQL++? And what is special about it?
Mike Carey: SQL++ is a language that came out of work done by Prof. Yannis Papakonstantinou and his group at UC San Diego. Prior to SQL++, in the AsterixDB project, we had invented and implemented a full query language for semi-structured data called AQL (short for Asterix Query Language) based on a data model called ADM (short for Asterix Data Model). ADM was the result of realizing back in 2010 that JSON was coming in a pretty big way — we looked at JSON from a database data modeling perspective and added some things inspired by object databases that were missing. Most notable were the option to specify schemas, at least partially, if desired, and the ability to have multisets as well as arrays as multi-valued fields. AQL was the result of looking at XQuery, since it had been designed by a group of world experts to deal with semi-structured data, and then throwing out its “XML cruft” in order to gain a nice query language for ADM. To make AQL a bit more natural for SQL users, we also allowed some optional keyword substitutions (such as SELECT for RETURN and FROM for FOR). We had a pretty reasonable technical explanation for users as to why AQL was what it was — why it wasn’t just a SQL extension. Users listened and learned AQL, but they always seemed to wistfully sigh and continue to wish that AQL was more directly like SQL (in its syntax and not just its query power).
More or less in parallel, Yannis and friends were building a data integration system called FORWARD to integrate data of varied shapes and sizes from heterogeneous data stores. The FORWARD view of data was based on a semi-structured data model, and SQL++ was the SQL-based language framework that Yannis developed to classify the query capabilities of the stores. It also served as the integration language for FORWARD’s end users. At some point he approached us with a draft of his SQL++ framework paper, getting our attention by saying nice things about AQL relative to the other JSON query languages (:-)), and we took a look. Pretty quickly we realized that SQL++ was very much like AQL, but with a SQL-based syntax that would make those wistful AQL users much happier. Yannis did a very nice job of extending and generalizing SQL, allowing for a few differences where needed, such as where SQL had made “flat-world” or schema-based assumptions that no longer hold for JSON, and exploiting the generality of the nested data model, like adding richer support for grouping and de-mystifying grouped aggregation.
We have since “re-skinned” Apache AsterixDB to use SQL++ as the end-user query language for the system. This was actually relatively easy to do since all of the same algebra and physical operators work for both. We recently deprecated AQL altogether as an end-user language.
Q3. What is N1QL for Analytics?
Mike Carey: The Couchbase Analytics service is a component of the Couchbase Data Platform that allows users to run analytical-sized queries over their Couchbase JSON data. N1QL for Analytics is the product name for the end-user query language of Couchbase Analytics. It’s a dialect of SQL++, which itself is a language framework; the framework includes a number of choices that a SQL++ implementer gets to pin down about details like data types, missing information, supported functions, and so on. N1QL for Analytics could have been called “Couchbase SQL++”, but N1QL (non-1NF query language) is what Couchbase originally called the SQL-inspired query language for its Query service. A decision was made to keep the N1QL brand name, while adding “for Query” or “for Analytics” to more specifically identify the target service. Over time both N1QLs will be converging to the same dialect of SQL++. The bottom line is that N1QL for Analytics is the first commercial implementation of SQL++.
By the way, there’s a terrific new book available on Amazon called “SQL++ for SQL Users: A Tutorial.” It was written by Don Chamberlin, of SQL fame, for folks who want to learn more about SQL++ (from one of the world’s leading query language experts).
Q4. Is N1QL for Analytics based entirely on the SQL++ framework?
Mike Carey: Indeed it is. As I mentioned, N1QL for Analytics is really a dialect of SQL++, having chosen a particular combination of detailed settings that the framework provides options for. In the future it may gain other extensions, e.g., support for window queries, but right now, N1QL for Analytics is based entirely on the SQL++ framework.
Q5. How is the new Couchbase Analytics influenced by the open-source Apache AsterixDB project?
Mike Carey: You’ve probably seen those computer ads in magazines that say “Intel Inside,” yes? In this case, the ad would say “Apache AsterixDB Inside”… 🙂
Q6. Specifically, did you re-use the Apache AsterixDB query engine? Or else?
Mike Carey: Specifically, yes. The Couchbase Data Platform, internally, is based on a software bus that the Data service (the Key/Value store service) broadcasts all data events on — and components like the Index service, Full Text service, Cross Datacenter Replication service, and others are all bus listeners. The Analytics service is a listener as well, and it manages a real-time replica of the KV data in order to make that data immediately available for analysis in a performance-isolated manner. Performance isolation is needed so that analytical queries don’t interfere with the front-end applications. Under the hood, the Analytics service is based on Apache AsterixDB — its storage facilities are used to store and manage the data, and its query engine powers the parallel query processing. The developers at Couchbase contribute their work on those components back to the Apache AsterixDB open source, and these days they’re among its most prolific committers. Couchbase Analytics also has some extensions that are only available from Couchbase — including integrated system management, cluster resizing, and a nice integrated query console — but the core plumbing is the same.
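To make the bus-and-listener arrangement described above more concrete, here is a minimal Python sketch. It is only an illustration of the pattern, not Couchbase's actual implementation: the class names `DataService` and `AnalyticsReplica`, the event shape (a key plus a document, with `None` standing for a delete), and the `count_where` helper are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical event shape: (key, document); document is None for a delete.
Listener = Callable[[str, Optional[dict]], None]


class DataService:
    """Toy key/value store that broadcasts every mutation to its listeners."""

    def __init__(self) -> None:
        self._store: Dict[str, dict] = {}
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def upsert(self, key: str, doc: dict) -> None:
        self._store[key] = doc
        self._broadcast(key, doc)

    def delete(self, key: str) -> None:
        self._store.pop(key, None)
        self._broadcast(key, None)

    def _broadcast(self, key: str, doc: Optional[dict]) -> None:
        for listener in self._listeners:
            listener(key, doc)


class AnalyticsReplica:
    """Listener that keeps its own copy of the data, so analytical scans
    read the replica and never touch the front-end store."""

    def __init__(self) -> None:
        self.docs: Dict[str, dict] = {}

    def on_event(self, key: str, doc: Optional[dict]) -> None:
        if doc is None:
            self.docs.pop(key, None)
        else:
            self.docs[key] = dict(doc)  # shallow copy; the replica owns its data

    def count_where(self, predicate: Callable[[dict], bool]) -> int:
        return sum(1 for d in self.docs.values() if predicate(d))


kv = DataService()
analytics = AnalyticsReplica()
kv.subscribe(analytics.on_event)

kv.upsert("order::1", {"type": "order", "total": 40})
kv.upsert("order::2", {"type": "order", "total": 75})
kv.delete("order::1")

# The analytical question is answered from the replica only.
big_orders = analytics.count_where(lambda d: d["total"] > 50)
```

The point of the design is the separation: the front-end store only pays the cost of broadcasting, while every scan runs against the listener's copy.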
Q7. SQL does not provide an efficient solution for querying JSON or semi-structured data in JSON form. Can you explain how Couchbase Analytics analyzes data in JSON format? What is that capability useful for?
Mike Carey: Couchbase Analytics supports a JSON-based “come as you are” data model rather than requiring data to be normalized and schematized for analysis. We like to say that this gives users “NoETL for NoSQL.” You can perhaps think of it as being a data mart for Couchbase application data. The application folks think about their data naturally; if it’s nested, it’s allowed to be nested (e.g., an order object can contain a nested set of line items and a nested shipping address), and if it’s heterogeneous, it’s allowed to be heterogeneous (e.g., an electronic product can have different descriptive data than a clothing product or a furniture product). Couchbase Analytics allows data analysis on data that looks like that — data can “come as it is” and SQL++ is ready to query it in that “as is” form. You can do all the same analyses that you could do if you first designed a relational schema and wrote a collection of ETL scripts to move the data into a parallel SQL DBMS — but without having to do all that. Instead, you can now “have your data and query it too” in its original, natural, front-end JSON structure.
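As a rough illustration of what querying data "as is" means, the Python sketch below computes a per-order total directly over nested, heterogeneous documents, with no flattening into separate order and line-item tables first. The documents, field names (`orderno`, `ship_to`, `items`, `qty`, `price`), and the `order_totals` function are invented for this example; in Couchbase Analytics the same question would be expressed declaratively in SQL++ rather than in application code.

```python
# Hypothetical order documents: nesting and missing fields are both allowed.
orders = [
    {"orderno": 1001,
     "ship_to": {"city": "Irvine", "zip": "92697"},
     "items": [{"itemno": 347, "qty": 5, "price": 20},
               {"itemno": 193, "qty": 2, "price": 30}]},
    {"orderno": 1002,
     "ship_to": {"city": "San Diego"},          # no zip code
     "items": [{"itemno": 460, "qty": 10, "price": 2}]},
    {"orderno": 1003, "items": []},             # no address, no items
]


def order_totals(docs):
    """Unnest each order's items and sum qty * price per order number.

    Orders without items contribute nothing, mirroring an inner-join style
    unnest followed by a grouped aggregation.
    """
    totals = {}
    for order in docs:
        for item in order.get("items", []):
            amount = item["qty"] * item["price"]
            totals[order["orderno"]] = totals.get(order["orderno"], 0) + amount
    return totals


totals = order_totals(orders)

# Heterogeneity: only look at the shipping city where an address exists.
cities = [o["ship_to"]["city"] for o in orders if "ship_to" in o]
```

Note that nothing in the sketch required declaring a schema up front; documents that lack a field are simply skipped where that field matters.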
Q8. Can you please explain the architecture behind Couchbase’s MPP engine for JSON data?
Mike Carey: Sure, that’s easy — I can pretty much just refer you to the body of literature on parallel relational data management. (For an overview, see the classic DeWitt and Gray CACM paper on parallel database systems.)
Under the hood, the query engine for Couchbase Analytics and Apache AsterixDB looks like a best-practices parallel relational query engine. It uses hash partitioning to scale out horizontally in an MPP fashion, and it uses best-practices physical operators (e.g., dynamic hash join, broadcast join, index join, parallel sort, sort-based and hash-based grouped aggregation, …) to deal gracefully with very large volumes of data. The operator set and the optimizer rules have just been extended where needed to accommodate nesting and schema optionality. Data is hash-partitioned on its primary key (the Couchbase key), with optional local secondary indexes on other fields, and queries run in parallel on all nodes in order to support linear speed-up and/or scale-up.
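The partition-then-aggregate idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the AsterixDB engine: `NUM_PARTITIONS`, `partition_of`, `load`, and `parallel_group_count` are hypothetical names, partitions are plain in-memory lists rather than nodes, and Python threads stand in for truly parallel workers. A real MPP engine would also typically repartition on the grouping key instead of merging all partial results in one place.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4  # hypothetical number of nodes


def partition_of(key: str) -> int:
    # Deterministic hash, so a given primary key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def load(docs: dict) -> list:
    """Hash-partition documents on their primary key."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for key, doc in docs.items():
        partitions[partition_of(key)].append(doc)
    return partitions


def local_group_count(partition: list, field: str) -> Counter:
    # Hash-based grouped aggregation over a single partition.
    return Counter(doc[field] for doc in partition)


def parallel_group_count(partitions: list, field: str) -> Counter:
    """Aggregate each partition independently, then merge the partial counts."""
    result = Counter()
    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        partials = pool.map(lambda p: local_group_count(p, field), partitions)
        for partial in partials:
            result.update(partial)  # Counter.update adds counts together
    return result


docs = {f"user::{n}": {"country": "US" if n % 3 else "DE"} for n in range(12)}
partitions = load(docs)
counts = parallel_group_count(partitions, "country")
```

Because each partition is aggregated without reference to the others, adding partitions lets the same query spread over more workers, which is the property behind the speed-up and scale-up claims above.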
Q9. Do you think other database vendors will implement their own version/dialect of SQL++?
Mike Carey: Indeed I do. It’s a really nice language, and it makes a ton of sense as the “right” answer to querying the more general data models that one gets when one lets down their relational guard. It’s a whole lot cleaner than the “JSON as a column type” approach to adding JSON support to traditional RDBMSs in my opinion.
Qx. Anything else you wish to add?
Mike Carey: I teach the “Introduction to Data Management” class at UC Irvine as part of my day job. Our class sizes these days are exceeding 400 students per quarter — database systems are clearly not dead in students’ eyes! For the past few years I’ve been spending the last bit of the class on “NoSQL technology” — which to me means “no schema required” — and I’ve used SQL++ for the associated hands-on homework assignment. It’s been great to see how quickly and easily (relatively new!) SQL users can get their heads around the more relaxed data model and the query power of SQL++. Some faculty friends at the University of Washington have done this as well, and their experience there has been similar. I would like to encourage others to do the same! With SQL++, richer data no longer has to mean writing get/put programs or effectively hand-writing query plans, so it’s a very nice platform for teaching future generations about the emerging NoSQL world and its concepts and benefits.
Michael Carey received his B.S. and M.S. degrees from Carnegie-Mellon University and his Ph.D. from the University of California, Berkeley. He is currently a Bren Professor of Information and Computer Sciences and Distinguished Professor of Computer Science at UC Irvine, where he leads the AsterixDB project, as well as a Consulting Architect at Couchbase, Inc. Before joining UCI in 2008, he worked at BEA Systems for seven years and led the development of their AquaLogic Data Services Platform product for virtual data integration. He also spent a dozen years at the University of Wisconsin-Madison, five years at the IBM Almaden Research Center working on object-relational databases, and a year and a half at e-commerce platform startup Propel Software during the infamous 2000-2001 Internet bubble. He is an ACM Fellow, an IEEE Fellow, a member of the National Academy of Engineering, and a recipient of the ACM SIGMOD E.F. Codd Innovations Award. His current interests center around data-intensive computing and scalable data management (a.k.a. Big Data).
SQL++ For SQL Users: A Tutorial, Don Chamberlin, September 2018 (free book, 143 pages)