On Couchbase Analytics. Q&A with Sachin Smotra
Q1. How are customers using Couchbase Analytics today?
Couchbase Analytics is being used in multiple industries and across many different use cases. Some examples include:
- A top cruise line is revolutionizing the passenger experience by using analytics to add a personal touch from ship to shore. Even when they’re at sea with spotty connectivity, the cruise ships are able to collect, process, and analyze data from onboard sensors and passenger medallions. Couchbase enables the cruise line to operationalize the analytics in real time without the heavy lifting of data lakes, data warehouses, or ETL processes.
- A leading retailer uses real-time shopping cart analysis during Black Friday and summer sales to eliminate duplicate items in carts, identify top-selling items, and conduct other data exploration directly on JSON data. The ability to process heavy transactional data in real time without impacting throughput, latency, or the online shopping experience was this customer’s most critical requirement.
- A prominent healthcare provider uses Couchbase Analytics to eliminate the need for an ETL pipeline. Their previous analytics infrastructure required a complex pipeline with an elaborate schema that had to be redesigned every time they needed a new report. With Couchbase, they now analyze large amounts of JSON data in its original state to drive better outcomes for both patients and providers.
- A Major League Baseball team uses Couchbase Analytics to modernize the fan experience. For instance, they optimize food cart placement in real time based on stadium occupancy. As fans scan their tickets to get seated, concessions personnel use a live seat map to place the food carts closer to the fans so they can get their food and drinks faster.
Q2. What are the digital analytics trends in 2019, and why do they matter?
As data grows in value, enterprises are scrutinizing their data systems more closely. In particular, they’re trying to improve data quality across the board in order to better leverage AI and machine learning (ML) technologies.
Enterprises continue to possess an insatiable appetite for data analytics, and that includes carrying out analytics on operational data, which is often referred to as hybrid workloads. Couchbase’s Analytics Service targets hybrid workloads to help enterprises react quickly to incoming operational data.
The reason this matters so much is because data quality and availability can vary wildly within an organization, and it can take a lot of time to determine what data is clean, up to date, and trustworthy.
Q3. Where exactly does Couchbase Analytics fit in?
Let me start with – Couchbase Analytics is not a data lake!
A typical enterprise might have a data warehouse, a data lake, and hybrid database.
The data warehouse is designed to handle massive amounts of data, but with a predefined schema. As a result, any process writing to the data warehouse needs to understand the schema to be able to write to it.
The data lake, which is now being called the kitchen sink of enterprise data, relaxes the need for the upfront definition of a schema. As a result, any process can write to the lake without any rigid schema requirements. However, data lakes do require a schema on read, which means that any consumer of data stored in the lake would need to define a schema.
For operational data being generated by applications, a hybrid database provides the ability to analyze data as it comes in with a heavy transactional workload, but without impacting the transaction throughput or latency.
Couchbase Analytics is a hybrid NoSQL database that enables rapid time to insight for operational data.
Q4. MPP is not new and has existed in the relational world for many years – what is new in Couchbase Analytics?
Massively parallel processing has existed in the database world for over 30 years. There are many implementations of MPP in the relational world – Teradata, HP Vertica, Amazon Redshift, and Microsoft PDW are great examples. But these products work on flat relational data with predefined schemas. As application architectures have evolved, JSON has become the lingua franca for data modelling. NoSQL databases like Couchbase allow you to store and query JSON documents, which are much more natural for many applications. With traditional data warehouse technology, the schemaless and nested JSON data must then be converted into a flat relational schema. Couchbase Analytics allows for the analysis of JSON data in its natural form without any transformation. (We call this NoETL for NoSQL.) The Couchbase Analytics engine is a full-fledged parallel query processor that supports parallel joins, aggregations, and sorting. It’s based on best of breed algorithms from 30+ years of relational MPP R&D, but it’s for JSON data.
In our upcoming release, Couchbase Analytics will add support for the parallel execution of window analytic functions natively on JSON data. Even though window functions have existed in the relational world for many years, this is a big deal for NoSQL! It’s one more great example of Couchbase bringing the business-critical features of relational systems to the world of NoSQL.
Q5. How does the analytics ecosystem keep up with rapidly evolving applications that are taking advantage of the schema flexibility of JSON?
Today, the analytics ecosystem breaks down with rapidly evolving application schemas. I say this because the analytics schema needs to be redesigned every time the application schema changes. And every time the analytics schema needs to change, the ETL pipeline also needs to be reconfigured, which impacts the time to insight.
If an application is using a NoSQL database like Couchbase, then schema evolution for the application doesn’t require a heavy lift. This makes the application development process agile. Couchbase Analytics was created to support the analytical needs of agile application development. Application data is continuously streamed to the analytics engine, and changes in schema don’t interfere with the immediate availability of the application data for analytical processing.
Q6. Is the SQL/JSON extension the right answer for JSON storage and querying?
My answer is an emphatic no!
To clarify, the SQL/JSON extension only addresses how to query JSON data. The JSON data is stored in the underlying database whether it’s relational or NoSQL.
In a relational database, JSON documents are stored as lumps of JSON data (possibly as strings, BLOBS, or something else), and a set of SQL functions are provided for processing the lumps with JSON semantics. This works great if you have an existing application with lots of data stored in tables and lots of SQL code for maintaining and querying your data. If you just need to extend your application to include some JSON data, you can add a new JSON column and think of the JSON instances as documents associated with your data.
In Couchbase, JSON is the natural data format for your application. There’s no impedance mismatch between how an application interacts with data and how it’s stored and managed. The ability to analyze JSON data without flattening it first – or predefining and evolving its schema – is key to maintaining application agility. Operational analytics gives enterprises the ability to react to their incoming operational data in real time.
Both of the above approaches build on the power and familiarity of SQL and require SQL extensions. The first approach extends the SQL type system with a new column type for JSON data. In this approach, JSON is something of an afterthought, a second-class citizen. In contrast, the second approach generalizes the SQL language to handle schemaless and nested JSON objects with complex internal structures that would otherwise need to be modelled in multiple relational tables.
If you’d like to learn more, I recommend reading Comparing Two SQL-Based Approaches for Querying JSON: SQL++ and SQL:2016 written by Don Chamberlin, the co-inventor of SQL.
Q7. What is SQL++?
SQL++ is a query language for semi-structured data based on the JSON data format. The road to SQL++ began nearly a decade ago when the limitations of “strict relational model,” (then the most common database query language) were realized. SQL was created for the relational model, which implies that you’re working with tables. The structure of a table is uniform – each row has the same columns, and each column has the same type in every row. Even when data is missing its column is still there – for example, when a customer has no middle name, the “middle name” column is still there.
The world of JSON and “NoSQL” brings new challenges for querying – schemas are optional and often absent, and objects are nested and often heterogeneous in structure. SQL does not provide a complete and efficient solution for querying JSON or semi-structured data in JSON form.
Semi-structured data accounts for 85% of all business data today and it will continue to grow as AI, ML, and the internet of things continue to gain momentum. As a result, the need of the hour is a query language that brings together the full power of SQL with the flexibility of JSON and helps advance the management of document-oriented data and the JSON data format.
SQL++ shows how to create a principled, next-generation query language that brings together the full power of SQL with the flexibility of JSON, while allowing for a few differences since the “flat world” or schema-based assumptions made by SQL no longer hold for JSON. It also exploits the generality of the nested data model, for instance adding richer support for grouping and demystifying grouped aggregation.
Don Chamberlin’s book SQL++ for SQL users: A Tutorial explains all this in more detail.
Q8. Who are the other database vendors that have implemented SQL++?
Apache AsterixDB, a highly scalable open source big data management system, uses SQL++ as the end-user query language.
Earlier this year, AWS announced its PartiQL query language, which is based on SQL++ and uses a SQL-compatible language to access relational, semi-structured, and nested data.
I’d also like to point out that Couchbase Server 6.0 included the first commercial implementation of the SQL++ framework. N1QL for Analytics, which serves as the query language for the Couchbase Analytics Service, is based on SQL++.
Q9. You have recently announced new features in Couchbase Server that include support for multi-document ACID transactions and a host of query enhancements. How do you bridge such traditional database features to NoSQL databases?
The foundation of the NoSQL movement was built on a distributed architecture for next-generation applications that required performance, scale, and availability with a flexibility for access and schema that wasn’t possible with relational databases. Modern application developers chose to remove the impedance mismatch by persisting JSON data instead of converting object data into multiple relational tables.
Over the years, the applicability of NoSQL to enterprise architectures has broadened and there are now many instances of NoSQL replacing traditional RDBMS. This trend has led to many customers requesting that features available in relational technology be made available in NoSQL databases as well. Support for ACID is one such instance.
While Couchbase started as a key-value store, we’ve added a rich and compelling set of services – query, indexing, full-text search, mobile and IOT, analytics, and eventing – that enable enterprises to address a broad range of use cases. As database stacks are modernized, Couchbase is replacing relational databases for both new and existing applications.
Like many NoSQL databases, Couchbase has always supported single document transactions. This implied that application developers had to denormalize data for anything requiring transactional guarantees. As more workloads move to Couchbase, especially from modernization of RDBMS applications, we’ve been inundated with requests to support multi-document transactions. Couchbase Server 6.5 will add that capability early in 2020. As a result, there will be no longer be a need to denormalize data for transactional guarantees.
The Couchbase Query Service allows SQL developers to bring their skills to the world of NoSQL. The new features in Couchbase Server 6.5 will include additional familiar SQL “goodies” such as window functions, index advisor, cost-based optimizer, and more. But the key thing for developers is the ability to declaratively join, group, and aggregate large collections of documents. Couchbase is the only NoSQL database that supports all these capabilities.
Another request we sometimes get is to consolidate the business logic in the database tier to replace triggers in relational databases. The Couchbase Eventing Service makes it easy to develop, deploy, and maintain data-driven business logic in a centralized platform. It’s a highly available, performant, and scalable service that enables user-defined business logic to be triggered in real time on the server when application interactions create changes in data.
We continue to work with customers to solve their most pressing needs and add new features to help them modernize their legacy relational architectures.
Qx. Anything else you wish to add?
The relational model is now 49 years old! And it includes 30+ years of MPP technology R&D, plus more recent work on hybrid (HTAP or translytical) architectures. With Couchbase Server and Couchbase Analytics Service, all the benefits of that work can now be brought to bear on your application’s document data. One might say, “you can have your data and query it too.”
Thanks for giving us the opportunity to share our thoughts with your readers.
Sachin Smotra’s career spans more than 15 years building software products across various domains including Java Enterprise software, DRM Solutions for mobile games and web conferencing. As Director Product Management at Couchbase, he is a hands-on product leader responsible for Couchbase Analytics, Mobile & IOT product lines including evangelizing the product strategy and vision with customers, partners, developers and analysts.
Sponsored by Couchbase.