Q&A with Venkat Venkataramani — Rockset GA

Q1. Last fall, Rockset exited stealth. What is the mission statement of Rockset?

Rockset’s mission is to make all products and decisions in the world data-powered.

Q2. Can you please give some examples on how your data system is supposed to make it easy for developers to build data-driven apps? What are the main benefits with respect to other Data Platforms already existing in the market?

Rockset is a serverless search and analytics engine that eliminates ETL, servers and database administration.

In a single click, data scientists can turn complex, loosely structured data sets in JSON or Parquet into fast SQL tables. They can then spend more time working with their data, testing hypotheses and running experiments, instead of building complex data pipelines or ETL.

All data loaded into Rockset is automatically optimized for fast, powerful SQL processing. Developers can put their data sets to work immediately by building applications, microservices and live dashboards directly on top of Rockset without ever provisioning a server or performing any database administration.

Q3. Businesses are struggling with tons of high value, low quality data in fragmented systems like data lakes, NoSQL databases and data streams. In Rockset, data can be ingested from data streams, data lakes, and databases. How do you handle such variety of data?

Unlike traditional systems, Rockset is built from the ground up to work with modern semi-structured data formats such as JSON or Parquet. Rockset’s dynamic type system allows it to automatically adapt to loosely structured data sets and organize them in a way that allows fast and efficient SQL processing.

Every data management system has a type system. Traditional SQL based data management systems have a strong and static type system — which requires the shape of a data set to be defined before data can be loaded into it. This limitation is generally overcome by building extensive ETL pipelines. Other schemaless data management systems have a weak and dynamic type system — so it is easy to load data into these systems, but these systems do not provide powerful SQL-based data processing.

Rockset has strong dynamic typing, where the data type is associated with each individual field value rather than with an entire column. This means you can execute strongly typed SQL queries on dynamically typed data, making it easier to work with modern datasets using SQL without any ETL.
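As a rough illustration of the idea (a sketch only, not Rockset's implementation), the Python below keeps a type tag next to every value of a field instead of declaring one type for the whole column. The sample rows, the helper names `typed_column` and `sum_ints`, and the choice to skip non-integer values in the sum are all assumptions made for this example.

```python
# Hypothetical sample data: the "age" field holds different types per row.
rows = [
    {"user": "a", "age": 34},
    {"user": "b", "age": "35"},  # same field, but a string value
    {"user": "c", "age": None},  # explicit null
    {"user": "d"},               # field missing entirely
]


def typed_column(rows, field):
    """Return one (type_name, value) pair per row: the type travels with the value."""
    return [(type(row.get(field)).__name__, row.get(field)) for row in rows]


def sum_ints(rows, field):
    """Strongly typed aggregate: only int values participate.

    Strings are never silently coerced to numbers (an assumption made for
    this sketch about what "strong" typing implies).
    """
    values = (row.get(field) for row in rows)
    return sum(v for v in values if isinstance(v, int) and not isinstance(v, bool))


column = typed_column(rows, "age")
total = sum_ints(rows, "age")
print(column)
print(total)
```

The point of the sketch is that loading the rows needs no up-front schema, while the aggregate still applies strict per-value type rules.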

Q4. Specifically what is Converged Indexing™, and how is it different with respect to classical Indexing methods?

“Converged Indexing” is the method by which all data is organized in Rockset. The converged indexing approach combines the properties of multiple general-purpose indexes, such as a document index, an inverted index, and a columnar index, in a single data structure. This approach also allows Rockset to handle dynamically typed data and to index highly nested data structures like JSON with many levels of nesting. While classical indexing methods are designed to suit a particular application, Converged Indexing allows a wide spectrum of applications to be fast out of the box, without any database administration, schema modeling or query tuning.
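One way to picture combining the three index types in a single structure is a single key-value map in which every field of every document is written under three kinds of keys. The sketch below is illustrative only: the key layout (`"row"`, `"col"`, `"inv"`), the path notation for nested fields, and the function names are assumptions made for this example, not a description of Rockset's internal format.

```python
def flatten(value, prefix=""):
    """Yield (path, scalar) pairs for a nested JSON-like value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for child in value:
            yield from flatten(child, f"{prefix}[]")
    else:
        yield prefix, value


def build_index(docs):
    """Write every field of every document under three key families in one map."""
    index = {}
    for doc_id, doc in docs.items():
        for path, value in flatten(doc):
            # Document-style entry: all fields of one document together.
            index.setdefault(("row", doc_id), []).append((path, value))
            # Columnar entry: all values of one field together.
            index.setdefault(("col", path), []).append((doc_id, value))
            # Inverted entry: which documents hold this exact (typed) value.
            key = ("inv", path, type(value).__name__, value)
            index.setdefault(key, set()).add(doc_id)
    return index


docs = {
    1: {"name": "ada", "tags": ["db", "sql"], "geo": {"city": "SF"}},
    2: {"name": "bob", "tags": ["sql"], "geo": {"city": "NYC"}},
}
index = build_index(docs)

# A selective filter can use the inverted entries...
sql_docs = index[("inv", "tags[]", "str", "sql")]
# ...while a scan over one field can use the columnar entries.
cities = index[("col", "geo.city")]
```

Because the value's type is part of the inverted key, the same field can hold values of different types across documents without any schema being declared first.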

Q5. Rockset’s serverless data backend continuously ingests raw data as it is generated and delivers SQL queries. How do you ensure scalability and real time response time?

Rockset stores all data in a converged index that provides Rockset’s query optimizer plenty of options to come up with an optimal query execution strategy. Rockset distributes the data into a large number of nodes in the backend and employs parallel query execution strategies to ensure fast response time. As data sets get larger, the data is automatically distributed into a larger number of nodes, and all queries employ a higher degree of parallelism.

Rockset is one of the few distributed databases to successfully implement a bottom-up query execution strategy. In general, a bottom-up approach results in lower latencies because the nodes do not wait to be polled for resource availability, and they can also preemptively send results for certain types of queries, such as ORDER BY. Traditional single-node systems have already adopted this type of bottom-up optimization, but it is complicated to implement in a distributed system, so most distributed databases still rely on the top-down approach.

Q6. How are aggregation queries on multiple data sets executed in Rockset?

Rockset distributes the data across a large number of nodes in the backend, and aggregation queries are split into smaller tasks that can be executed in parallel. For example, say you want to find the average value of a column over a large data set that is distributed over 100 nodes in the Rockset backend. When such a query is issued, Rockset will break down the query into 101 tasks. The first 100 tasks instruct each index node to compute its local sum and local count and send that result to an aggregator node. The final task runs on the aggregator node, which waits to receive the partial sum and partial count from all 100 backend nodes, computes the global average and returns it to the user.

In this simple example, a single aggregator can compute the global average without becoming the bottleneck, but for many real-world queries this may not be true. In those situations, Rockset’s query planner will automatically employ multiple aggregators that can work in parallel.
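The 100-node average described in this answer can be sketched in a few lines of Python. This is a minimal single-process illustration, assuming each "node" is just a list of numbers and using a thread pool in place of real distributed workers. The names `local_partial` and `global_average` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def local_partial(shard):
    """Task run on each index node: return (local sum, local count)."""
    return sum(shard), len(shard)


def global_average(shards):
    """Aggregator task: combine every partial (sum, count) into one average."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_partial, shards))
    total = sum(part_sum for part_sum, _ in partials)
    count = sum(part_count for _, part_count in partials)
    return total / count if count else None


# 100 simulated nodes that together hold the integers 0..999 exactly once.
shards = [list(range(i, 1000, 100)) for i in range(100)]
avg = global_average(shards)
print(avg)
```

Each node sends a sum and a count rather than a local average, because averaging the local averages would give the wrong answer whenever nodes hold different numbers of rows.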

Q7. How do you analyze an incoming query and create an intelligent query plan for serving it? Is this new?

Query optimizers and planners are not new and have been around for as long as databases have existed. What makes our query optimizer and query planner different are the following aspects:

1. Converged indexing provides Rockset’s optimizer with a set of options (such as using the inverted index to execute the SQL WHERE clause quickly) that are not available to traditional SQL optimizers.
2. Dealing with dynamic typing and nested data structures introduces new challenges to Rockset’s query planner to ensure that performance is not sacrificed during query execution.
3. Rockset’s query planners are 100% cloud native, so if the query planners observe insufficient resources to handle a particular workload, they instruct Rockset’s schedulers to automatically grow the cluster to ensure good performance.
4. Rockset automatically indexes all data and is delivered as a SaaS, and thus our optimizer design gives us the opportunity to use machine learning (ML models) to train it for each individual customer based on their specific query patterns going forward, rather than using generic optimization rules.

Q8. PDFs are the de facto standard for distributing and sharing fixed-layout documents. How is Rockset supporting SQL queries on PDF files?

Rockset makes it easy to build a search application on PDF data. All text present within PDF documents loaded into Rockset is automatically extracted, and that text can be tokenized during ingest and searched using SQL.
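To make the tokenize-then-search step concrete, here is a small Python sketch. It assumes the text has already been extracted from each PDF (the extraction itself is not shown), uses a simple lowercase alphanumeric tokenizer, and answers a query by intersecting posting sets. The file names, sample text, and function names are hypothetical, and the sketch stands in for the idea only, not for Rockset's ingest pipeline or SQL interface.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_token_index(texts):
    """Map each token to the set of documents that contain it."""
    index = defaultdict(set)
    for name, text in texts.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def search(index, query):
    """Return the documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical text, standing in for what was extracted from two PDFs.
texts = {
    "a.pdf": "Quarterly revenue report",
    "b.pdf": "Revenue forecast and hiring plan",
}
token_index = build_token_index(texts)
hits = search(token_index, "revenue report")
print(hits)
```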

Q9. What is the roadmap ahead?

1. Support for ML-powered apps: allow users to turn any ML model into a user-defined function in Rockset, so that developers can build powerful ML apps with just SQL.
2. 1-click serverless microservices: turn any SQL query into a REST API endpoint, so that you can turn any SQL into a serverless lambda in a single click.
3. Bring in data from anywhere in any form: support for more data sources and more data formats, expanding coverage to auto-import data from all databases, data streams and data lakes in AWS, Azure and GCP.

Resources

*Rockset Home*

*Rockset’s dynamic type system (mentioned in Q3)*

*Converged Indexing*


Venkat Venkataramani — Rockset CEO and co-founder
Venkat Venkataramani is CEO and co-founder of Rockset. He was previously an Engineering Director in the Facebook infrastructure team responsible for all online data services that stored and served Facebook user data. Collectively, these systems worked across 5 geographies and served more than 5 billion queries a second. Prior to Facebook, Venkat worked on the Oracle Database.
