On InfluxData’s New Storage Engine. Q&A with Andrew Lamb
Q1. InfluxData just announced its new storage engine. What’s the story behind the development of the storage engine?
We just announced our new columnar storage engine, InfluxDB IOx, optimized for time series data, that delivers high-volume ingestion, unbounded cardinality, and super fast queries using Flux, InfluxQL, and SQL. IOx’s architecture is designed to improve performance, scalability, and resilience, and provide a base for new advanced analytics use cases focused specifically on time series.
We have been working on InfluxDB IOx since 2020 with multiple InfluxData engineers. Continuing InfluxData’s core value of commitment to open source, we have both contributed to and benefited significantly from collaboration with the open source developer community. Paul Dix says the announcement represents the largest leap forward for InfluxDB core since we introduced our TSM storage engine in 2016.
InfluxDB IOx addresses several key technical challenges specific to time series data. The first is high-cardinality data, where there are many distinct values in the tags (also called attributes) of the data. IOx eliminates the limits on inserting high-cardinality data. It also significantly improves the interoperability of time series data with the rest of the big data ecosystem. We added native SQL support for queries, alongside support for the more time series-focused Flux and InfluxQL. Along with PostgreSQL wire protocol compatibility, this lets users connect a broader range of third-party tools than ever before.
Q2. Tell us about the underlying technologies your team used to build the new storage engine.
The entire database is built around a project called Apache Arrow, an open source in-memory specification for columnar data. The columnar structure makes it very quick to do analytical queries on time series data and interoperates with the broader ecosystem. Similarly, we use Apache Parquet as the native storage format, and the DataFusion query engine to provide a Postgres-compatible SQL dialect, as well as a parser, planner, optimizer, and execution engine.
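As a rough illustration of why a columnar layout suits analytical queries, the sketch below pivots a few rows into per-column lists and computes the same average both ways. It is plain Python, not Arrow or IOx code; the field names (`time`, `host`, `cpu`) and the helper `to_columns` are hypothetical.

```python
# Hypothetical sample data: field names and values are illustrative only.
rows = [
    {"time": 1, "host": "a", "cpu": 0.50},
    {"time": 2, "host": "b", "cpu": 0.70},
    {"time": 3, "host": "a", "cpu": 0.90},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented: every whole record is visited just to read one field.
mean_from_rows = sum(row["cpu"] for row in rows) / len(rows)

# Column-oriented: only the contiguous "cpu" column is scanned.
cpu = columns["cpu"]
mean_from_columns = sum(cpu) / len(cpu)
```

Both computations give the same answer; the difference is that the columnar version never touches `time` or `host`, which is the property an in-memory columnar format is designed to exploit at scale.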
Q3. Why Arrow, DataFusion, and Parquet?
Since its founding, InfluxData has been a true believer in the power of open source software and open standards, and IOx follows this same tradition. Since Paul Dix’s initial announcement in 2020, we have nurtured the following projects, as well as others:
- Apache Arrow is a language-agnostic software framework for developing high-performance data analytics applications that process columnar data. Arrow standardizes industry best practices for columnar data layout and computation for analytic databases. In InfluxDB IOx, Arrow provides two key capabilities:
1. A standard and efficient way to exchange data between the database and the query processing engine
2. Fast and increasingly extensive interoperability with a broader ecosystem of data processing and analysis tools.
In addition to the in-memory format for columnar data, Arrow also includes Arrow Flight and the newly announced Apache Arrow Flight SQL, client/server protocols built for “high performance transfer of large datasets over network interfaces”.
- DataFusion is a Rust-native, extensible SQL query engine that uses Apache Arrow as its in-memory format. Simply by using DataFusion in the core, InfluxDB IOx supports SQL out of the box. As the DataFusion project continues to mature, functionality flows directly into InfluxDB IOx as well as the other systems built on DataFusion. This alignment allows us to work with a worldwide collection of amazing engineers, both inside and outside InfluxData, to efficiently and quickly develop advanced database technology in DataFusion.
- Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. Originally designed as part of the Hadoop ecosystem, it has become the standard format for large-scale data storage and query. Almost all analytic systems support Parquet either natively or via connectors. Parquet provides efficient data compression and encoding schemes across a wide variety of data types, including time series, and is arranged for very efficient read performance. IOx uses Parquet as its native persisted format, both for its impressive compression and so that data created by IOx can be queried by the vast panoply of other tools that read the format.
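To give a feel for why column-oriented files compress repetitive time series tags well, here is a minimal sketch of two encodings Parquet is known to offer, dictionary encoding and run-length encoding, written in plain Python. It does not reproduce Parquet's actual on-disk layout, and the function names and the `host_column` sample are hypothetical.

```python
def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(indices):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for index in indices:
        if runs and runs[-1][0] == index:
            runs[-1][1] += 1
        else:
            runs.append([index, 1])
    return runs


def decode(dictionary, runs):
    """Expand the runs back into the original column values."""
    return [dictionary[index] for index, count in runs for _ in range(count)]


# Hypothetical tag column: the same host names repeat across many points.
host_column = ["web-1"] * 4 + ["web-2"] * 3 + ["web-1"] * 2
dictionary, indices = dictionary_encode(host_column)
runs = run_length_encode(indices)
```

Nine strings shrink to a two-entry dictionary plus three short runs, and `decode(dictionary, runs)` restores the column exactly. Storing values column by column is what puts these repeats next to each other in the first place.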
Q4. What was behind the decision to build the database with Rust?
Rust is a systems language designed for speed, efficiency, reliability, and memory safety. InfluxData engineers have loved working with Go, and continue to do so for many of our projects, but when it came to IOx, it was critical to build the core using Rust for maximum efficiency.
For example, Rust gives us C/C++ levels of performance and control over runtime behavior and memory management, but with Go levels of safety, concurrency, and easy asynchronous programming. In addition, modern ecosystem features such as its packaging system, crates.io, make it easy to contribute to and benefit from open source libraries. The Rust community is also famously welcoming, and because Rust is one of the most loved languages in the Stack Overflow survey, using it attracted many talented developers to our company as well as to the ecosystem.
Q5. What are the key features of the new storage engine that are powered by these technologies?
The new storage engine delivers the following capabilities for developers.
- Fast Performance: IOx is the new purpose-built time series database optimized for fast ingest, schema on write, and automatic data migration to low-cost storage. It combines a hot, compressed in-memory datastore and a cold, low-cost object store for optimal write and query performance.
- SQL Support: The new query engine provides more options than ever to query data: API, Flux, InfluxQL, and SQL. Work faster with SQL by using popular tools like psql, Grafana, and Apache Superset, as well as languages such as Flux and InfluxQL that are designed specifically for time series workloads.
- Unbounded Cardinality: By using state-of-the-art columnar database technologies, IOx eliminates cardinality limits imposed by the classic inverted indexes used for time series, to support metrics, events, logs, and traces.
Q6. You mentioned unbounded cardinality is a key new feature in the new storage engine. What new use cases does unbounded cardinality enable?
We removed cardinality limits so users can bring in massive amounts of time series data with unbounded cardinality. Classic data center monitoring use cases involve monitoring 10s to 100s of distinct things. However, use cases such as IoT metrics, events, traces, and logs increasingly involve capturing data for 10,000s to millions of distinct things such as Kubernetes container IDs, individual low-cost IoT devices, or tracing span IDs. Quickly ingesting this data in a cost-effective manner in a datastore unlocks monitoring, alerting, and analytics on such large fleets of devices. Together, these new capabilities allow developers to write any kind of event data with infinite cardinality and slice-and-dice data on any dimension without sacrificing performance.
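To make the cardinality problem concrete, the following sketch builds a toy inverted index of the classic kind, mapping each tag key/value pair to the series that carry it, and shows how the number of series and index entries grows with the number of distinct container IDs. It is a simplified model in plain Python, not InfluxDB's actual index; `build_inverted_index`, `make_points`, and the tag names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(points):
    """Map each (tag_key, tag_value) pair to the set of series that carry it."""
    series_ids = {}
    index = defaultdict(set)
    for tags in points:
        # A series is identified by its full, sorted set of tag pairs.
        series_key = tuple(sorted(tags.items()))
        series_id = series_ids.setdefault(series_key, len(series_ids))
        for pair in series_key:
            index[pair].add(series_id)
    return series_ids, index


def make_points(container_count):
    """Hypothetical workload: one point per distinct container ID."""
    return [
        {"region": "us-east", "container_id": f"c{n}"}
        for n in range(container_count)
    ]


for count in (10, 1000):
    series_ids, index = build_inverted_index(make_points(count))
    print(count, "containers:", len(series_ids), "series,", len(index), "index entries")
```

In this model every new container ID adds a new series and a new index entry, so the index grows in step with the number of distinct things being tracked. That growth is the pressure a columnar design avoids by not keeping such a per-series index.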
Q7. Another feature is SQL support, bringing the popular language to InfluxDB for the first time. What does this mean for developers?
SQL support is another example of InfluxData’s commitment to meeting developers where they are. The number of tools and technology ecosystems that support SQL is massive, so by supporting SQL we allow developers to bring their existing tools and skills to time series data. Of course, we also support Flux and InfluxQL, both for those who prefer languages optimized for time series and for those who have already invested in those skills.
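As an example of the kind of everyday SQL skill that carries over to time series data, the sketch below averages a metric per host in 60-second buckets. It uses Python's built-in `sqlite3` module purely as a stand-in so the example is self-contained; it is not InfluxDB, and the `cpu` table, its columns, and the sample values are hypothetical. The exact SQL dialect IOx accepts may differ.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu (ts INTEGER, host TEXT, usage REAL)")
conn.executemany(
    "INSERT INTO cpu VALUES (?, ?, ?)",
    [
        (0, "a", 10.0),
        (30, "a", 20.0),
        (60, "a", 40.0),
        (0, "b", 50.0),
        (90, "b", 70.0),
    ],
)

# Average usage per host in 60-second buckets (integer division floors ts).
query = """
    SELECT host, (ts / 60) * 60 AS bucket, AVG(usage) AS avg_usage
    FROM cpu
    GROUP BY host, bucket
    ORDER BY host, bucket
"""
result = conn.execute(query).fetchall()
for host, bucket, avg_usage in result:
    print(host, bucket, avg_usage)
```

The grouping, bucketing, and aggregation here are ordinary SQL, which is the point: a developer who already writes queries like this needs no new language to start asking questions of time series data.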
Qx. Anything else you wish to add?
I am very excited to be deploying the new storage engine. It is a massive leap forward for our core database technology and helps deliver on our vision that InfluxDB can be used for event data (i.e. irregular time series) as well as metric data (i.e. regular time series). We’re giving our users the ability to create time series on the fly from raw, high-precision data and, by building on open source standards, giving them unprecedented choice in the tools they can use.
What’s also incredibly exciting is our InfluxDB Cloud users are already being upgraded automatically behind the scenes. They don’t need to do anything or take any action – over time, they’ll automatically get new functionality, SQL compatibility, and unbounded cardinality so they’ll be able to write and query more volume and variety than ever before.
About Andrew Lamb
Andrew Lamb is a Staff Engineer at InfluxData, working on InfluxDB IOx, and a member of the Apache Arrow PMC. His experience ranges from startups to large multinational corporations and distributed open source projects, and he has paid leadership dues as an architect and manager/VP. He holds an SB and MEng from MIT in Electrical Engineering and Computer Science.
Sponsored by InfluxData