On Columnar Storage. Q&A with Benjamin Deboe
” With Columnar Storage, we flip the storage on its side and organize data by column, in handy chunks of 64,000 values and encode them in a dedicated vector format that optimizes for individual datatypes. “
Q1. Columnar Storage is now available in early access with InterSystems IRIS. What is it?
Columnar Storage is a new storage option for relational tables, one of the primary data models supported by InterSystems IRIS. In the past, InterSystems IRIS would store relational table data in a row-by-row layout by default, but with InterSystems IRIS 2022.2, you can choose to have it physically organized by column. This is a simple table-level setting that’s entirely transparent to applications or query tools accessing the tables, yet it offers order-of-magnitude performance advantages for analytical query workloads.
Q2. What are the benefits of this storage option for SQL tables?
In two words: faster analytics! Columnar Storage optimizes the physical layout of your data on disk and in memory, as well as how it’s handled at runtime, to ensure lightning-fast analytical querying. The efficiency gains include not only shorter response times for individual queries, but also less IO, better cache reuse and therefore an improved overall system throughput.
In addition to exceptional speed, there are also notable savings in the disk storage required to store your tables. The numbers will vary based on your data’s characteristics and may appear small when compared to our traditional storage, as it already dealt very efficiently with sparse data. Nonetheless, it’s a net win especially when you compare it to the footprint of file sources.
Q3. What kind of use cases benefit from the Columnar Storage and what use cases do not?
While we believe we implemented it in a really cool way, Columnar Storage as a concept was not invented at InterSystems. For many years, there’s been an entire market segment of columnar-organized databases, with household names such as Vertica, Clickhouse, Snowflake, and Amazon Redshift. This market segment has focused squarely on analytical use cases, and that’s where our columnar storage option belongs too: queries that need to scan large numbers of rows, typically with complex aggregations and groupings. Data usually gets inserted and archived in bulk, with few updates if any. These technologies are often deployed in Data Warehouse, Data Mart and similar use cases.
This domain is very different from operational use cases such as an order booking system, a financial trade management solution, or a call center app, where you’re only looking up and working with small numbers of rows in any given query and transactions are crucial to keep the data consistent. If you look at the database market as a whole, you’ll quickly see that most vendors (and especially in the cloud) still target either operational or analytical use cases. That’s because this low-level storage layout is so fundamental to the way a database reads and writes data, implicitly and explicitly defining the patterns it’s good at.
Q4. Specifically, how does Columnar Storage work in practice?
For those not too familiar with the InterSystems IRIS Data Platform, let me briefly sketch its architecture. At the core of the platform is a highly optimized key-value style data engine with its own language runtime in which the only difference between a local variable and a persisted one is a single character at the start of the variable name. These persistent variables, also known as “globals”, are organized as multidimensional arrays and cached automatically by the engine. The cache can be distributed and kept consistent across compute nodes, offering an incredibly simple scale-out option in which application code can interact with data persisted on another node with the same ease as if it were a local variable. It’s only on top of this infrastructure that we offer our relational access layer for SQL applications, next to other supported data models such as Objects, Documents, and more.
Columnar Storage, and the earlier row-based format, define how your relational data in SQL is persisted in globals. For row storage, each row is encoded in a flexible list format that supports any scalar datatype and goes into a separate global node. This implies only a single IO operation is needed to read or write a row and ensures very fast transactions. For analytical queries scanning millions of rows, obviously, the number of IO operations starts adding up, and in addition, you still need to extract the requested column values out of the full row’s list encoding.
With Columnar Storage, we flip the storage on its side and organize data by column, in handy chunks of 64,000 values and encode them in a dedicated vector format that optimizes for individual datatypes. For example, string vectors will use a dictionary encoding by default, while integer vectors automatically adapt the number of bytes required to store each value based on the highest value in the chunk. These vectors come with a whole series of dedicated operations that leverage SIMD instructions which we use in query processing for extremely efficient “vectorized” query plans that push whole chunks of data through a chipset-level pipeline rather than operate on individual row values in an interpreted language.
Q5. Why Columnar Storage enables order-of-magnitude faster analytical queries?
The design elements I just described, stretching across storage and compute, are what’s delivering these spectacular gains. When we run low-level profiling tools during analytical query execution, we see most of the time is spent exactly where we want it to: in the SIMD-optimized vector operations that build aggregates, filters and whatnot, operating of 64,000 data values at a time. That means that everything else, including typical bottlenecks such as IO, unpacking values and memory access, are no longer constraining performance and we’re effectively only bound by the CPU cores the system has access to.
Q6. Is Columnar Storage compatible with the traditional Relational SQL storage?
This is a great question as many of our competitors would struggle to answer it to satisfaction! Of course, the answer is yes, as we engineered this to be just a storage setting and otherwise blend in 100% with the rest of the SQL engine.
Our heritage helped us a bit here. When we first designed our SQL engine a few decades ago, we already had a large number of customers that had architected their applications based on custom globals structures (think key-value with composite keys and complex values). In order to offer them SQL access to this data, we needed to design a very expressive projection model and a really smart query optimizer. The flexible concepts for master data and index maps we put in place have now really paid off when architecting Columnar Storage. They allowed us to not only offer it as a table-level option but also refine it to the column level or index type for row-organized tables. This flexibility is unique in the market, and it’s all transparent to applications reading or writing data to the tables.
Q7. If you choose to store some data in a classical row-based layout and some data in columns, how do you handle transactions on both sets of data?
Our infrastructure for managing transactions operates at that lower-level storage concept we call globals. We implemented atomic updates for vectors to keep transactions fast, and we get all the ACID properties for free thanks to our architecture.
Q8. Is it possible to join the Early Access Program? How?
Yes. The Early Access Program launched last year at our annual conference where we announced the new feature, and since then we’ve regularly released updated kits incorporating enhancements, fixes and feedback we learned through the program. Participants get a free temporary license to try the new technology on InterSystems IRIS, a first demo script to get you started and tailored advice on how to leverage it with their own data. Please register at https://www.intersystems.com/columnar-storage or reach out to me on LinkedIn or our Developer Community.
Q9. Anything else you wish to add?
Thank you for the opportunity to talk about this new capability! We’ve been working on it for quite a while now with a large and diverse team, and as such are really excited to see more and more uptake through the Early Access Program. And in case you’re not as adventurous when it comes to software compared to say traveling or food, Columnar Storage is set to become a GA feature by the end of the first quarter of 2023, as part of the InterSystems IRIS 2023.1 release.
System Development, Product Manager, InterSystems
Benjamin is a product manager in the Data Platforms group at InterSystems, looking after the areas of scalability and analytics. He joined InterSystems in 2010 as part of the iKnow acquisition and has worked with various database technologies, mostly in the areas of data warehousing, natural language processing and anything analytics.
Sponsored by InterSystems