On HeatWave MySQL: Query Execution, Performance, Benchmarks, and the VECTOR Type. Q&A with Kaan Kara.

Q1. You are working as a lead developer mainly responsible for query execution in HeatWave MySQL. What are the challenges that you try to solve? 

HeatWave is an in-memory data processing engine, so memory efficiency is extremely important for us to optimize. Our execution engine employs many advanced algorithms to achieve this efficiency, offering a very high ratio of data size to memory usage. For instance, we can run the 1TB TPC-H benchmark using only 512GB of memory. One of the important algorithms that makes this possible is our distributed bloom-filter enhanced join: we build SIMD-friendly bloom-filters and apply them in a distributed manner. This allows us to filter data before we even start processing it, effectively reducing the amount of intermediate data that we need to process upstream. Another important area for us is, of course, query execution performance. The HeatWave engine is designed from the ground up to process data at massive scale, employing thousands of cores to process a single query, and we are constantly working on improving and taking advantage of this aspect as well. Recently, we introduced execution-time cardinality estimators to facilitate dynamic execution decisions, such as correctly sizing a hash-table for our relational operators.
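HeatWave's actual estimators are not public, but the idea can be illustrated with a minimal sketch: a k-minimum-values (KMV) distinct-count estimator computed in one pass over the join keys, whose estimate is then used to size a hash table. All names and parameters below are illustrative, not HeatWave's implementation:

```python
import hashlib

def kmv_estimate(keys, k=64):
    """K-minimum-values sketch: keep the k smallest distinct 64-bit hash
    values seen in one pass; their spread estimates the distinct count."""
    mins = set()
    for key in keys:
        h = int.from_bytes(
            hashlib.blake2b(str(key).encode(), digest_size=8).digest(), "big")
        if h in mins:
            continue
        if len(mins) < k:
            mins.add(h)
        elif h < max(mins):
            mins.remove(max(mins))
            mins.add(h)
    if len(mins) < k:
        return len(mins)  # saw fewer than k distinct values: exact count
    # The k-th smallest hash sits at roughly fraction max(mins)/2^64
    # of the hash space, so distinct count ~ (k-1) * 2^64 / max(mins).
    return (k - 1) * (2 ** 64) // max(mins)

def hash_table_buckets(estimated_distinct, load_factor=0.5):
    """Size the hash table from the runtime estimate: round up to the
    next power of two above estimate / load_factor."""
    n = max(1, int(estimated_distinct / load_factor))
    return 1 << (n - 1).bit_length()
```

With an accurate runtime estimate, the operator avoids both over-allocating memory and rehashing mid-build; the estimate's relative error shrinks as k grows (roughly 1/sqrt(k)).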

Q2. In your experience, what are the key features that substantially improve the performance and the memory efficiency of a query execution engine? 

When it comes to performant and efficient query execution, especially in a massively distributed system such as HeatWave, many aspects come into play: from join order optimization to logical transformations, to physical compilation, and finally scale-out parallel execution. Every step of this process can make or break execution efficiency, so in HeatWave we take a holistic approach to this very complex problem. That usually means that implementing an execution-side optimization by itself is not enough: after introducing an optimization, we need to integrate it carefully with our query optimizer as well as our logical and physical query compilers. Take the runtime cardinality estimators I mentioned earlier: they are not just an execution-side optimization. The physical compiler decides when to utilize them, depending on the accuracy of our existing statistics, and our join order optimizer needs to be aware of them to accurately cost join subtrees, potentially leading to different join orders. As you can imagine, it is really this holistic approach to implementing such optimizations that makes HeatWave so good as a data processing engine.

Q3. You have been working on bloom-filter enhanced distributed joins. What does it mean? And what is it useful for? 

Simply put, bloom-filters are approximate data structures that tell you whether a key is present in a set or not. Thanks to their approximate nature, they are usually much smaller and more efficient to work with than exact lookup structures. For low-cardinality joins, where only a small subset of rows from either side matches, they are perfect for prefiltering rows before the actual join takes place. This prefiltering becomes especially beneficial in distributed settings, because there you need to partition the data before the join. The partitioning itself is a costly operation, since it shuffles data across multiple nodes in the cluster, over the network. The good news is that the bloom-filter prefiltering can happen even before the partitioning, potentially saving massive amounts of work. So it increases memory efficiency and improves performance at the same time. In HeatWave, we implemented a SIMD-friendly bloom-filter, which makes the lookups extremely efficient thanks to our use of SIMD instructions.
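The real implementation is native SIMD code, but the underlying idea — a register-blocked bloom filter in which each key touches a single word-sized block, used to prefilter probe rows before they are shuffled — can be sketched in plain Python (all names are illustrative):

```python
import hashlib

BLOCK_BITS = 64  # one block = one machine word here; SIMD versions test
                 # all bits of a wider 256/512-bit block in one instruction

class BlockedBloomFilter:
    """Each key maps to exactly one block and sets/tests a few bits inside
    it, so a lookup touches a single cache line (SIMD/cache friendly)."""
    def __init__(self, num_blocks=1024, bits_per_key=4):
        self.num_blocks = num_blocks
        self.bits_per_key = bits_per_key
        self.blocks = [0] * num_blocks

    def _block_and_mask(self, key):
        h = int.from_bytes(
            hashlib.blake2b(str(key).encode(), digest_size=16).digest(), "big")
        block = h % self.num_blocks
        mask = 0
        for _ in range(self.bits_per_key):
            h >>= 8
            mask |= 1 << (h % BLOCK_BITS)
        return block, mask

    def add(self, key):
        block, mask = self._block_and_mask(key)
        self.blocks[block] |= mask

    def may_contain(self, key):
        # No false negatives; rare false positives let a few extra rows
        # through, which the exact join later discards.
        block, mask = self._block_and_mask(key)
        return self.blocks[block] & mask == mask

# Typical use in a distributed join: build the filter from the build-side
# join keys, broadcast it, and drop probe-side rows before the shuffle:
#   to_shuffle = [row for row in probe_rows if bf.may_contain(row_key(row))]
```

Because the filter is small enough to broadcast cheaply, every node can discard non-matching rows locally, before any network transfer or partitioning work happens.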

Q4. Do these improvements lead to multi-factor geomean reductions in analytical benchmarks, such as TPC-H and TPC-DS?

Definitely. There are many queries in these benchmarks that benefit from these kinds of optimizations. To give a concrete example, the runtime of TPC-H Q21 dropped by 3x while using 10x less memory after we started employing the bloom-filters discussed above. We are constantly introducing such advanced optimizations to make HeatWave even faster. Another one I can mention is a feature we call lazy decompression. Our base relation data is compressed, which means we need to decompress it during execution. In a nutshell, with lazy decompression we push our filtering down below the decompression, so that we can skip decompressing data not needed by the query. Better yet, the filters that can be pushed down are not just normal filters: even bloom-filters can be pushed down to facilitate lazy decompression. That is neat, and it again gives us very good improvements on the TPC-H and TPC-DS benchmarks.
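As a toy illustration of the pushdown idea (not HeatWave's actual storage format), one can keep lightweight per-block metadata next to each compressed block and evaluate the filter against that metadata first, decompressing only the blocks that can contain matches:

```python
import json
import zlib

def compress_block(values):
    """Store a zone map (min/max) alongside the compressed payload so a
    range filter can be evaluated without decompressing anything."""
    payload = zlib.compress(json.dumps(values).encode())
    return {"min": min(values), "max": max(values), "payload": payload}

def scan_with_pushdown(blocks, lo, hi):
    """Return all values in [lo, hi], decompressing only the blocks whose
    zone map overlaps the predicate range."""
    out, skipped = [], 0
    for blk in blocks:
        if blk["max"] < lo or blk["min"] > hi:
            skipped += 1  # pruned: this block is never decompressed
            continue
        values = json.loads(zlib.decompress(blk["payload"]))
        out.extend(v for v in values if lo <= v <= hi)
    return out, skipped
```

The same shape works with a pushed-down bloom-filter: replace the zone-map check with a probe of the block's key summary, and only blocks that may hold matching join keys get decompressed.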

Q5. What are the consequences of reducing the memory requirements of the in-memory execution engine?

The most important aspect of reducing memory requirements is lowering the resources needed per unit of base relation data. By implementing such features, we lower the minimum amount of resources our customers need to provision for their workloads, giving them more choice in what they optimize for.

Q6. You are the lead developer who introduced the new VECTOR type to MySQL, along with highly optimized vector processing functions within HeatWave. Please tell us a bit about it. How does MySQL with a VECTOR type differ from a dedicated vector database?

Vector databases have become prevalent for the retrieval-augmented generation (RAG) that LLM applications perform. In summary, vector databases enable semantic search based on prompts, so that the LLM can be fed relevant information from a database. As a result, it can generate more accurate answers, reducing the chances of hallucination. Many dedicated vector databases have popped up recently thanks to the popularity of RAG. They focus mainly on efficient semantic search, yet they miss fundamental data management features such as efficient ACID-compliant transactions, high-performance analytics, high availability via replication, and so on. These are essential aspects of data management that relational databases have developed over multiple decades. In essence, it is much easier to add vector processing and semantic search capabilities to a relational database than to add fundamental data management features to a dedicated vector database. And this is what we have recently done by adding the VECTOR type to MySQL and HeatWave.

Q7. What are these highly optimized vector processing functions within HeatWave? And what are they useful for?

The vector processing functions calculate the distance between two vectors, which is exactly what semantic search needs: finding the vectors that are closest to a query vector in the embedding space. We have implemented these distance functions using SIMD capabilities and have shown that we can saturate DRAM bandwidth when performing the search. Being an in-memory engine, this also makes the use of indexes unnecessary most of the time. Serving high-performance lookups without an index has benefits such as providing exact accuracy instead of approximate results and avoiding the work of both building and maintaining indexes, while still allowing high-throughput transaction processing; see this blog post.
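Conceptually, the distance functions are straightforward; the engineering lies in the SIMD implementation. A plain-Python sketch of two common distance measures and an exact brute-force top-k search (illustrative only, not HeatWave's code) looks like this:

```python
import heapq
import math

def l2_squared(a, b):
    # Squared Euclidean distance; SIMD versions compute many
    # (a[i] - b[i])**2 lanes per instruction.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for parallel vectors, 1 for orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def top_k_exact(query, vectors, k=3, distance=l2_squared):
    """Brute-force scan over (id, vector) pairs: exact nearest neighbors,
    no index needed; in-memory, the scan is DRAM-bandwidth bound."""
    return heapq.nsmallest(k, vectors, key=lambda item: distance(query, item[1]))
```

Because the scan is exact, there is no recall/accuracy trade-off to tune, which is one of the benefits of indexless search mentioned above.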

Q8. Will you suggest using HeatWave for semantic search and retrieval-augmented generation? 

Definitely. To give an example of what this enables users to achieve that is not possible in dedicated vector databases: users can run complex analytics queries, joining multiple tables and performing group-by or window-function operations, and combine those queries with semantic search. All this functionality and flexibility of SQL, running on a highly efficient relational data processing engine such as HeatWave, opens unprecedented opportunities for RAG applications that combine semantic search with advanced real-time analytics.

Q9. Prior to joining Oracle, you received a doctoral degree in 2020 from the Systems Group in the Computer Science Department at ETH Zurich. What was your research focused on? And how, if at all, does it influence your work at Oracle?

My research focused on using reconfigurable hardware devices such as FPGAs to accelerate data analytics and machine learning operations. We showed in many instances that dedicated hardware can accelerate important operations, such as high-fanout data partitioning. The flexibility you get from designing custom hardware for a particular task is unparalleled in terms of the efficiency and optimizations you can achieve. My experience in that field definitely influences my work at HeatWave, especially in thinking about optimizations from a computer architecture perspective. Efficient data processing systems need to be designed from the ground up to utilize the hardware as optimally as possible, so oftentimes it is important to wear the computer-architect hat even when designing software-only solutions.

………………………………………………..

Kaan Kara

Kaan is a principal member of technical staff at Oracle, working as a lead developer mainly responsible for query execution in HeatWave MySQL.

As part of the HeatWave team, he has led multiple projects that substantially improved the performance and memory efficiency of the query execution engine. These projects include pipelined relational operator execution, bloom-filter enhanced distributed joins, base relation compression, and lazy decompression optimizations. Collectively, these improvements led to multi-factor geomean runtime reductions in analytical benchmarks such as TPC-H and TPC-DS, while reducing the memory requirements of the in-memory execution engine, enabling a single HeatWave node with 512GB of memory to run the 1TB TPC-H benchmark in full.

More recently, he was the lead developer who introduced the new VECTOR type to MySQL, along with highly optimized vector processing functions within HeatWave, laying the data-layer foundation for highly anticipated vector store features within HeatWave, such as semantic search and retrieval-augmented generation.

Prior to joining Oracle, Kaan received his doctoral degree in 2020 from the Systems Group in the Computer Science Department at ETH Zurich. His research focused on using reconfigurable hardware devices (FPGAs) to accelerate data analytics. He has published papers in top database venues such as VLDB and SIGMOD, showcasing the potential benefits of FPGA-based implementations for data partitioning and in-database machine learning tasks.

Sponsored by MySQL/Oracle
