On Learned Index Structures. Interview with Alex Beutel
” Learned indexes are able to learn from and benefit from patterns in the data and the workload. Most previous data structures were not designed to optimize for a particular distribution of data.” –Alex Beutel
I have interviewed Alex Beutel, Senior Research Scientist in the Google Brain SIR team. We talked about “Learned Index Structures“- data structures thought of as performing prediction tasks- their difference with respect to traditional index structures and their main benefits.
Q1. What is your role at Google?
Alex Beutel: I’m a research scientist within Google AI, specifically the Google Brain team. I focus on a mixture of recommender systems, machine learning fairness, and machine learning for systems. While these may sound quite different, I think they are all areas of machine learning application with unique, rich challenges and opportunities driving from understanding the data distribution.
Q2. You recently published a paper on so called Learned Index Structures . In the paper, you stated that Indexes (e.g B-Tree-Index, Hash-Index, BitMap-Index) can be replaced with other types of models, including deep-learning models, which you term learned indexes. Why do you want to replace well known Index-structures?
Alex Beutel: Traditional index structures are fundamental to databases and computer science in general, so they are important to study and have been deeply studied for a long time. I think whenever you can find a new perspective on such a well-studied area, it is worth exploring. In this case, we challenge the assumptions in data structure design by jumping from the more traditional discrete structures to continuous, stochastic components that can make mistakes. However, by taking this perspective, we find that we now have at our disposal a whole breadth of tools from the machine learning, data mining, and statistics communities that we can bring to bear on databases and more broadly data systems problems. Personally, rethinking these fundamental tasks with this new lens has been extremely exciting and fun.
Q3. What is the key idea for learned indexes?
Alex Beutel: The key idea for learned indexes is that many data structures can be thought of as performing prediction tasks, and as a result rather than building a discrete structure, use machine learning to build a model for the task .
Q4. What are the main benefits of learned indexes? Which applications could benefits from such learned indexes?
Alex Beutel: I want to separate what are the possible benefits and when or why can learned indexes realize those benefits. At a high level, using machine learned models lets us build data structures from a new broader set of tools. We have found that depending on the learned index configuration, we are able to get improvements in latency (speed), memory usage, and computational cost of running the index structure. Depending on the application, we can tune the learned index to get more savings in one or more of these dimensions. For example, in the paper we propose a hierarchical model structure, and we show that we can build a larger hierarchy and use more memory to get an even faster lookup or use a much smaller hierarchy to save memory and still not make the system too slow.
Why and when we are able to realize these benefits is a much more complicated question. One of the big advantages is that machine learning models make use of floating point operations which can be more easily parallelized with modern hardware, and with the growth of GPUs and TPUs, we may be able to build bigger and more accurate models without increasing latency.
Another aspect that I find exciting is that learned indexes are able to learn from and benefit from patterns in the data and the workload. Most previous data structures were not designed to optimize for a particular distribution of data. Rather, they often assume a worst-case distribution or ignore it entirely. But data structures aren’t being used in the abstract — they are being used on real data, which as we know from other areas of research, have many significant patterns. So one could ask, how can we make use of the patterns in the data being stored or processed to improve the efficiency of systems? ML models are extremely effective in adapting to those varying data distributions.
I think any application that is processing large amounts of data stands to benefit from taking this perspective. We focused on index structures in databases, but we have already seen multiple papers being published applying this perspective to new systems.
Q5. How can learned indexes learn the sort order or structure of lookup keys and use this signal to predict the position or existence of records?
Alex Beutel: B-Trees are already predicting the positions of records: they are built to give the block in which a record lies, and they do this just by processing the key. Learned indexes can do the same thing where they predict approximately where the record is. For example, if the keys are all even integers from 100 to 1000 (that is, key=100 has position 0, key=102 has position 1, key=104 has position 2, etc.), then the model f(key) = (key – 100)/2 will perfectly map from keys to positions. If the data aren’t exactly the even integers but on average we see one key every 2 spots (for example, keys: 100, 101, 105, 106, 109, 110, …) then f(key) above is still a pretty good model and for any key the model will almost find the exact position. Even if the data follow a more complicated pattern, we can learn a model to understand the distribution. It turns out that this is learning the cumulative distribution function, which has long been studied in statistics. This is exciting in that for those examples above, lookups become a constant-time operation, rather than growing with the size of the data; and more generally, this could change how we think about the complexity of these functions.
One challenge is that we can’t just return the approximate position; these data structures need to return the actual record being searched for. Typically, B-Trees will then scan through the block where the key is to find the exact right position. Likewise, when using a learned index, the model may not give the exact right position, but instead a close by one.
To return exactly the correct record, we search near the predicted position to find it; and the more accurate the model is, the faster the search will be.
Knowing if a record exists is quite different. Traditionally, Bloom filters have been used for this task; given a key, the Bloom filter will tell you if the key exists in the dataset, and if the key isn’t in the dataset the Bloom filter will mistakenly tell you it is with some small probability, called the false positive rate (FPR). This is a binary prediction problem: given a key, predict whether it’s in the dataset. Unlike traditional Bloom filters, we learn a model that tries to learn if there is some systematic difference between keys in the dataset and other questions (queries) asked of the Bloom filter. That is, if the dataset has all positive integers less than 1000, there is a trivial model g(key) := 1000 > key > 0 that can perfectly answer any query. If the dataset has all positive integers less than 1000 except for 517 then this is still a pretty good model with very few mistakes (FPR = 0.1%). If the dataset is malware URLs, these patterns are less obvious, but in fact lots of researchers have been studying what patterns are indicative of malware URLs (and distinguish them from normal webpage URLs), and we can build models to make use of these systematic differences.
From an accuracy perspective, Bloom filters have stringent requirements about no false negatives and low FPR, and so we build systems that combine machine learning classifiers and traditional Bloom filters to meet these requirements.
Q6. Under which conditions learned indexes outperform traditional index structures?
Alex Beutel: As mentioned above, I think there are a few key conditions for learned indexes being beneficial. First and foremost, it depends on the patterns of the data and workload being processed. In the range query case (B-Trees), if the data follow a linear pattern then learned indexes will easily excel; more complex data distributions may require more complex model structures which may not be okay for the application at hand. For existence indexes, the success of the model depends on how easily it can distinguish between keys in the dataset and real queries to the Bloom filter; distinguishing between even and odd integers is easy, but if the dataset is entirely random keys this will be very difficult.
In addition to making use of patterns in the data and workload, learned indexes depend on the environment they are being used in. For example, we study in-memory databases in our paper, and more recently we have found that disk-based systems require new techniques. For our learned Bloom filters we assume that saving memory is most important, but if there is a strict latency requirement, then the model design may need to change. If GPUs and TPUs were readily available, the learned index design would likely change dramatically.
Q7. What are the main challenges in designing learned index structures?
Alex Beutel: I think there are interesting challenges both in system design and in machine learning.
For systems, machine learned models provide much looser guarantees about accuracy than traditional data structures.
As a result, making use of ML models’ noisy predictions requires building systems that are robust to those errors.
In the B-Tree case we studied different local search strategies. For existence indexes we coupled the model with a Bloom filter to guarantee no false negatives. Interestingly, new research by Michael Mitzenmacher has shown that sandwiching the model between two Bloom filters does even better . I believe there are lots of interesting questions about (a) what is the right prediction task for machine learning models when incorporated in a system and (b) how should these models be safely integrated in the system.
On the machine learning side there are numerous challenges in building models that match the needs of these systems.
For example, most machine learning models are expected to execute on the order of milliseconds or slower; for learned indexes we often need the model to execute thousands of times faster. Tim Kraska, the first author on our paper, did a lot of optimizations for very fast execution of the model. In most of machine learning, overfitting is bad; for learned indexes that is not true — how should that change model design? How do I build model families that can trade-off memory and latency?
How do I build models that match the hardware they are running on, from parallelization to caching effects?
While these are challenges to making learned indexes work, they also present opportunities for interesting research from different communities working together.
Alex Beutel: We found some really great benefits. Depending on the use case learned indexes were able to be up to 3 times faster and in some cases use only 1% of the memory of a traditional B-Tree.
Q9. What is the implication of replacing core components of a data management system through learned models for future systems designs?
Alex Beutel: As I mentioned above, there have already been multiple papers applying these ideas to new core components, and we have been studying how to extend these ideas to a wide range of areas from indexing multidimensional data to sorting algorithms . We have seen similar opportunities and excitement in systems beyond databases, such as research for scheduling and caching.
My hope is that more folks building data management systems, and really any system that is processing data, think about if there are patterns in the data and workload the system is processing. Because there most likely are patterns, and I believe building new systems that can be customized and optimized for those patterns will greatly improve the systems’ efficiency.
Alex Beutel is a Senior Research Scientist in the Google Brain SIR team working on neural recommendation, fairness in machine learning, and ML for Systems. He received his Ph.D. in 2016 from Carnegie Mellon University’s Computer Science Department, and previously received his B.S. from Duke University in computer science and physics. His Ph.D. thesis on large-scale user behavior modeling, covering recommender systems, fraud detection, and scalable machine learning, was given the SIGKDD 2017 Doctoral Dissertation Award Runner-Up. He received the Best Paper Award at KDD 2016 and ACM GIS 2010, was a finalist for best paper in KDD 2014 and ASONAM 2012, and was awarded the Facebook Fellowship in 2013 and the NSF Graduate Research Fellowship in 2011. More details can be found at alexbeutel.com.
 Michael Mitzenmacher. A Model for Learned Bloom Filters, and Optimizing by Sandwiching. NeurIPS, 2018.
Stanford Seminar – The Case for Learned Index Structures. EE380: Computer Systems. Speakers: Alex Beutel and Ed Chi, Google, Published on Oct 18, 2018 (LINK to YouTube Video)
On Data, Exploratory Analysis, and R. Q&A with Ronald K. Pearson, ODBMS.org, April 13, 2018
On Apache Kafka®. Q&A with Gwen Shapira, ODBMS.org, March 26, 2018.
How to make Artificial Intelligence fair, transparent and accountable, ODBMS.org, January 27, 2018
Follow us on Twitter: @odbmsorg