The future of data management: “Disk-less” databases? Interview with Goetz Graefe
“With no disks and thus no seek delays, assembly of complex objects will have different performance tradeoffs. I think a lot of options in physical database design will change, from indexing to compression and clustering and replication.” — Goetz Graefe.
Are “disk-less” databases the future of data management? What about the issue of energy consumption and databases? Will we have “Green Databases”?
I discussed these issues with Dr. Goetz Graefe, HP Fellow (*), one of the most accomplished and influential technologists in the areas of database query optimization and query processing.
Hope you’ll enjoy this interview.
Q1. The world of data management is changing. Service platforms, scalable cloud platforms, analytical data platforms, NoSQL databases and new approaches to concurrency control are all becoming hot topics in both academia and industry. What is your take on this?
Goetz Graefe: I am wondering whether new and imminent hardware, in particular very large RAM memory as well as inexpensive non-volatile memory (NV-RAM, PCM, etc.) will be significant shifts that will affect software architecture, functionality, scalability, total cost of ownership, etc.
Hasso Plattner in his keynote at BTW (“SanssouciDB: An In-Memory Database for Processing Enterprise Workloads”) definitely seemed to think so. Whether or not one agrees with everything he said, he traced many of the changes back to not having disk drives in their new system (except for backup and restore).
For what it’s worth, I suspect that, based on the performance advantage of semiconductor storage, the vendors will ask for a price premium for semiconductor storage for a long time to come. That enables disk vendors to sell less expensive storage space, i.e., there’ll continue to be a role for traditional disk space for colder data such as long-ago history analyzed only once in a while.
I think there’ll also be a differentiation by RAID level. For example, warm data with updates might be on RAID-1 (mirroring) whereas cold historical data might be on RAID-5 or RAID-6 (dual redundancy, no data loss in dual failure). In the end, we might end up with more rather than fewer levels in the memory hierarchy.
Q2. What is the expected impact of large RAM (volatile) memory on database technology?
Goetz Graefe: I believe that large RAM (volatile) memory has already made a significant difference. SAP’s HANA/Sanssouci/NewDB project is one example, C-store/VoltDB is another. Other database management system vendors are sure to follow.
It might be that NoSQL databases and key-value stores will “win” over traditional databases simply because they adapt faster to the new hardware, even if purely because they currently are simpler than database management systems and contain less code that needs adapting.
Non-volatile memory such as phase-change memory and memristors will change a lot of requirements for concurrency control and recovery code. With storage in RAM, including non-volatile RAM, compression will increase in economic value, sophistication, and compression factors. Vertica, for example, already uses multiple compression techniques, some of them pretty clever.
Q3. Will we end up having “disk less” databases then?
Goetz Graefe: With no disks and thus no seek delays, assembly of complex objects will have different performance tradeoffs. I think a lot of options in physical database design will change, from indexing to compression and clustering and replication.
I suspect we’ll see disk-less databases where the database contains only application state, e.g., current account balances, currently active logins, current shopping carts, etc. Disks will continue to have a role and economic value where the database also contains history, including cold history such as transactions that affected the account balances, login & logout events, click streams eventually leading to shopping carts, etc.
Q4. Where will the data go if we have no disks? In the Cloud?
Goetz Graefe: Public clouds in some cases, private clouds in many cases. If “we” don’t have disks, someone else will, and many of us will use them whether we are aware of it or not.
Q5. New developments in memory (including flash) may reduce the energy consumed when using a database. Are we going to see “Green Databases” in the near future?
Goetz Graefe: I think energy efficiency is a terrific question to pursue. I know of several efforts, e.g., by Jignesh Patel et al. and Stavros Harizopoulos et al. Your own students at DBIS Goethe Universität just did a very nice study, too.
It seems to me there are many avenues to pursue.
For example, some people just look at the most appropriate hardware, e.g., high-performance CPUs such as Xeon versus high-efficiency CPUs such as Centrino (back then). Similar thoughts apply to storage, e.g., (capacity-optimized) 7200 rpm SATA drives versus (performance-optimized) 15K rpm Fibre Channel drives.
Others look at storage placement, e.g., RAID-1 versus RAID-5/6, and at storage formats, e.g., columnar storage & compression.
Others look at workload management, e.g., deferred index maintenance (& view maintenance) during peak load (perhaps the database equivalent to load shedding in streams) or ‘pause and resume’ functionality in utilities such as index creation.
Yet others look at caching, e.g., memcached. Etc.
Q6. What about workload management?
Goetz Graefe: Workload management really has two aspects: the policy engine including its monitoring component that provides input into policies, and the engine mechanisms that implement the policies. It seems that most people focus on the first aspect above. Typical mechanisms are then quite crude, e.g., admission control.
I have long been wondering about more sophisticated and graceful mechanisms. For example, should workload management control memory allocation among operations? Should memory-intensive operations such as sorting grow & shrink their memory allocation during their execution (i.e., not only when they start)? Should utilities such as index creation participate in resource management? Should index creation (etc.) support ‘pause and resume’ functionality?
It seems to me that I’d want to say ‘yes’ to all those questions. Some of us at Hewlett-Packard Laboratories have been looking into engine mechanisms in that direction.
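As an illustration of what ‘pause and resume’ could look like for a utility such as index creation, here is a minimal Python sketch. It is not a description of any HP or vendor mechanism: the class `ResumableIndexBuild`, its `batch_size` parameter, and the `should_pause` callback are all hypothetical names chosen for this example. The idea shown is only that the utility works in small batches, checks a workload-management signal between batches, and keeps enough state to continue later without redoing work.

```python
class ResumableIndexBuild:
    """Hypothetical index build that can be paused between batches."""

    def __init__(self, rows, batch_size=2):
        self.rows = rows              # column values to index, by row position
        self.batch_size = batch_size  # rows processed between pause checks
        self.next_row = 0             # resume point (the saved progress)
        self.index = {}               # key -> list of row positions

    def run(self, should_pause):
        """Build until finished or until should_pause() returns True.

        Returns True when the index is complete, False when paused.
        """
        while self.next_row < len(self.rows):
            if should_pause():
                return False  # yield resources; progress is kept
            end = min(self.next_row + self.batch_size, len(self.rows))
            for pos in range(self.next_row, end):
                self.index.setdefault(self.rows[pos], []).append(pos)
            self.next_row = end
        return True


rows = ["b", "a", "b", "c", "a"]
build = ResumableIndexBuild(rows, batch_size=2)

# Assumed policy signal: allow one batch, then ask the build to pause.
signals = iter([False, True])
first_done = build.run(lambda: next(signals))
paused_at = build.next_row

# Later, when load has dropped, resume from the saved position.
second_done = build.run(lambda: False)
print(first_done, paused_at, second_done, build.index)
```

A real engine would also have to persist the resume point and handle updates arriving while the build is paused; the sketch leaves both out.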
Q7. What are the main research questions for data management and energy efficiency?
Goetz Graefe: I’ve recently attended a workshop by NSF on data management and energy efficiency.
The topic was split into data management for energy efficiency (e.g., sensors & history & analytics in smart buildings) and energy efficiency in data management (e.g., efficiency of flash storage versus traditional disk storage).
One big issue in the latter set of topics was the difference between traditional performance & scalability improvements versus improvements in energy efficiency, and we had a hard time coming up with good examples where the two goals (performance, efficiency) differed. I suspect that we’ll need cases with different resources, e.g., trading 1 second of CPU time (50 joules) against 3 seconds of disk time (20 joules).
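The CPU-versus-disk example can be written out as a small Python sketch. The two figures (1 second at 50 joules, 3 seconds at 20 joules) come from the answer above; the `Plan` class and the plan names are hypothetical, introduced only to show that a time-based choice and an energy-based choice pick different plans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str       # hypothetical label for an execution alternative
    seconds: float  # elapsed time
    joules: float   # energy consumed


# Figures taken from the example in the interview.
cpu_heavy = Plan("cpu_heavy", seconds=1.0, joules=50.0)
disk_heavy = Plan("disk_heavy", seconds=3.0, joules=20.0)
plans = [cpu_heavy, disk_heavy]

# A performance-oriented choice minimizes elapsed time ...
fastest = min(plans, key=lambda p: p.seconds)
# ... while an efficiency-oriented choice minimizes energy.
most_frugal = min(plans, key=lambda p: p.joules)

print("fastest:", fastest.name)
print("lowest energy:", most_frugal.name)
```

Here the CPU-heavy plan finishes three times sooner but uses 2.5 times the energy, which is the kind of case where the two goals diverge.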
NSF (the US National Science Foundation) seems to be very keen on supporting good research in these directions. I think that’s very timely and very laudable. I hope they’ll receive great proposals and can fund many of them.
Dr. Goetz Graefe is an HP Fellow, and a member of the Intelligent Information Management Lab within Hewlett-Packard Laboratories. His experience and expertise are focused on relational database management systems, gained in academic research, industrial consulting, and industrial product development.
His current research efforts focus on new hardware technologies in database management as well as robustness in database request processing in order to reduce total cost of ownership. Prior to joining Hewlett-Packard Laboratories in 2006, Goetz spent 12 years as software architect in product development at Microsoft, mostly in database management. Both query optimization and query execution of Microsoft’s re-implementation of SQL Server are based on his designs.
Goetz’s areas of expertise within database management systems include compile-time query optimization including extensible query optimization, run-time query execution including parallel query execution, indexing, and transactions. He has also worked on transactional memory, specifically techniques for software implementations of transactional memory.
Goetz studied Computer Science at TU Braunschweig from 1980 to 1983.
(*) HP Fellows are “pioneers in their fields, setting the standards for technical excellence and driving the direction of research in their respective disciplines”.