On Software Quality. Q&A with Alexander Boehm and Greg Law.
Q1. What does it mean to improve software quality?
Alexander Boehm: Ultimately, the goal of any software company is to deliver software with the highest possible value to its customers. For system software such as database management systems, with millions of lines of code and hundreds of contributors, this quickly becomes very complex, and delivering software that is completely free of defects is virtually impossible. In this setting, the goal is to catch as many defects as possible, as early in the development process as possible. Consequently, every defect that is found and corrected before the software ships to customers improves its quality compared to its initial state. Finding these defects early not only increases customer satisfaction but also helps to reduce development costs, as fixing defects after a release entails expensive measures such as customer notifications and emergency bugfix releases.
Q2. How do you ensure code quality in SAP HANA?
Alexander Boehm: SAP HANA uses an enterprise-grade quality assurance process based on hierarchical continuous integration and a high degree of tooling and automation.
This includes industry best practices such as code review, code inspection tools, pre-submit functional tests, continuous performance and stress testing, end-user acceptance tests, and many more. We were also among the first to integrate capacity testing (performance and scalability) in the early stages of the CI pipeline – and to publish about it (in Kim-Thomas Rehmann, Changyun Seo, Dongwon Hwang, Binh Than Truong, Alexander Boehm, Dong Hun Lee: Performance Monitoring in SAP HANA’s Continuous Integration Process. SIGMETRICS Performance Evaluation Review 43(4): 43-52 (2016)).
Q3. How do you normally perform automated testing within SAP HANA?
Alexander Boehm: Our CI pipeline includes pre-submit functional and performance testing for every (subtopic) branch merge that is pushed to mainline. The test results are integrated into the code review tools, giving developers automated feedback on the (functional and non-functional) quality of the change set. If all tests pass, the change automatically advances to mainline.
Q4. What are the typical challenges for SAP developers to find and fix errors in the SAP HANA codebase?
Alexander Boehm: While our functional and performance tests have very high coverage of the HANA codebase (most of the code is heavily tested using extremely fast unit tests), some complex failure situations (e.g. missing or wrong synchronization primitives, performance hotspots) depend on specific boundary conditions that are hard to test and reproduce: a particular interleaving of concurrent operations, other timing-related issues, or even defects and anomalies in the underlying hardware or OS. Often, these situations need to be reproduced in order to add traces that uncover the root cause, or to examine the problematic code in a debugger for further analysis.
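To make the point about interleavings concrete, here is a minimal sketch that is not taken from the HANA codebase. It assumes two hypothetical threads that each increment a shared counter in three unsynchronized steps (read, add, write), and it enumerates every possible interleaving deterministically instead of relying on real thread timing. All names (`STEPS`, `run_schedule`, `all_schedules`) are illustrative.

```python
from itertools import combinations

# Each hypothetical "thread" performs the same three steps on a shared counter.
STEPS = ("read", "add", "write")


def run_schedule(schedule):
    """Execute one interleaving; `schedule` is a sequence of thread ids (0 or 1)."""
    shared = 0
    local = [0, 0]  # per-thread private copy of the counter
    pc = [0, 0]     # per-thread position in STEPS
    for tid in schedule:
        step = STEPS[pc[tid]]
        if step == "read":
            local[tid] = shared
        elif step == "add":
            local[tid] += 1
        else:  # "write"
            shared = local[tid]
        pc[tid] += 1
    return shared


def all_schedules():
    """Yield every interleaving of two threads with three steps each."""
    total = 2 * len(STEPS)
    for slots in combinations(range(total), len(STEPS)):
        yield tuple(0 if i in slots else 1 for i in range(total))


results = {schedule: run_schedule(schedule) for schedule in all_schedules()}
lost_updates = [s for s, value in results.items() if value != 2]
print(f"{len(lost_updates)} of {len(results)} interleavings lose an update")
# prints "18 of 20 interleavings lose an update"
```

In this toy model the failing schedules are easy to list. In a real system with many threads and far more steps, the space of interleavings is too large to enumerate, which is why such defects tend to surface only sporadically.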
Q5. Why do you use Fuzz testing? What are the benefits?
Alexander Boehm: We strongly believe in test-driven development and high code coverage, but we also believe that it is hard to come up with all possible combinations of input data that might trigger defects in the code. Specifically, the story of our colleagues at SQLite, with their remarkable test effort and quality metrics (e.g. 100% branch coverage) and their experience of fuzz testing nevertheless finding an equally remarkable number of severe defects, quickly motivated us to integrate fuzz testing into the CI pipeline as well. We strongly believe that fuzz testing, and stress tests of all kinds in general, add yet another channel for discovering defects before shipping, which is one of our highest priorities.
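As an illustration of the idea, and not of the fuzzing setup actually used for SAP HANA, the following sketch feeds random byte strings to a hypothetical length-prefixed parser (`parse_record`) that contains a deliberately planted defect. The harness (`fuzz`) treats `ValueError` as the parser's documented way of rejecting bad input and reports any other exception as a defect. Real fuzzers are typically coverage-guided and far more sophisticated; this is purely random input generation with a fixed seed so that runs are repeatable.

```python
import random


def parse_record(data: bytes) -> bytes:
    """Hypothetical length-prefixed parser with a deliberately planted defect."""
    if not data:
        raise ValueError("empty input")
    length = data[0]
    payload = data[1:1 + length]
    if len(payload) < length:
        raise ValueError("truncated payload")
    # Planted bug: assumes the payload always has at least one byte,
    # which is false when the length prefix is 0.
    if payload[0] == 0xFF and length > 3:
        raise ValueError("reserved marker")
    return payload


def fuzz(target, iterations=10_000, seed=0):
    """Call `target` with random inputs and collect unexpected exceptions."""
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    crashes = []
    for _ in range(iterations):
        size = rng.randint(0, 8)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except ValueError:
            pass  # documented rejection of malformed input
        except Exception as exc:  # anything else counts as a defect
            crashes.append((data, exc))
    return crashes


crashes = fuzz(parse_record)
print(f"{len(crashes)} crashing inputs found")
if crashes:
    data, exc = crashes[0]
    print(f"example: {data!r} -> {type(exc).__name__}")
```

A hand-written test suite can reach every branch of `parse_record` without ever passing a zero length prefix, which is the kind of gap between coverage and input combinations that the answer above describes.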
Q6. What is Undo’s record, rewind and replay technology?
Greg Law: Undo’s technology is kind of like a flight recorder for software. It captures the flow and the data of a program as it executes, creating a recording that can then be rewound and replayed to any point in its execution history. You can roll back to any instruction that executed and see the full program state (all of memory, registers, etc.) at any point in its history. Features such as reverse watchpoints allow you to wind back to the exact moment a piece of data was last modified (invaluable when tracking down memory corruption or race conditions).
Analysing a crash or a failed test is all about figuring out exactly what the software did – where it differed from your expectations. This technology takes away all the guesswork that is conventionally required, and allows the developer to see with ease exactly what happened, when and why.
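The notion of a reverse watchpoint can be sketched with a toy model. The `Recorder` class below is hypothetical and makes no claim about how Undo implements recording; it simply assumes that a full copy of a small state dictionary is stored after every step, whereas a real tool works at the level of machine instructions and memory. What it shows is the query itself: starting from a late point in the history, walk backwards to the step at which a given value last changed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Recorder:
    """Toy execution history: one full snapshot of the state per step."""
    snapshots: list = field(default_factory=list)

    def record(self, label: str, state: dict) -> None:
        # Copy the state so later mutations do not alter the history.
        self.snapshots.append((label, dict(state)))

    def state_at(self, step: int) -> dict:
        """Return the complete recorded state after the given step."""
        return dict(self.snapshots[step][1])

    def last_write(self, name: str, before: int) -> Optional[int]:
        """Reverse watchpoint: walk back from `before` to the step that last changed `name`."""
        for step in range(before, 0, -1):
            prev = self.snapshots[step - 1][1].get(name)
            curr = self.snapshots[step][1].get(name)
            if prev != curr:
                return step
        return None


rec = Recorder()
state = {"balance": 100, "limit": 500}
rec.record("start", state)
state["balance"] -= 30
rec.record("withdraw", state)
state["limit"] = 0  # the hypothetical corrupting write
rec.record("reset_limit", state)
state["balance"] += 10
rec.record("deposit", state)

step = rec.last_write("limit", before=3)
print(step, rec.snapshots[step][0])  # prints "2 reset_limit"
```

Starting from the final step, where `limit` is unexpectedly 0, the query lands directly on the step that wrote it, without re-running the program or guessing where to set a breakpoint.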
Q7. Why is it relevant for SAP HANA?
Alexander Boehm: There are some classes of bugs, such as sporadic defects caused by race conditions (see above), that are extremely hard to reproduce. Often, these defects are only uncovered by long-running, highly concurrent stress tests that are non-deterministic by nature. We invest heavily in these stress tests as well, with dedicated hardware clusters running massively parallel scenarios for hours or even days. Instead of wasting days or even weeks trying to reproduce defects found in such scenarios with a debugger attached, we can run the stress tests directly under Undo Live Recorder. If a defect is found, we can immediately use the recording to analyze it, and often fix it within a few minutes or hours.
Q8. How do you measure if the quality of software has improved in SAP HANA when using Undo’s Live Recorder?
Alexander Boehm: With Undo Live Recorder, we were able to dramatically cut down the analysis time required to understand the root cause of very complex software defects. Live Recorder does not directly improve software quality; rather, it (dramatically) cuts down the time needed to reproduce defects and fix them. As such, Live Recorder is more about significantly reducing development costs by making developers more productive (and making software development more fun by wasting less time on reproducing complex bugs) than about improving quality, which is still the job of the developers.
Q9. What are the benefits you have obtained in working together with Undo?
Alexander Boehm: As mentioned above, both the time needed to reproduce complex software defects and the complexity of analyzing these scenarios were reduced significantly. This allows our developers to focus on software development and on adding value for our customers, instead of spending time reproducing defects in complex scenarios.
Dr. Alexander Boehm is a database architect working on SAP's HANA in-memory database management system.
His focus is on performance optimization and holistic improvements of enterprise architectures, in particular application server/DBMS co-design. Prior to joining SAP, he received his PhD from the University of Mannheim, Germany, where he worked on the development of efficient and scalable applications using declarative message processing.