On Software Failure Replay. Q&A with Greg Law
Greg Law is the co-founder and CTO of Undo. He is a coder at heart, but likes to bridge the gap between the business and software worlds. Greg had 20 years’ experience in the software industry prior to founding Undo and has held development and management roles at companies including the pioneering British computer firm Acorn. Greg holds a PhD from City University, London and was nominated for the 2001 British Computer Society Distinguished Dissertation Award.
Q1. Shift-left testing – finding and fixing failures earlier in the development pipeline – is becoming standard practice for many engineering teams, but they are still relying on traditional methods of failure reproduction in order to fix them. What is the problem?
Shift-left testing enables teams to focus on problem prevention, and it is a really important piece of the puzzle. It allows developers to begin testing earlier in the build cycle where software failures are less costly to identify and fix. The expected result is better quality software shipped on time; however, as you said, it still relies on developers to manually reproduce each identified failure using the same methods they have always used.
But the critical thing about shift-left strategies that many overlook is that they only really deliver when your tests are rock solid and totally reliable. Even small unit tests can fail 1 time in a million, especially in multithreaded code, and developers need to track down every last failure. And of course, shift-left doesn’t mean relying 100% on unit tests – you still need integration tests, fuzzing, etc., and these often result in difficult-to-reproduce failures. No matter how far left we shift our testing, there will always be failures not found until pre-production or even production – that’s just the nature of software.
Q2. According to your own research, developers can spend up to 50% of their time reproducing failures, costing enterprise companies millions of dollars annually. Can you please elaborate on this?
Absolutely. We found that the amount of time developers spend debugging, in North America alone, costs the enterprise software market upwards of $61 billion annually; this equals a staggering 620 million developer hours per year. Imagine the additional innovation we could deliver with just a fraction of that time back!
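As a quick sanity check on how the two figures above relate, dividing one by the other gives the cost per developer hour that they imply. This is only arithmetic on the quoted numbers; the variable names are illustrative, and because the dollar figure is quoted as "upwards of", the result is best read as a lower bound.

```python
# Figures quoted in the answer above.
annual_cost_usd = 61e9   # "upwards of $61 billion annually"
annual_hours = 620e6     # "620 million developer hours per year"

# Implied cost of one developer hour spent debugging.
implied_hourly_cost = annual_cost_usd / annual_hours

print(f"Implied cost per developer hour: ${implied_hourly_cost:.2f}")
```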
Q3. If you can’t reproduce the failure, teams basically give up and ship the software with known defects as they still have to meet deadlines. What is the solution?
Because meeting deadlines is what gets incentivized, many development teams set aside or delegate debugging. Worse, some just give up and deploy the software knowing there are defects. Often the backlog of known defects or undiagnosed flaky tests just grows and grows. There certainly isn’t much emphasis placed on building a better mousetrap to solve the issue of reproducibility.
I believe that a new class of debugging technology called Software Failure Replay will prove to be critical for development teams aiming to reduce bug fix time and therefore accelerate software delivery.
Q4. What is Software Failure Replay?
Software Failure Replay (SFR) is a method of recording the execution of a software program as it fails and replaying the recording file forwards and backwards to quickly identify the root cause of the issue.
The recording captures bugs in the act. And when you replay it, you get to see exactly what your program did before it failed or behaved unexpectedly. Once you have the recording file, you no longer have to worry about trying to reproduce the failure again – it’s as if you have a 100% reproducible test case which you can replay any time, anywhere.
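The record-then-replay idea can be illustrated with a toy sketch. It is not how LiveRecorder or any real tool is implemented: it simply assumes that a program's only source of non-determinism is a stream of values, logs that stream on the failing run, and feeds the same stream back on replay. All names here (`Recorder`, `Replayer`, `flaky_program`) are hypothetical.

```python
import random


class Recorder:
    """Wraps a source of non-determinism and logs every value it returns."""

    def __init__(self, source):
        self.source = source
        self.log = []

    def next_value(self):
        value = self.source()
        self.log.append(value)
        return value


class Replayer:
    """Feeds back previously logged values in the same order."""

    def __init__(self, log):
        self._values = iter(log)

    def next_value(self):
        return next(self._values)


def flaky_program(inputs):
    """Hypothetical program that fails only when it draws one particular value."""
    total = 0
    for step in range(1000):
        value = inputs.next_value()
        if value == 13:
            raise RuntimeError(f"failed at step {step} with total {total}")
        total += value
    return total


def run_until_failure():
    """Re-run under the recorder until an intermittent failure is captured."""
    while True:
        recorder = Recorder(lambda: random.randint(0, 9999))
        try:
            flaky_program(recorder)
        except RuntimeError as error:
            return recorder.log, str(error)


# Capture the failure once...
recording, original_error = run_until_failure()

# ...then replay it as many times as needed; it fails the same way every time.
try:
    flaky_program(Replayer(recording))
    replayed_error = None
except RuntimeError as error:
    replayed_error = str(error)

print("original:", original_error)
print("replayed:", replayed_error)
```

The point of the sketch is the last step: once the log exists, the intermittent failure no longer depends on luck, because the replayed run sees exactly the inputs the failing run saw.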
It’s a bit like using security footage to solve a murder mystery. You’ll crack the case in a fraction of the time it takes to piece multiple clues together.
Q5. Can you give us an example of how you generate a recording of the failures?
The simplest method of capturing a recording is to just launch the failing application from the Undo LiveRecorder platform. This will capture everything the application does up until the point of failure, and then save the recording to disk for the subsequent replay and analysis of exactly what led up to the failure.
You can also connect LiveRecorder to an already running application and start recording from that point onwards. Alternatively, you can integrate LiveRecorder into the application binary as a library, which enables programmatic control to start/stop recordings automatically, or as directed through user interaction.
Q6. How does this help solve the issue of reproducibility?
When replaying a recording, developers can rewind to any program state. This gives them full visibility of that state (heap, stack, registers, program counter, system calls, etc.). The user has DVR-like controls to step forwards and backwards, rewind, and freeze-frame the code.
This enables developers to quickly identify the root cause of most software defects, such as race conditions, corruption defects, memory leaks, socket leaks, stack corruption, etc. No more “how did that happen??” – now you can just see. This isn’t an especially new idea; what is new is a generation of technology that can do this on commodity hardware (i.e. no specialised trace hardware needed) and with low enough overhead that it works on real-world complex applications.
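To make the DVR analogy concrete, here is a minimal sketch of stepping backwards and forwards through recorded program state. It assumes, purely for illustration, that a full copy of the state is stored at every step; real replay tools are far more economical than this, and the `Timeline` class and the bank-balance example are hypothetical.

```python
import copy


class Timeline:
    """Stores a snapshot of program state at each step and navigates between them."""

    def __init__(self):
        self.snapshots = []
        self.position = -1

    def record(self, state):
        # Deep-copy so later mutations do not alter recorded history.
        self.snapshots.append(copy.deepcopy(state))
        self.position = len(self.snapshots) - 1

    def current(self):
        return self.snapshots[self.position]

    def step_back(self):
        self.position = max(0, self.position - 1)
        return self.current()

    def step_forward(self):
        self.position = min(len(self.snapshots) - 1, self.position + 1)
        return self.current()

    def rewind_while(self, predicate):
        """Step backwards for as long as the previous snapshot satisfies predicate."""
        while self.position > 0 and predicate(self.snapshots[self.position - 1]):
            self.position -= 1
        return self.current()


# Hypothetical program: a balance that should never go negative.
timeline = Timeline()
state = {"step": 0, "balance": 100}
for step, delta in enumerate([20, -50, -200, 10, 5], start=1):
    state["step"] = step
    state["balance"] += delta
    timeline.record(state)

# The run ends in a bad state; rewind to the first step where it went wrong.
culprit = timeline.rewind_while(lambda snap: snap["balance"] < 0)
before = timeline.step_back()      # the last good state
after = timeline.step_forward()    # the step that broke the invariant

print("last good state:", before)
print("first bad state:", after)
```

Rewinding from the symptom to the first bad state, then stepping one frame either side of it, is the same workflow the answer describes, just on a five-step toy program.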
As a result, root cause detection time is significantly reduced so developers can get straight to debugging the recording artifact – reducing the number of loops in agile development cycles and therefore increasing development velocity.
Q7. How does it help to deploy fixes faster?
Simply put, the majority of time is spent finding the bug. Once it is reproduced, fixing it is relatively easy. If you can cut down the amount of time spent finding bugs, you can deploy a fix in significantly less time. This can be even more important when fixing production failures – we’ve all been there, where a customer is screaming because they keep getting hit by the same bug again and again, and the dev team can’t reproduce it or get a handle on it. Now if you can get a recording, it becomes almost trivial, and your customers have their issue fixed almost as soon as they’ve reported it.
Q8. Reproducing the failure is the single biggest challenge as it relies on guesswork and luck. Do you agree?
Well, traditional methods of reproducing the failure absolutely rely heavily on guesswork. As we like to call it, a game of 21 questions. But when you are able to generate a 100% reproducible test case (i.e. the recording file), all of that guesswork goes out the window and you can finally get some certainty around how long it’ll take to diagnose and fix the problem.
Q9. Anything else you wish to add?
Where and how Software Failure Replay is employed depends on the specific needs of the software engineering team. Our customers use our Software Failure Replay platform, LiveRecorder, across all stages of the development lifecycle. They use it both to unblock their CI/CD pipeline in development/test and to resolve customer issues faster in production.
The common thread throughout is the ability to reproduce the failure faster and deploy a fix once – and fast – cutting out all the waste around debugging. It’s funny, debugging totally dominates software development, but we almost never talk about it – it’s like the pain is so bad we’ve just learnt to accept it. Software never works first time. Most developers spend most of their time finding and fixing bugs (maybe before they’ve even merged the code, but still). Kernighan has this great quote that debugging is twice as hard as writing the code in the first place, so if you’re as clever as you can be when you write it, how will you ever debug it? Now, what he meant by that was: keep it simple. But it has an interesting corollary: debuggability of code is the limiting factor for software development. Whatever your metric for good is – how fast the code runs, how quickly you can write it, how small it is, how extensible it is, whatever – if you make the code twice as easy to debug, your code will be twice as good.
Sponsored by Undo.