On C++ Debugging. Interview with Greg Law
“Like it or not, debugging is part of programming. There is a lot of research and cool technology about preventing bugs (programming language features or design decisions that make certain bugs impossible) or catching bugs very early (through static or dynamic analysis or better testing), and all this is of course laudable and good stuff. But I’ve often been struck by how little attention is placed on making it easier to fix those bugs when they inevitably do happen.” — Greg Law
Q1: You are a prolific speaker at C++ conferences and podcasts. In your experience, who is still using C++?
Greg Law: C++ is used widely and its use is growing. I see a lot of C++ usage in Data Management, Networking, Electronic Design Automation (EDA), Aerospace, Games, Finance, etc.
Q2: In my interview with Bjarne Stroustrup last year, he spoke about the challenge of designing C++ in the face of contradictory demands of making the language simpler, whilst adding new functionality and without breaking people’s code. What are your thoughts on this?
Greg Law: I totally agree. I think all engineering is about two things – minimising mistakes and making tradeoffs (i.e. judgements). Mistakes might be a miscalculation when designing a bridge so that it won’t stand up or an off-by-one error in your program – those are clearly undesirable, we don’t want those. A tradeoff might be between how expensive the bridge is to build and how long it will last, or how long the code takes to write and how fast it runs.
But tradeoffs are relevant when it comes to reducing errors too – what price should we pay to avoid errors in our programs? How much extra time are we prepared to spend writing or testing it to get the bugs out? How far do we go tracking down those flaky 1-in-a-thousand failures in the test-suite? Are we going to sacrifice runtime performance by writing it in a higher-level and less error-prone language? Alternatively, we could choose to make that super-clever optimisation about which it’s hard to be confident it is correct today and even harder to be sure it will remain correct as the code around it changes; but is the runtime performance gain worth it, given the uncertainty that has been introduced? It’s counterintuitive, but actually there is an optimal bugginess for any program – we inevitably trade off cost of implementation and performance against potential bugs.
It’s probably fair to say however that most programs have more bugs than is optimal! I think it’s also true that human nature means we tend to under-invest in dealing with the bugs early, particularly flaky tests. We always feel “this week is particularly busy, I’ll part that and take a look next week when I’ll have a bit more time”; and of course next week turns out to be just as bad as this week.
Q3: I understand Undo helps software engineering teams with debugging complex C/C++ code bases. What is the situation with debugging C/C++? What are you seeing on the ground?
Greg Law: Like it or not, debugging is part of programming. There is a lot of research and cool technology about preventing bugs (programming language features or design decisions that make certain bugs impossible) or catching bugs very early (through static or dynamic analysis or better testing), and all this is of course laudable and good stuff. But I’ve often been struck by how little attention is placed on making it easier to fix those bugs when they inevitably do happen. The situation is not unlike medicine in that prevention is better than cure, and the earlier the diagnosis the better; but no matter what we do, we will always need cure (unlike medicine we have the balance wrong the other way round – in medicine we spend way too much on cure vs prevention!).
It’s all about tradeoffs again. All else being equal, we’d ensure there are no bugs in the first place; but all else never is equal, and how high a price can we afford on prevention? And actually if you make diagnosis and fixing cheaper, that further reduces how much you need to spend on prevention.
The harsh reality is that close to none of the software out there today is truly understood by anyone. Humans just aren’t very good at writing code, and economic pressure and other factors mean we add and fix tests until our fear of delivering late outweighs our fear of bugs. This is compounded as code ages; people move on from the project, bugs get fixed by adding a quick hack, further increasing the spaghettification. Like frogs in boiling water, we’ve kind of become so used to it that we don’t notice how awful it is any more!
People routinely just disable flaky failing tests because they can’t root-cause them. Over a third of production failures can be traced back directly or indirectly to a test that was failing and was ignored.
Q4: You have designed a time travel debugger for C/C++. What is it for?
Greg Law: Debugging is really answering one question: “what happened?”. I had certain expectations for what my code was going to do and all I know is that reality diverged from those expectations. Traditional debuggers are of limited help here – they don’t tell you what happened, they just tell you what is happening right now. You hit a breakpoint, you can look around and see what state everything is in, and either it looks all good or you can see something wrong. If it’s good, set another breakpoint and continue. If it’s bad… well, now you want to know what happened, how it became bad. The odds of breaking just at the right point and stepping your code through the badness are pretty long. So you run again, and again, if you’re lucky vaguely the same thing happens each time so you can home in on it; if not, well… you’re in trouble.
With a time travel debugger like UDB, it’s totally different – you see some piece of state is bad, you can just go backwards to find out why. Watchpoints (aka data breakpoints) are super powerful here – you can watch the bad piece of data and run backwards and have the debugger take you straight to the line of code that last modified it. We have customers who have been trying to fix something for literally years who with a couple of watch + reverse-continue operations had it nailed in an hour.
Time travel debuggers are really powerful for any bug where a decent amount of time passes between the bug itself and the symptoms (assertion failure, segmentation fault, bad results produced). They are particularly useful when there is any kind of non-determinism in the program – when the bug only occurs one time in a thousand and/or every time you run the program it fails at a different point in or a different way. Most race conditions are examples of this; so are many memory or state corruption bugs. It can also help to diagnose complex memory leaks. Most leak detectors or static analysis help with the trivial issues( say you returned an error and forgot to add a free) but not the hard ones (for example when you have a reference counting bug and so the reference never hits zero and the resources don’t get cleaned up).
This new white paper provides more insight into what kind of bugs time travel debugging helps with *. It’s not uncommon for software engineers to spend half their time debugging, so it’s a must-read for anyone who wants to increase development team productivity.
By the way, Time Travel Debugging is also sometimes known as Replay Debugging or Reverse Debugging.
Q5: Since you say it lets you see what happened, could it help with code exploration too?
Greg Law: Funny you say that. This is a use case it wasn’t initially designed for, but many engineers are using it to explore unfamiliar codebases they didn’t write. They use it to observe program behaviour by navigating forwards and backwards in the program’s execution history, examine registers to find the address of an object etc. They say there’s a huge productivity benefit in being able to go backwards and forwards over the same section of code until you fully understand what it does. Especially as you’re trying to understand a certain piece of code, and there are often millions of lines you don’t care about right now, it’s easy to get lost. When that happens you can go straight back to where you were and continue exploring.
Debugging is about answering “what did the code do” (ref. cpp.chat podcast on setting a breakpoint in the past **); but there are other activities that involve asking that same question. As I say, most code out there is not really understood by anyone.
Q6: What are your tips on how to diagnose and debug complex C++ programs?
Greg Law: The hard part about debugging is figuring out the root cause. Usually, once you’ve identified what’s wrong, the fix is quite simple. We once had a bug that sunk literally months of engineering time to root cause, and the fix was a single character – that’s extreme but the effect it’s illustrating is very common.
Identifying the problem is an exercise in figuring out what the code really did as opposed to what you expected. Somewhere reality has diverged from your expectations – and that point of divergence is your bug. If you’re lucky, the effects manifest soon after the bug – maybe a NULL pointer is dereferenced and you needed a check for NULL right before it. But more often that pointer should never be NULL, the problem is earlier.
The answer to this is multi-pronged:
1. Liberal use of assertions to find problems as close to their root cause as possible. I reckon that 50% of assert fails are just bogus assertions, which is annoying but cheap to fix because the problem is at the very line of code that you notice. The other 50% will save you a lot of time.
2. If you see something not right, do not sweep it under the carpet. This is sometimes referred to as ‘smelling smoke’. Maybe it’s nothing, but you better go and look and see if there’s a fire. When you’re smelling smoke, you’re getting close to the root cause. If you ignore it, chances are that whatever the underlying cause of the weirdness is, it will come back and bite you in a way that gives you much less of a clue as to what’s wrong, and it’ll take you a lot longer to fix it. Likewise don’t paper over the cracks – if you don’t understand how that pointer can be NULL, don’t just put a check for NULL at the point the segv happened.
This most often manifests itself in people ignoring flaky test failures. 82% of software companies report having failing tests that were not investigated that went on to cause production failures *** (the other 18% are probably lying!). Working in this way requires discipline – following that smell of smoke or fixing that flaky test that you know isn’t your fault will be a distraction from your proximate goal. But when something is not right, or not understood, ignoring it now is going to cost you a lot of time in the long run.
3. Provide a way to know what your code is really doing. The trendy term is observability. This can be good old printf or some more fancy logging. An emerging technique is Software Failure Replay, which is related to Time-Travel Debugging. Here you record the program execution (a failed process), such that a debugger can be pointed at the execution history and you can go back to any line of code that executed and see full program state. This is like the ultimate observability. Discovering where reality diverged from your expectations becomes trivial.
Dr Greg Law is the founder of Undo, the leading Software Failure Replay platform provider. Greg has 20 years’ experience in the software industry prior to founding Undo and has held development and management roles at companies, including Solarflare and the pioneering British computer firm Acorn. Greg holds a PhD from City University, London, and is a regular speaker at CppCon, ACCU, QCon, and DBTest.
** cpp.chat podcast – Setting a Breakpoint in the Past
*** Freeform Dynamics Analyst Report – Optimizing the software supplier and customer relationship
Follow us on Twitter: @odbmsorg