
On Debugging with AI. Interview with Mark Williamson

by Roberto V. Zicari on September 11, 2025

“Quality of code (and everything that goes along with it) isn’t talked about enough in AI conversations!  There are some obvious facets to this – does the code do what you intended?  Is it fast?  Does it crash in the common cases?”

Q1. Can AI write better code than humans?

Mark Williamson: I don’t think so, at least not today.  For one thing, LLM-based AIs are trained on pre-existing code, which was written by fallible humans.  So they at least have the potential to make all the mistakes we do.

Despite that, any coding AI you pick will write better frontend JavaScript than me – that’s not my area of expertise.  But I would back an experienced human (with or without AI assistance) to beat an unsupervised AI coder.

Can they beat humans some day?  I assume so – but they’re not doing it today.  And when you factor in other aspects of the Software Engineer’s job (such as building the right thing) it’s even more challenging.

Q2. How do you define “better” code?

Mark Williamson: Quality of code (and everything that goes along with it) isn’t talked about enough in AI conversations!  There are some obvious facets to this – does the code do what you intended?  Is it fast?  Does it crash in the common cases?

A lot of the work a human developer does to achieve this is actually achieved after the initial code is typed in.  There’s an iterative process of learning about and refining the solution – understanding what you’ve made and improving on it.  A lot of this is really debugging, in the broadest sense of the term: the code doesn’t do what you expected and you need to understand and fix it.

There’s another step beyond that, though – whether the code fits its intended purpose.  Getting that fit requires understanding the end user, thinking through the implementation tradeoffs and anticipating future developments.  For now, I see AI as freeing up some time so we can create space for those human insights.

Just focusing on how many lines of code we create is a pattern in the industry – we overvalue simply generating code versus all the other things that software engineers actually do.

Q3. Can AI write some types of code faster and with fewer simple errors?

Mark Williamson: Yes!

In my experience, I’ve found AI to be extremely useful in three scenarios:

  • Writing code that is almost boilerplate – where it’s not a copy-paste problem but requires quite routine changes.
  • Writing code that would be boilerplate for a different engineer – e.g. if I want to write JSON serialisation / deserialisation code in Python it’s easier for me to get an AI assistant to show me the shape of a good solution (see the sketch after this list).
  • Doing refactors that involve restructuring or applying a small fix in a lot of places – a coding agent can handle the detail while I concentrate on the overall shape.
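
For instance, a minimal sketch of the kind of Python JSON serialisation / deserialisation boilerplate mentioned above might look like this (the `User` type and its fields are purely illustrative):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class User:
    name: str
    email: str
    active: bool = True

def to_json(user: User) -> str:
    # Serialise the dataclass into a JSON string.
    return json.dumps(asdict(user))

def from_json(data: str) -> User:
    # Deserialise, letting the dataclass constructor catch unexpected fields.
    return User(**json.loads(data))

if __name__ == "__main__":
    original = User(name="Ada", email="ada@example.com")
    assert from_json(to_json(original)) == original
```

It is routine code of exactly this shape – correct but unremarkable – that an assistant can produce faster than a developer who doesn’t write it every day.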

In all these cases, the benefit is in reducing the amount of thinking required to figure out my design approach.  In his book Thinking, Fast and Slow, Daniel Kahneman describes two modes of thought: System 1 and System 2.  System 1 is the stuff you can just answer automatically, whereas System 2 thought requires effort.

System 2 is tiring – you probably can’t manage more than a couple of hours of really hard thinking about code in a day.  So it’s precious.  An agent lets me offload some work so I can focus that effort on exploring solutions to the real problem I’m trying to solve.

Q4. Large Language Model (LLM)-based AI code assistants are powerful tools, but they have significant limitations that developers must understand. What are such limitations?

Mark Williamson: The most obvious limitation is that they don’t know everything.  They often act as though they do, which is a trap.  “Hallucinations” are the most well-known consequence of this – in which the LLM gives an answer that is confident but ultimately not based in fact.

I like to say that modern AI training teaches a model what a good answer looks like – it has seen lots of examples of them, after all.  So, from an AI’s point of view, a good answer includes attributes like:

  • Projecting confidence.
  • Using the right terminology.
  • Relating suggestions specifically to your question and context.
  • Being right!

If it can satisfy most of those, it will think it has done a good job.  So when an AI is asked a question and lacks the facts, it will figure “3 out of 4 isn’t bad” and give a dangerously convincing answer that’s not based in reality.

There are two important things we can do to reduce this risk:

  • Supply high-quality context to the underlying model – the more relevant information available the better.  Supplying insufficient information invites the model to guess and supplying irrelevant information encourages it to head off on the wrong track.
  • Verify the model’s answers against a ground truth – run your tests, have experts review your code, verify the dynamic behaviour of the application matches what you expected.

You want to focus the model’s intelligence on solving the real problem (not on guessing), then know when it has actually solved it.
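
The simplest form of that ground truth is an automated check the AI’s answer must pass before it is trusted.  A minimal sketch (the function and the checks are hypothetical, just to show the idea):

```python
# Hypothetical example: an AI proposed a fix to parse_port(); we only accept
# the change if it still matches behaviour we already know to be correct.
def parse_port(value: str) -> int:
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port

def test_parse_port():
    assert parse_port("8080") == 8080
    for bad in ("0", "70000", "not-a-number"):
        try:
            parse_port(bad)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad!r}")

if __name__ == "__main__":
    test_parse_port()
    print("ground-truth checks passed")
```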

Q5. While LLM-based code assistants are incredibly powerful, there is critical information they lack that limits their effectiveness and makes human oversight essential. Why is this?

What does it mean in practice?

Mark Williamson: As a CTO, I’ll divide my answer into two parts:

  • As an engineer, LLMs don’t know enough about your code to solve all the problems you wish they could solve.  They typically don’t have good knowledge of the runtime behaviour of the system, which makes incorrect answers more likely.  And they’re not good at inferring design intent, making it harder to fix subtle bugs correctly.
  • As a product manager, LLMs lack insight into the true purpose of the software being built.  You cannot rely on them to design the code around the needs of end users, its long-term evolution and maintenance, or the business tradeoffs involved.

Q6. LLMs are brilliant at static analysis—interpreting the text of a codebase, logs, and other documents. But they are blind to dynamic behavior. This is the critical information they lack and cannot get. Why? Do you have a solution for this problem?

Mark Williamson: Coding agents have a similar weakness to humans: they can’t see what the program really did at runtime and it’s hard to reason about why things happened.  They can get some of this from logs (and LLMs are really good at reading logs!) but logging can only capture so much.

There’s a catch-22 here for the developer: if you’d been able to predict precisely what logging you’d need to fix the bug you’re investigating, then you’d have known enough to avoid the bug in the first place.  There’s no reason to think that’s different for LLMs.

Coding agents can follow the same tedious loop that humans do: adding more logging to a codebase and running stuff again (or perhaps asking a human to obtain more logs some other way).

They can even do this toil more enthusiastically than any human!  But the speed you gained from the agent may just disappear into a swamp of rebuilding, attempting to reproduce, finding which logging statements are still missing and then repeating the process.  This kind of inefficiency is bad news for any Engineering department hoping to improve productivity in return for its AI spend.

Q7. It seems that time travel debugging (TTD) directly addresses this limitation. Please tell us more.

Mark Williamson: Time travel debugging captures a trace of everything a program does during execution.  The resulting recordings effectively represent the whole state of memory at every machine instruction the program executed.

Anything you want to know about the program’s runtime behaviour can then be queried from the recording, without needing to re-run or change the code.  Rare bugs become fully reproducible and any state can be explored in detail.  Moreover, the ability to rewind time makes it easy to explore why a bad state arose, not just what the state was.

Of course, storing all of memory at every point in execution time would be extremely inefficient!  A modern, scalable time travel debugger stores only information that flows into the program (initial memory state, IO from disk and network, system call results, non-deterministic CPU instructions, etc.).  This makes it possible to efficiently recompute all other state on demand.  Watch the talk “How do Time Travel Debuggers Work?” for the full details on how a modern time travel debugger is built.
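
To make that record-and-recompute idea concrete, here is a toy Python sketch – not how Undo’s engine is actually implemented – in which only the non-deterministic inputs are captured and everything else is recomputed on replay:

```python
import random
import time

class Recorder:
    """Toy record/replay: capture only non-deterministic inputs."""

    def __init__(self, trace=None):
        self.trace = [] if trace is None else list(trace)
        self.replaying = trace is not None

    def nondeterministic(self, produce):
        # Recording: call the real source and log the result.
        # Replaying: return the logged value instead.
        if self.replaying:
            return self.trace.pop(0)
        value = produce()
        self.trace.append(value)
        return value

def run(rec: Recorder) -> int:
    # Everything downstream is deterministic, so it can be recomputed
    # on demand from the recorded inputs rather than stored.
    seed = rec.nondeterministic(lambda: random.randrange(1000))
    now = rec.nondeterministic(lambda: int(time.time()))
    return (seed * 31 + now) % 97

recording = Recorder()
first = run(recording)
replay = run(Recorder(trace=recording.trace))
assert first == replay  # identical state recomputed from recorded inputs
```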

For an AI, this capability is ideal.  Remember that we need high-quality context to feed the model and a ground truth to make sure its answers are based in reality.  With time travel debugging, a coding agent has access to a recording of the program’s dynamic state and can drill down in detail on any suspicious behaviours – that gives us high-quality context.  The ground truth comes from the deterministic nature of the recording and also makes it possible to verify the AI’s findings.

These properties mean that AI coding agents get smarter when given access to a time travel debugging system.

Q8. You have released an add-on extension called explain, which integrates with your UDB debugger (part of the Undo Suite). What is it and what is it useful for?

Mark Williamson: Good question. Let me explain first what Undo is to set the context. It’s our time travel debugging technology (which runs on Linux x86 and ARM64) and is mostly used to debug complex enterprise software that makes use of advanced multithreading techniques, shared memory, direct device accesses, etc.

The Undo Suite captures precise recordings of unmodified programs using just-in-time binary instrumentation.  Its two main components are:

  • LiveRecorder – which captures program executions into portable recording files.
  • UDB – which provides a GDB-compatible interface to debug both live processes and recordings (but also integrates into IDEs such as VS Code).

The explain extension is our first step in integrating AI with a time travel debugging system.  It provides two pieces of functionality:

  • An MCP (Model Context Protocol) server – this exports the functionality of our UDB debugger for use by an AI agent, allowing it to integrate into existing AI workflows including agentic IDEs (such as VS Code with Copilot, Cursor or Windsurf).
  • The explain command itself, which provides additional tight integration with terminal-based coding agents (such as Claude Code, Amp and Codex CLI) where available.

In either case, we’re providing the power of time travel debugging to an AI, so that it can reason about the dynamic behaviour of a program.  As the name suggests, this extension has a particular focus on explaining program behaviour – how a given state arose, why the program crashed, etc.

We provide a carefully-designed set of tools to the agent so that it can answer these questions effectively. It’s important that the design of the MCP tools guides the actions to be taken by the LLM, otherwise it can easily get overwhelmed by the complexity.
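
As a rough illustration of that principle, a narrowly-scoped MCP tool built with the official MCP Python SDK might look like the sketch below.  The tool name and its behaviour are hypothetical stand-ins, not the actual tools exported by explain:

```python
# Hypothetical sketch using the MCP Python SDK (FastMCP); the tool below is
# an illustrative stand-in, not one of the real tools shipped with explain.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("toy-debugger")

@mcp.tool()
def explain_crash(recording: str) -> str:
    """Summarise why the recorded program ended in a bad state."""
    # A real implementation would query the time travel debugger here;
    # this stub only shows the shape of a narrow, purpose-built tool.
    return f"No analysis available for {recording} in this sketch."

if __name__ == "__main__":
    mcp.run()  # serve the tool over stdio for an agent to call
```

The point of keeping tools narrow like this is that the agent spends its reasoning budget on the bug, not on working out how to drive a general-purpose debugger.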

In an agentic IDE you can connect to the MCP server in a running UDB session – then ask the agent questions (use the /explain prompt exported by the server for best results).  In UDB itself, you can just type the explain command and we’ll automatically invoke your preferred terminal coding agent and put it to work on your problem.

Q9.  Can you show us an example of how time traveling with an AI code assistant works in practice?

Mark Williamson: Sure! I’d recommend watching these two demo videos:

  1. The cache_calculate demo video on the Undo website which showcases how to use explain to get AI to tell you what has gone wrong in the program.
  2. This YouTube video where I use AI + time travel debugging to explore the codebase of the legendary Doom game and understand exactly what the program did when I played it.

We have additional demos, showcasing more advanced functionality, which aren’t yet public – you can book a personalised demo from https://undo.io/products/undo-ai/ to see the more advanced AI debugging functionality we’re currently building.

Qx. Anything else you wish to add?

Mark Williamson: The core message here is that AI-Augmented Software Engineers still need the right tools to do their jobs well.  Our goal is to make AI coding agents more effective at understanding and fixing complex code, improving the return on investment Engineering departments get on their AI stack.

The next big step for us will be designing a UX to be used by AIs instead of by humans.  Providing time travel debugging to a coding agent is already useful, but to get the best performance we need to work with what LLMs are good at.  In other words:

  • A query-like interface: rather than the statefulness of a debugger, LLMs are happiest when they can ask Big Questions and get a report in answer.  Our engine lets us extract detailed information very quickly from a recording so that an AI can start with an overview, then drill down.
  • Specialised, composable tools: a debugger provides quite general tools (stepping, breakpoints, etc) for a human developer to apply to any problem.  Coding agents can use these but we believe LLM intelligence is best spent on solving the core problem well, rather than diluting it on planning complex tool use.  A specialised set of analyses will allow the LLM to focus on what it’s good at – finding patterns and proposing fixes.
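
As a purely illustrative sketch of that query-like shape – not Undo AI’s actual interface – the agent asks one big question and gets back a structured report it can drill into:

```python
# Purely illustrative: one "big question" in, one structured report out,
# instead of the agent driving stepping and breakpoints itself.
from dataclasses import dataclass, field

@dataclass
class Finding:
    time: int          # position within the recording (illustrative)
    summary: str

@dataclass
class Report:
    question: str
    findings: list[Finding] = field(default_factory=list)

def ask(question: str) -> Report:
    # A real backend would run specialised analyses over the recording;
    # here a canned report just shows the shape of the interface.
    return Report(question, [Finding(12345, "buffer written after free")])

report = ask("Why did the server crash at shutdown?")
for f in report.findings:
    print(f.time, f.summary)
```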

On top of these tools and the data contained within our recordings, we are building Undo AI – a product to enable agentic debugging at enterprise scale.  We’re currently taking applications for our pilot program – please get in touch to find out more at undo.io.

……………………………………………

Mark Williamson, Chief Technical Officer, Undo

After a few years as our Chief Software Architect, Mark is now acting as Undo’s CTO. Mark loves developing new technology and getting it to people who can benefit. He is a specialist in kernel-level, low-level Linux and embedded development, with wide experience in cross-disciplinary engineering.

In his previous role, his remit was to align the product’s architecture with the company’s needs, provide technical and design leadership, and lead internal quality work. One of his proudest achievements is his quest towards an all-green test suite!

As Undo’s CTO, Mark’s primary responsibility is to scale product-market fit and ensure we take our products in the right direction to meet the needs of a broader spectrum of customers.

Mark is also an author on Medium, a conference speaker, and a new homeowner enjoying the delights of emergency home repairs!

………………………..
