On Software flight recording. Q&A with Jonathan Harris
VP of Product at Undo, Jonathan Harris, just published a new technical paper on Resolving Concurrency Defects – Best Practices. In this interview, I talk to Jonathan about concurrent processing and the issues that arise from developing multiprocess or multithreaded systems.
Q1. What is concurrent computing? And what is it useful for?
Concurrent computing is a set of techniques where multiple threads of execution proceed at the same time. Those threads may be contained within a single process (“multi-threading”), within multiple processes on the same physical or virtual machine (“multiprocessing”), or within multiple processes on different machines (“distributed multiprocessing”). These techniques are often combined – e.g. an application distributed across multiple machines where each process is multi-threaded.
A traditional single-threaded application is limited to the speed of the CPU core on which it is executing. In contrast, concurrent computing allows an application to scale beyond the limits of a single CPU core and/or single machine.
Q2. What are the main “defects” that often arise in concurrent processing scenarios?
Multi-threading, the most ubiquitous concurrency technique, introduces the potential for several new kinds of defects. For example:
- Atomicity violation – In database systems, “atomicity” refers to a sequence of operations that should either all occur or not occur at all. In concurrent computing, “atomicity” refers to a sequence of operations that should not be interrupted. A common technique to guarantee the atomicity of a sequence of operations is to protect the operations with a mutual exclusion object (“mutex”). An atomicity violation occurs where the developer forgets to do so, or fails to ensure that every other thread touching the same data also obtains the mutex.
- Deadlock and livelock – A deadlock occurs when two or more threads are each waiting on resources (such as a mutex or other “lock”) held by the others, such that none can proceed. A livelock is similar, except that the threads can still execute – e.g. obtaining and releasing those resources – but cannot make effective progress.
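To make the atomicity violation concrete, here is a minimal Python sketch. The `Account` class and its method names are hypothetical, invented for this illustration. A barrier is used purely as a demonstration device to force the unlucky interleaving every time; in a real program the same interleaving would only happen occasionally.

```python
import threading


class Account:
    """Hypothetical shared object used only for this illustration."""

    def __init__(self, balance):
        self.balance = balance
        self.mutex = threading.Lock()

    def withdraw_unprotected(self, amount, both_checked):
        # Check-then-act with no mutex: the check and the update are two
        # separate steps, so another thread can run in between them.
        if self.balance >= amount:
            both_checked.wait()  # demo only: force the unlucky interleaving
            self.balance -= amount
            return True
        return False

    def withdraw_protected(self, amount):
        # The mutex turns the check and the update into one uninterruptible unit.
        with self.mutex:
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False


def run_two_withdrawals(withdraw, *args):
    """Run the same withdrawal on two threads and collect both results."""
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(withdraw(*args)))
        for _ in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


unprotected = Account(100)
barrier = threading.Barrier(2)
unprotected_results = run_two_withdrawals(
    unprotected.withdraw_unprotected, 100, barrier
)

protected = Account(100)
protected_results = run_two_withdrawals(protected.withdraw_protected, 100)

print("unprotected:", unprotected_results, "balance", unprotected.balance)
print("protected:  ", sorted(protected_results), "balance", protected.balance)
```

In the unprotected case both withdrawals report success even though the account could only cover one of them. With the mutex, exactly one succeeds.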
The effects of these kinds of defects are compounded by the fact that multi-threading and multiprocessing introduce the possibility of non-deterministic behavior – where the application behaves differently on successive runs due to small timing differences resulting from differences in the kernel’s thread scheduling, speed of I/O etc.
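A lock-ordering deadlock is a good example of this timing sensitivity: two threads that take the same pair of locks in opposite orders will only get stuck if each grabs its first lock before the other grabs its second. The Python sketch below is an illustration under stated assumptions, not a recipe. The function names are made up, a barrier forces the bad timing so the result is repeatable, and a short `acquire` timeout stands in for "blocked forever" so the script terminates.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()


def take_in_given_order(first, second, rendezvous, outcome, name):
    """Acquire two locks in the order given; record whether the second was obtained."""
    with first:
        rendezvous.wait()  # demo only: both threads now hold their first lock
        got_second = second.acquire(timeout=0.2)  # timeout stands in for a hang
        if got_second:
            second.release()
        outcome[name] = got_second
        rendezvous.wait()  # keep holding `first` until both attempts are over


def take_in_fixed_order(outcome, name):
    """Every thread takes lock_a before lock_b, so no cycle of waiters can form."""
    with lock_a:
        with lock_b:
            outcome[name] = True


def run(threads):
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Opposite acquisition order: each thread ends up waiting for the other's lock.
inverted = {}
rendezvous = threading.Barrier(2)
run([
    threading.Thread(target=take_in_given_order,
                     args=(lock_a, lock_b, rendezvous, inverted, "t1")),
    threading.Thread(target=take_in_given_order,
                     args=(lock_b, lock_a, rendezvous, inverted, "t2")),
])

# Consistent acquisition order: both threads complete.
ordered = {}
run([
    threading.Thread(target=take_in_fixed_order, args=(ordered, "t1")),
    threading.Thread(target=take_in_fixed_order, args=(ordered, "t2")),
])

print("opposite order:", inverted)
print("fixed order:   ", ordered)
```

Without the barrier, the opposite-order version would usually run to completion and only occasionally hang, which is exactly what makes this kind of defect hard to reproduce.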
Distributed multiprocessing adds the potential for a further category of defect – where each individual process in the distributed application seems to be behaving correctly, but the final outcome is incorrect.
Q3. With the current state of technology, how do you find and fix concurrency defects?
Concurrency defects are notoriously difficult to diagnose and fix. The non-deterministic nature of concurrent processing means that it is often extremely difficult to engineer a test case where the developer can even reliably reproduce the issue, which is a prerequisite for starting to fix the issue.
Typically, your starting point is a program that enters a confusing state or simply crashes. From this starting point, developers traditionally perform the following steps to find the cause of a concurrency defect and correct it:
- Recreate the buggy behavior, which can be time-consuming if the code is running remotely, on a customer’s site
- Hypothesize a cause
- Log application state extensively to revise or validate the hypothesis
- Identify the data structure being affected by the concurrency defect
- Search the code for the parts of the program that change the data structure – while this is methodical, it is also painstaking, expensive and prone to human error, since it is easy to miss the bug after hours of reading code
- Step through the code, using breakpoints, to catch the defect as it happens
- Correct the code
These steps can take days or weeks. In fact, developers can spend as much as 50% of their time debugging instead of creating new code. What’s worse is that since concurrency defects can be so hard to reproduce, causes often go unfound and uncorrected.
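As a rough idea of what the logging step in the traditional workflow looks like, here is a minimal Python sketch that tags every log line with the thread name and a relative timestamp around each change to a shared structure. The `pending` list, the `producer` function and the logger name are hypothetical, and the in-memory handler is used only so the lines can be inspected afterwards.

```python
import logging
import threading


class ListHandler(logging.Handler):
    """Keeps formatted log lines in memory so they can be inspected afterwards."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


handler = ListHandler()
handler.setFormatter(
    logging.Formatter("%(relativeCreated)8.1fms %(threadName)s %(message)s")
)
log = logging.getLogger("shared-state-demo")
log.setLevel(logging.DEBUG)
log.addHandler(handler)
log.propagate = False

pending = []                      # the shared data structure under suspicion
pending_lock = threading.Lock()


def producer(item):
    with pending_lock:
        log.debug("before append: len=%d", len(pending))
        pending.append(item)
        log.debug("after append: len=%d item=%r", len(pending), item)


threads = [
    threading.Thread(target=producer, args=(i,), name=f"producer-{i}")
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

for line in handler.lines:
    print(line)
```

The drawback is the one described above: adding logging changes timing, and each new hypothesis tends to mean another round of editing, rebuilding and rerunning.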
Q4. Are there any better alternatives for reproducing and resolving concurrency defects?
There are a number of new techniques for reproducing and resolving concurrency defects:
- Software flight recording
- Multi-process correlation
- Reverse debugging
- Thread fuzzing
These techniques help developers find more concurrency issues and resolve them 10x faster.
Q5. What is software flight recording and what is it useful for?
Software flight recording captures a program’s execution into a debuggable recording file which can be replayed at a later date, on the same or on a different machine, where it will behave in exactly the same way. The technology is akin to an aircraft black box recorder, but instead of recording the aircraft’s trajectory, position and velocity, it records what your software was doing in sufficient detail that the execution can be reproduced exactly.
Software flight recording therefore allows a developer to capture software failures ‘in the act’, which eliminates the debugging time spent recreating the problem, hypothesizing causes, scanning code, or logging variables.
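The underlying record-and-replay idea can be shown with a toy Python sketch: if every non-deterministic input a program consumes is logged, feeding the log back in reproduces the run exactly. This is only a conceptual illustration. The `InputLog` class and the `simulate_orders` workload are hypothetical, and a real recorder has to capture far more than a handful of random numbers.

```python
import json
import random


class InputLog:
    """Records every non-deterministic value a program consumes, or replays them."""

    def __init__(self, recorded=None):
        self.replaying = recorded is not None
        self.events = list(recorded) if self.replaying else []
        self.position = 0

    def next_value(self, live_source):
        if self.replaying:
            # Replay mode: ignore the live source and return the recorded value.
            value = self.events[self.position]
            self.position += 1
            return value
        value = live_source()
        self.events.append(value)
        return value


def simulate_orders(log):
    """Hypothetical workload whose outcome depends on unpredictable inputs."""
    stock = 50
    history = []
    for _ in range(10):
        wanted = log.next_value(lambda: random.randint(1, 12))
        if wanted > stock:
            history.append(("rejected", wanted, stock))
            continue
        stock -= wanted
        history.append(("filled", wanted, stock))
    return history


# Original run: inputs come from the live source and are saved as they arrive.
recording = InputLog()
original_run = simulate_orders(recording)
saved = json.dumps(recording.events)

# Later, possibly on another machine: feed the saved inputs back in.
replayed_run = simulate_orders(InputLog(json.loads(saved)))

print("replay matches original:", replayed_run == original_run)
```

Because the replayed run never consults the live random source, it follows the same path as the original however many times it is rerun.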
Q6. What is thread fuzzing? Can it help reveal concurrency issues during testing?
Wikipedia describes “fuzzing” as “an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a computer program”. Thread fuzzing is a form of fuzzing where thread scheduling is randomised, or otherwise interfered with. This can make concurrency bugs that are rare under normal conditions statistically more common.
When used in conjunction with software flight recording, the defects revealed by thread fuzzing can be recorded, replayed and diagnosed.
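The effect of randomising the schedule can be sketched in a few lines of Python. To keep the result repeatable, the two "threads" here are simulated as generators and a seeded random number generator plays the role of the scheduler. That is an assumption of this sketch, not a description of how a real thread fuzzer perturbs the operating system's scheduling. All names are hypothetical.

```python
import random


def increment(shared):
    """One simulated thread: a read and a write with a scheduling point between."""
    value = shared["counter"]       # read
    yield                           # the scheduler may switch threads here
    shared["counter"] = value + 1   # write


def run_with_schedule(seed):
    """Interleave two simulated threads in an order chosen by a seeded RNG."""
    rng = random.Random(seed)
    shared = {"counter": 0}
    runnable = [increment(shared), increment(shared)]
    while runnable:
        thread = rng.choice(runnable)
        try:
            next(thread)
        except StopIteration:
            runnable.remove(thread)
    return shared["counter"]


# Try many schedules; any total other than 2 means an increment was lost.
outcomes = {seed: run_with_schedule(seed) for seed in range(50)}
failing_seeds = [seed for seed, total in outcomes.items() if total != 2]

print(f"{len(failing_seeds)} of {len(outcomes)} schedules lost an update")
```

Each failing seed identifies a schedule that can be run again on demand, which is a small-scale version of pairing fuzzing with a recording.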
Q7. For hard-to-reproduce concurrency bugs like race conditions, is reverse debugging useful? If yes, how?
Reverse debugging is the ability of a debugger to stop after a failure in a program has been observed and go back into the history of the execution to uncover the reason for the failure.
For hard-to-reproduce concurrency bugs like race conditions, reverse debugging allows you to start from the point of failure and step backwards to find the cause. This is a very different approach from the typical process of running and rerunning a program again and again, adding logging logic as you go until you find the cause.
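As a toy model of working backwards from a failure, the Python sketch below keeps a history of every write to some shared state and then walks that history in reverse to find the last writer of a corrupted value. It is single-threaded and entirely hypothetical (`TracedState`, the writer names and the values are invented). It only illustrates the question a reverse debugger lets you ask: who last changed this, and when?

```python
class TracedState(dict):
    """Dictionary that remembers every write so the history can be walked backwards."""

    def __init__(self):
        super().__init__()
        self.history = []  # entries: (step, writer, key, old_value, new_value)

    def write(self, writer, key, value):
        self.history.append((len(self.history), writer, key, self.get(key), value))
        self[key] = value

    def last_write_before(self, step, key):
        """Walk backwards from `step` to the most recent write to `key`."""
        for entry in reversed(self.history[:step]):
            if entry[2] == key:
                return entry
        return None


state = TracedState()
state.write("loader", "buffer_len", 64)
state.write("worker-1", "cursor", 10)
state.write("worker-2", "buffer_len", 0)   # the hypothetical bug
state.write("worker-1", "cursor", 20)

# The failure is observed here: the cursor is beyond the end of the buffer.
failure_step = len(state.history)
failed = state["cursor"] > state["buffer_len"]

# Start at the point of failure and step backwards to the offending write.
culprit = state.last_write_before(failure_step, "buffer_len")
print("failure observed:", failed)
print("last write to buffer_len:", culprit)
```

The search starts from the symptom and goes straight to the write that caused it, with no rerunning and no added logging.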
Jonathan Harris, VP of Product, Undo
Jonathan oversees Undo’s products. He is passionate about providing Undo’s customers with cutting-edge products that deliver an enchanting user experience.
Jonathan has over twenty years’ experience in the software industry in both development and marketing roles, including as a developer at Acorn and Psion, an Engineering Manager and Product Manager at Symbian Ltd, and CTO and Product Strategist at Tizen Association.
In his spare time Jonathan still manages to find some time to code for flight simulation and other 3D environments, and enjoys traipsing over dusty Roman ruins in Europe and North Africa – particularly those ruins that are close to the “sherry triangle” or to Porto.
Sponsored by Undo.