On Continuous Integration and Software Flight Recording Technology. Interview with Barry Morris
“The key challenge, however, is the cultural change required within software engineering teams to evolve to a state where any software failure, no matter how insignificant it may seem, is unacceptable. No single software engineer, or team, possesses all of the technical experience required to keep a CI pipeline functioning at this level. There must be a cross-disciplined commitment to work towards this goal throughout the development lifecycle in order to be effective.” –Barry Morris
I have interviewed Barry Morris, well-known serial entrepreneur and currently CEO at Undo. We talked about the challenges of delivering high-quality software at a productive level, the cost of persistent failures in Continuous Integration (CI) pipelines, and how Software Flight Recording Technology could help.
Q1. What are typical challenges software engineering teams face to deliver high quality software at a productive level?
Barry Morris: Reproducibility is the fundamental problem plaguing software engineering teams. The inability to rapidly, and reliably, reproduce test failures is slowing teams down. It blocks their development pipeline and prevents them from delivering software on time, and with confidence.
Organizations that can solve the issue of reproducibility are able to confidently deliver quality software on a scheduled, repeatable, and automated basis by eliminating the guesswork associated with defect diagnosis. The best part is that it does not require a complete overhaul of existing tool sets – rather, an augmentation of current practices.
The key challenge, however, is the cultural change required within software engineering teams to evolve to a state where any software failure, no matter how insignificant it may seem, is unacceptable. No single software engineer, or team, possesses all of the technical experience required to keep a CI pipeline functioning at this level. There must be a cross-disciplined commitment to work towards this goal throughout the development lifecycle in order to be effective.
Q2. Software failures are inevitable. Do you believe that the adoption of Continuous Integration (CI), as a key contributor to agile development workflows, is the solution?
Barry Morris: Despite the best efforts of software engineering teams, there are too many situational factors outside of their direct control that can cause the software to fail. As teams add new features, new processes, new microservices, and new threading to their code, the risk of unpredictable failures grows exponentially.
The adoption of CI as a key contributor to agile development workflows is on the rise. I believe it is the key to delivering software at velocity and offers radical gains in both productivity and quality. According to a recent survey conducted by Cambridge University, 88% of enterprise software companies have adopted CI practices.
Q3. It seems that the volume of tests being run as a result of CI leads to a growing backlog of failing tests. Is it possible to have a zero-tolerance approach to failing tests?
Barry Morris: Unfortunately, the volume of tests being run as a result of CI leads to a growing backlog of failing tests – ticking time bombs just waiting to go off – costing shareholders $1.2 trillion in enterprise value every year.
True CI requires a zero-tolerance approach to software failures. Tests must pass reliably, and any failure represents a new regression. Failures that show up only once every 300 runs, or only under extreme conditions, make this more challenging. The same survey also found that 83% of software engineers cannot keep their test suite clear of failing tests.
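To make the "once every 300 runs" figure concrete, here is a minimal sketch of the arithmetic behind reproducing such a failure by simply re-running the test. It assumes that runs are independent and that the failure rate is constant, which real intermittent failures often violate; the function names are illustrative and do not come from the interview.

```python
import math


def prob_at_least_one_failure(failure_rate: float, runs: int) -> float:
    """Chance that the failure shows up at least once in `runs` independent runs."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest number of independent runs that hits the failure with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


# Assumed failure rate: one failure per 300 runs, as in the example above.
rate = 1 / 300

for n in (1, 100, 300, 1000):
    chance = prob_at_least_one_failure(rate, n)
    print(f"{n:>5} runs: {chance:.1%} chance of seeing the failure")

print("runs needed for 95% confidence:", runs_needed(rate, 0.95))
```

Under these assumptions, even 300 re-runs catch the failure only about 63% of the time, which is one way to see why re-running a test is an unreliable reproduction strategy.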
Q4. You are offering a so-called Software Flight Recording Technology (SFRT). What is it and what is it useful for?
Barry Morris: SFRT enables software engineering teams to record and capture all the details of a program’s execution, as it runs. The recorded output allows the team to then wind back the tape to any instruction that executed and see the full program state at that point. Whereas static analysis provides a prediction of what a program might do, SFRT provides complete visibility into what a program actually did, line by line.
SFRT can speed up time-to-resolution by a factor of 10 by eliminating guesswork, using real, actionable data-driven insights to get to the crux of the issue faster. But the beauty of this kind of approach is that it is not simply a last line of defense against the most challenging defects (e.g., intermittent bugs, concurrency defects, etc.). Rather, it can be used to improve the time-to-resolution of all software failures.
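The general record-and-replay idea can be illustrated with a toy sketch. This is not how LiveRecorder or any Undo product works. It only logs the non-deterministic inputs of a small program, then replays them so the failing run repeats exactly and every intermediate state can be inspected. All names here (`Recorder`, `Replayer`, `flaky_job`) are hypothetical.

```python
import random


class Recorder:
    """Live source of non-deterministic values that logs everything it hands out."""

    def __init__(self):
        self._rng = random.Random()
        self.log = []

    def next_int(self, lo, hi):
        value = self._rng.randint(lo, hi)
        self.log.append(value)
        return value


class Replayer:
    """Feeds a recorded log back in order, so the run repeats exactly."""

    def __init__(self, log):
        self._values = iter(log)

    def next_int(self, lo, hi):
        return next(self._values)


def flaky_job(source):
    """Hypothetical program under test: fails only when a rare input arrives."""
    trace = []  # (step, input, running total) at every step
    total = 0
    for step in range(50):
        x = source.next_int(0, 99)
        total += x
        trace.append((step, x, total))
        if x == 42:
            return "FAIL", trace
    return "PASS", trace


# Run live until the intermittent failure happens, keeping the recording.
for attempt in range(10_000):
    recorder = Recorder()
    outcome, live_trace = flaky_job(recorder)
    if outcome == "FAIL":
        break

# Replay the recording: same inputs, same states, same failure, every time.
replayed_outcome, replayed_trace = flaky_job(Replayer(recorder.log))
print(replayed_outcome, "reproduced; last states:", replayed_trace[-3:])
```

The design point is that only the non-deterministic inputs need to be captured; everything else is recomputed identically during replay. A production recorder would have to capture far more (for example, system calls and thread scheduling), which this sketch does not attempt.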
Q5. Is SFRT the equivalent to a black box on an aircraft?
Barry Morris: Yes, absolutely.
Q6. When a plane crashes, one of the first things responders do is locate the black box on board. How does it relate to software failures?
Barry Morris: When a plane crashes, one of the first things responders do is locate the black box on board. This device tells them everything the plane did – its trajectory, position, velocity, etc. – right up until the moment it crashed. SFRT can do the same for software, allowing software engineering teams to view a recording of what a program was doing before, during, and after a defect occurs.
Q7. Who has already successfully used Software Flight Recording Technology to capture test failures?
Barry Morris: SAP HANA, a heavily multi-threaded, feature-rich, in-memory database, is built from millions of lines of highly-optimized Linux C++ code. To ensure the software is high-quality and reliable, the engineering team invested considerably in CI and employed rigorous testing methodologies, including fuzz-testing.
However, non-deterministic test failures could not reliably be reproduced for debugging. Analyzing logs from failed runs could not capture enough information to identify the root cause of specific failures; and reproducing complex failures on live systems was time-consuming. This was slowing development down.
LiveRecorder, Undo’s platform based on Software Flight Recording Technology, was implemented to capture test failures. Recording files of those failing runs were then replayed and analyzed. With LiveRecorder, engineers could see exactly what their program did before it failed and why – allowing them to quickly home in on the root cause of software defects.
As a result, SAP HANA was able to accelerate software defect resolution in development, by eliminating the guesswork in software failure diagnosis. On top of significantly reducing time-to-resolution of defects, SAP HANA engineers managed to capture and fix 7 high-priority defects – including a couple of race conditions, and a number of sporadic memory leaks and memory corruption defects.
Q8. What are the key questions to consider when developing CI success metrics?
Barry Morris: Every organization judges success differently. To some, finding a single, hard-to-reproduce bug per month is enough to deem changes to their CI pipeline effective. Others consider the reduction in the aggregate number of developer hours spent finding and fixing software defects per quarter as their key performance indicator. Speed to delivery, decrease in backlog, and product reliability are also commonly tracked metrics.
Whatever the success criteria, they should reflect the overarching goals of the larger software engineering team, or even corporate objectives. To ensure that teams measure and monitor the success criteria that matter most to them, software engineering managers and team leads should establish their own KPIs.
Some questions to consider when developing CI success metrics:
- Is code shipped earlier than previous deployments?
- How many defects are currently in the backlog compared to last week/month?
- Are developers spending less time debugging?
- Are other teams waiting for updates?
- How many developer hours does it take to find and fix a single bug?
- How long does it take to reproduce a failure?
- How long does it take to fix a failure once found?
- What is the average cost to the organization of each failure?
These questions are designed as an initial starting point. As mentioned earlier, each organization is different and places value on certain aspects of CI depending on team dynamics and needs. What’s important is to establish a baseline to ensure agreement and commitment across teams, and to benchmark progress.
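As a rough illustration of how such a baseline might be computed, the sketch below derives a few of the metrics from the questions above: time to reproduce, time to fix, developer hours per bug, and cost per failure. The record layout, the field names, the hourly rate, and the two sample failures are all made up for the example; a real team would pull these values from its own issue tracker and CI system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Failure:
    """One test failure, with hypothetical fields a team might track."""
    detected: datetime
    reproduced: datetime
    fixed: datetime
    developer_hours: float


def hours(delta: timedelta) -> float:
    """Convert a time difference to fractional hours."""
    return delta / timedelta(hours=1)


def baseline(failures, hourly_cost):
    """Summarize a list of failures into a few baseline CI metrics."""
    dev_hours = mean([f.developer_hours for f in failures])
    return {
        "mean_hours_to_reproduce": mean([hours(f.reproduced - f.detected) for f in failures]),
        "mean_hours_to_fix": mean([hours(f.fixed - f.reproduced) for f in failures]),
        "mean_developer_hours_per_bug": dev_hours,
        "mean_cost_per_failure": dev_hours * hourly_cost,
    }


# Made-up sample data, for illustration only.
sample = [
    Failure(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 15), datetime(2024, 1, 2, 9), 10.0),
    Failure(datetime(2024, 1, 3, 9), datetime(2024, 1, 3, 11), datetime(2024, 1, 3, 17), 4.0),
]

metrics = baseline(sample, hourly_cost=100.0)  # assumed hourly rate
for name, value in metrics.items():
    print(f"{name}: {value:.1f}")
```

Recomputing the same summary each week or month gives the benchmark against which progress can be compared.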
Barry Morris, CEO, Undo.
With over 25 years’ experience working in enterprise software and database systems, Barry is a prodigious company builder, scaling start-ups and publicly held companies alike. He was CEO of distributed service-oriented architecture (SOA) specialists IONA Technologies between 2000 and 2003 and built the company up to $180m in revenues and a $2bn valuation.
A serial entrepreneur, Barry founded NuoDB in 2008 and most recently served as its Executive Chairman. Barry was appointed CEO in September 2018 to lead Undo’s high-growth phase.
– Research Report: The Business Value of Optimizing CI pipeline. Judge Business School at the University of Cambridge, in partnership with Undo (link to download the report; registration required)
The research concluded three key findings:
- Adoption of CI best practices is on the rise. 88% of enterprise software companies say they have adopted CI practices, compared to 70% in 2015
- Reproducing software failures is impeding delivery speed. 41% of respondents say getting the bug to reproduce is the biggest barrier to finding and fixing bugs faster; and 56% say they could release software 1-2 days faster if reproducing failures wasn’t an issue
- Failing tests cost the enterprise software market $61 billion. This equals 620 million developer hours a year wasted on debugging software failures
– Improving Software Quality in SAP HANA, 2018
– Technical Paper: Software Flight Recording Technology, Undo (link: registration required to download the paper.)