On Software Reliability. Interview with Barry Morris and Dale Vile.
” When software reliability issues creep up in production, it’s a finger-pointing moment between suppliers and users. Usually, what’s missing is simple: information. ” –Barry Morris.
“Most organisations also suffer from more continuous disruption caused by a steady stream of less dramatic issues. Intermittent software problems particularly cause a lot of user frustration and dissatisfaction.” — Dale Vile.
I have interviewed Barry Morris, CEO of Undo and Dale Vile, Distinguished Analyst Freeform Dynamics. Main topic of the interview is enterprise software reliability. This interview relates to a recent research on the challenges and impact on troubleshooting software failures in production, conducted by Freeform Dynamics.
Q1. How often software-related failures occur in the enterprise?
Dale Vile: When we hear the term ‘software failure’, we tend to think of major incidents that bring down a whole department or result in significant data loss. Our study suggests that this kind of thing happens around once every couple years on average in most organisations – at least that’s what people admit to when surveyed. The research also tells us, however, that most organisations also suffer from more continuous disruption caused by a steady stream of less dramatic issues. Intermittent software problems particularly cause a lot of user frustration and dissatisfaction.
Q2. What are the common reasons why major system failures and/or incidents leading to loss of data are top of the list when it comes to the potential for damage and disruption?
Dale Vile: Software is now embedded in most aspects of most businesses. A telling observation is that over the years, the percentage of applications considered to be business critical has steadily increased. At the turn-of-the-century, it was usual for organisations to tell us that around 10% of their application portfolio was considered critical. Nowadays, it’s more likely to be 50% or more. This is why it’s so disruptive and potentially damaging when software failures occur – even relatively brief or minor ones.
Barry Morris: The study we commissioned shows that 83% of enterprise customers consider data corruption issues to be highly disruptive to their business. In the database business, that’s probably closer to 100%. Take SAP HANA, Oracle, Teradata or other data management system vendors: they have clients paying them millions of dollars per year for a reliable and predictable system. Consequences are high if the wrong row is returned, there’s a memory corruption issue, or data goes missing. These types of clients have little tolerance that. At best, your reputation in the industry and software renewals will be on the line. At worst, you’re talking about plummeting stock prices wiping off a few millions off the value of your business.
Q3. What are the most important challenges to achieve software that runs reliably and predictably?
Dale Vile: It starts with software quality management in the development or engineering environment. Most of the challenges we see here are to do with adjusting testing and quality management processes to cope with modern approaches such as Agile, DevOps and Continuous Delivery. A lot of people now refer to ‘Continuous Testing’ in this context, and understandably put a lot of emphasis on automation. But even software makers are on a journey here. Our research tells us that few have it fully figured out at the moment. Beyond this, effective testing in the live environment is also essential.
The problem here, though, is that the complex and dynamic nature of today’s enterprise infrastructures makes it very hard or impossible to test every use case in every situation. And even if you could, subsequent changes to the environment, which an application team may not even be aware of, could easily interfere with the solution and cause instability or failure. There’s a lot to think about, and quality management is only the start.
Q4. What factors are influencing users’ satisfaction and confidence with respect to software?
Dale Vile: Confidence and satisfaction stem from users and business stakeholders perceiving that those responsible are working together competently and effectively to resolve issues when they occur in a timely manner. A fundamental requirement here is openness and honesty, and a willingness to take responsibility. Defensiveness, evasion and finger-pointing, however, tend to undermine confidence and satisfaction. Such behaviour can be cultural; but very often it’s more a symptom of inadequate skills, processes and/or tools within either the supplier or the customer environment. When such shortfalls in capability exist, the inevitable result is an elongated troubleshooting and resolution cycle. This is the real killer of confidence and satisfaction.
Barry Morris: When software reliability issues creep up in production, it’s a finger-pointing moment between suppliers and users. Usually, what’s missing is simple: information. Right now, to obtain that information, suppliers ask 20 questions: what did you do, how did you do it, in what environment and so on. There’s a long period of communication & diagnostic, which is frustrating and time-wasting on both sides. That supplier/user relationships at that moment of firefighting would be massively improved if there was data on the table and engineers could just get on with fixing the problem. I see data-driven defect diagnostic as the key to improving customer satisfaction.
Q5. How effective is software quality management in the enterprise?
Dale Vile: I’ll answer the question in relation to software *reliability* management, which is a function of inherent software quality, effective implementation, and competent operation and support thereafter. We generally find that each group or team tends to do reasonably well in their specific area; but challenges often exist because the various silos I disconnected. What many are lacking is good communication and mutual understanding between those involved in the software lifecycle. Lack of adequate visibility and effective feedback is also a common issue. Most organisations on both the supplier and enterprise side are working on improvement, but gaps frequently exist in these kinds of areas, which in turn impact software reliability.
Barry Morris: Despite all the processes and tools put in place in dev/test, we still see mission-critical applications being shipped with defects. Worse, they are being shipped with known defects – some of which could turn disastrous. Ticking time bombs really. Why? Because of tricky intermittent failures that no-one can get to the bottom of.
So actually, in a lot of cases, I don’t think that the software quality management practices I see are as effective as they could be.
Q6. How effective are the commonly used troubleshooting and diagnostics techniques?
Dale Vile: As mentioned above, the most common problems I see here are to do with disconnects between the various teams involved. Within the engineering environment, this is often down to developers and quality teams working in silos with inefficient handoffs and ineffective feedback mechanisms. In the enterprise context, it’s the disconnect between application teams, operations staff and even service desk personnel. Added to this, many also struggle to join the dots to figure out what’s going on when problems occur, and communicate insights back to developers so they can take appropriate action. Against this background, it’s not surprising that over 90% of both software makers and enterprises report that issues frequently go undiagnosed and come back to bite in a disruptive and often expensive manner.
Barry Morris: Sometimes, traditional methods troubleshooting methods like printf, logging, or core dump analysis are the right solutions if the team is confident they can isolate the issue quickly. Static and dynamic analysis tools are also good options for certain classes of failures. But in more complex situations, traditional debugging methods don’t help much. If anything, they lead you down the wrong path with false positive and become time-wasting, which leads to serious client dissatisfaction.
Q7. You wrote in your study that the big enemies of stakeholders and user satisfaction are delay and uncertainty. What remedies do exist to alleviate this?
Dale Vile: Beyond the kind of processes and tools we have mentioned…it boils down to effective communication and adequate visibility.
Barry Morris: I think that next-gen troubleshooting systems like software recording technology (such as what we offer at Undo) offer a unique solution to the problem of software reliability. Once we move away from guesswork and use data-driven insight instead, application vendors will be able to resolve the most challenging software defects faster than they have ever been able to do before. The unnecessary delays and uncertainty will be a thing of the past.
Q8. You wrote in your study that software failures are inevitable. It is what happens when they occur that really matters. Can you please explain what do you mean here?
Dale Vile: No one expects perfection; not even business users and stakeholders. So provided the software isn’t wildly buggy or unstable, it mostly comes down to how well you respond when problems occur. What annoys people the most in this respect is not knowing what’s going on. Informing someone that you know what the problem is, but it’s going to take some time to fix, is much better than telling them you have no idea what’s causing their problem. Even better if you can give them a timescale for a resolution, and/or a workaround that doesn’t represent a major inconvenience. Interestingly, if you diagnose and fix a problem quickly, the research suggests that you can actually turn a software incident into a positive experience that enhances satisfaction, confidence and mutual respect.
Q9. What remedies are available for that?
Dale Vile: A big enabler here is a modern approach to diagnostics: having the tooling and the processes in place that allow you to troubleshoot effectively in a complex production environment. Traditional approaches are often undermined by the sheer number of moving parts and dependencies, so you need a way to deal with that. This is where solutions such as program execution recording and replay capability (aka software flight recording technology) can help.
Q10.You wrote in your study that switching decisions are often down to simple economics. Can you please explain what do you mean?
Dale Vile: If an application is continually causing problems, the result is increased cost. At one level, this could be down to the additional resource required to support, maintain and troubleshoot software defects. Often more significant, is the end-user productivity hit that stems from people not able to do their jobs properly and efficiently.
There are then various kinds of opportunity costs, e.g. weeks spent battling unreliable software is time not being spent adding value to the business. In extreme cases, such as when customer facing systems are involved, repeated failure can lead to reputational damage, loss of customer confidence, and ultimately lost revenue and market share. It depends on the organisation and the specific application; but in every case there comes a point when the cost to the business of living with unreliable software is ultimately higher than the cost of switching.
Q11. What are the most effective solutions to software diagnostic processes?
Dale Vile: Solutions that work holistically. It’s about capturing all of the relevant events, inputs and variables, especially at execution time in the production environment; then providing actionable data and insights for engineers to facilitate rapid diagnosis and resolution.
Barry Morris: Dale is right. The most effective solutions to software failure diagnostic are those that provide full visibility and definitive data-driven insight into what your software really did before it crashed or resulted in incorrect behaviour. Software recording technology will speed up time-to-resolution by a factor of 10. But the beauty of this kind of approach is that you can now diagnose even the hardest of bugs that you couldn’t resolve before – just because a recording represents the reproducible test case you couldn’t obtain before.
Q12. What are the main conclusions of your study?
Dale Vile: In summary, in a world where software is critical to the business, applications must be reliable; otherwise damaging and costly disruption will result. With this in mind, it’s important to be able to respond quickly and effectively when problems occur. This shines a clear spotlight on diagnostics – an area in which many have clear room for improvement. New approaches and tools are required here, especially for troubleshooting in complex production environments.
The good news is that technology is emerging that can help, but at the moment we see an awareness gap. Our recommendation is therefore for anyone involved in software delivery and support to get up to speed on what’s available, e.g. from companies like Undo and others.
Qx. Anything else you wish to add?
Dale Vile: When you get right down to it, software reliability is a business issue. One of the most striking findings from the research for me is the level of willingness among enterprise customers to switch solutions and suppliers when the pain and cost of unreliable software gets too high. This should be a wake-up call for ISVs and other software makers, not just to manage product quality, but also to work proactively with customers on preventative diagnostic and remedial activity.
Barry Morris: As systems are becoming more and more complex, troubleshooting is not getting any easier…so has to be data-driven.
With over 25 years’ experience working in enterprise software and database systems, Barry is a prodigious company builder, scaling start-ups and publicly held companies alike. He was CEO of distributed service-oriented architecture (SOA) specialists IONA Technologies between 2000 and 2003 and built the company up to $180m in revenues and a $2bn valuation.
A serial entrepreneur, Barry founded NuoDB in 2008 and most recently served as its Executive Chairman. Barry has now been appointed as CEO in September 2018 to lead Undo‘s high-growth phase.
Dale is a co-founder of Freeform Dynamics, and today runs the company.
He oversees the organisation’s industry coverage and research agenda, which tracks technology trends and developments, along with IT-related buying behaviour among mainstream enterprises, SMBs and public sector organisations.
During his 30 year career, he has worked in enterprise IT delivery with companies such as Heineken and Glaxo, and has held sales, channel management and international market development roles within major IT vendors such as SAP, Oracle, Sybase and Nortel Networks. He also spent a couple of years managing an IT reseller business for Admiral Software.
Dale has been involved in IT industry research since the year 2000 and has a strong reputation for original thinking and alternative perspectives on the latest technology trends and developments. He is a widely published author of books, reports and articles, and is an authoritative and provocative speaker.
Hosted by Prof. Zicari of ODBMS.org and featuring Undo CEO Barry Morris and Distinguished Analyst Dale Vile, Freeform Dynamics, this webinar recording covers:
– New market research – the frequency, types, and economic impact of defects on users and developers of enterprise software.
– The importance of fast diagnostics and swift remediation when problems occur in production.
– How to increase enterprise software reliability with software flight recording technology
“The challenges, impact and solutions to troubleshooting software failures“, Freeform Dynamics. Access the full study report here (LINK registration required).
– On Software Quality. Q&A with Alexander Boehm (SAP) and Greg Law (Undo). ODBMS.org, November 26, 2018.
Dr. Alexander Boehm is a database architect working on SAP´s HANA in-memory database management system. Greg Law is Co-founder and CTO of Undo.
Follow us on Twitter: @odbmsorg