
Software Defect Reduction Top-10 List

Barry Boehm, USC and Victor Basili, U. of Maryland

 

Recently, a grant from the National Science Foundation's Information Technology Research program enabled us to establish a national Center for Empirically-Based Software Engineering (CeBASE).  The CeBASE objective is to transform software engineering as much as possible from a fad-based practice to an engineering-based practice through derivation, organization, and dissemination of empirical data on software development and evolution phenomenology.

 

"As much as possible" reflects the fact that software development will always remain a people-intensive and continuously changing field.  However, in roughly 30 years each of empirical study of software phenomenology, we have found that people in the field have been able to establish objective and quantitative data, relationships, and predictive models which have helped hundreds of thousands of software developers avoid predictable pitfalls and improve their ability to predict and control efficient software projects.

 

As a way of illustrating this, we are devoting this column to an update of one of our previous columns ("Industrial Metrics Top-10 List," by Barry Boehm, IEEE Software September 1987, pp. 84-85) which provided a concise selection of empirical data which many software practitioners found very helpful.  As CeBASE is focusing on two areas of major concern--software defect reduction and COTS-based systems--we will provide a recent defect reduction top-10 list in this column and a COTS-based systems top-10 list in a subsequent column.  Here is the defect reduction list, in rough priority order:

 

1. Finding and fixing a software problem after delivery is often 100 times more expensive than finding and fixing it during the requirements and design phase.

 

This was also the top-priority item in the 1987 list.  As in 1987, "This insight has been a major driver in focusing industrial software practice on thorough requirements analysis and design, on early verification and validation, and on up-front prototyping and simulation to avoid costly downstream fixes."

 

The only thing we have changed since 1987 is to add the word "often," to reflect additional insights on the relationship.  For one, the cost-escalation factor for small, noncritical software systems is more like 5:1 than 100:1, enabling such systems to be developed most efficiently in a less formal, "continuous prototype" mode -- but still with emphasis on getting things right early rather than late.  Another is that the cost-escalation factor can be reduced significantly even for large critical systems via good architectural practices.  These reduce the cost of most fixes by confining them to small, well-encapsulated modules.  An excellent example was the million-line TRW CCPDS-R project described in Appendix D of Walker Royce's Software Project Management: A Unified Framework, Addison-Wesley, 1998, where the cost-escalation factor was only about 2:1.

 

NASA data on inspections and testing support the premise that finding defects earlier in the project development cycle is cheaper than finding them later. At JSC, it took up to twice as long to find a defect during test as during inspection: finding a defect early in the project averaged 1.2 hours for easy defects and 1.4 hours for hard defects, while finding a defect late in the project took 1.5 hours for easy defects and 3 hours for hard defects. At JPL, fixing a defect found in inspections took an average of 0.7 hours, while fixing one found during test took 5 to 18 hours of effort.

 

2. About 40-50% of the effort on current software projects is spent on avoidable rework.

 

      "Avoidable rework" is effort spent fixing difficulties with the software that could have been avoided or discovered earlier and less expensively.  This implies that there is such a thing as "unavoidable rework."  This fact has been increasingly appreciated with the growing realization that better user-interactive systems result from "emergent" processes (where the requirements emerge from prototyping and other multi-stakeholder shared learning activities) than from "reductionist" processes (where the requirements are stipulated in advance and then reduced to practice via design and coding).

 

      We believe that this distinction is essential to a modern theory and practice of software defect reduction.  Changes to the definition of a system that make it more cost-effective should not be discouraged by classifying them as defects to be avoided.  This kind of discouragement usually results in belated recognition that the system is unsatisfactory and large amounts of avoidable rework.  On the other hand, projects need to avoid continuous streams of arbitrary changes, which are not cost-effective either.

 

      Reducing avoidable rework is thus a major source of software productivity improvement.  In our behavioral analysis of the effects of software cost drivers on effort for the COCOMO II model (B. Boehm et al., Software Cost Estimation with COCOMO II, Prentice Hall, 2000) most of the effort savings from improving software process maturity, software architectures, and software risk management came from reductions in avoidable rework.

 

3. About 80% of the avoidable rework comes from 20% of the defects.

 

      For smaller systems, the 80% number may be lower; for very large systems, it may be higher.  On one large TRW-government system, a single avoidable defect in the requirements caused over 100 person-years of rework (B. Boehm, "A Large Sequential-Engineering Near-Disaster," IEEE Computer, March 2000, p.115).  Two major sources of avoidable rework are hastily specified requirements and nominal-case design and development (where late accommodation of off-nominal requirements causes major architecture, design, and code breakage).  Further, if you have a software problem report tracking system which records the effort to fix each defect, it is fairly easy to analyze the data to determine and address additional major sources of rework in your organization.
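The kind of analysis mentioned above can be sketched in a few lines. The following is a minimal illustration, assuming each problem report carries a hypothetical `source` label and an `hours` figure for the effort to fix it. The record layout, the field names, and the sample data are invented for illustration and are not drawn from any of the studies cited in this column.

```python
from collections import defaultdict


def rework_by_source(reports):
    """Rank rework sources by total fix effort.

    Returns a list of (source, hours, cumulative_share) tuples,
    largest effort first. Field names are hypothetical.
    """
    totals = defaultdict(float)
    for report in reports:
        totals[report["source"]] += report["hours"]

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    result = []
    running = 0.0
    for source, hours in ranked:
        running += hours
        result.append((source, hours, running / grand_total))
    return result


# Hypothetical sample data: one record per closed problem report.
sample_reports = [
    {"id": 1, "source": "requirements", "hours": 120.0},
    {"id": 2, "source": "coding", "hours": 3.0},
    {"id": 3, "source": "design", "hours": 40.0},
    {"id": 4, "source": "coding", "hours": 2.0},
    {"id": 5, "source": "requirements", "hours": 30.0},
    {"id": 6, "source": "coding", "hours": 5.0},
]

for source, hours, cumulative in rework_by_source(sample_reports):
    print(f"{source:<14}{hours:>8.1f} h  cumulative {cumulative:.0%}")
```

Reading the cumulative column from the top shows how few sources account for most of the recorded rework, which is where improvement effort is likely to pay off first.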

 

4.  About 80% of the defects come from 20% of the modules and about half the modules are defect free.

 

            There have been several studies over the years that have aimed at identifying high risk components. These studies typically collect defect data from system test through acceptance test to some period of operation.  Representative results are of the form: 76% of the faults come from 20% of the modules (Endres 1975), 83% of the faults come from 12% of modules (Basili, Perricone 1984), 82% of the faults come from 20% of the modules (Khoshgoftaar and Allen 1999), 60% of the faults come from 20% of the modules (Fenton, Ohlsson 2000). The data from different environments over many years is amazingly consistent.

 

      What also appears to be consistent is that all of the defects are contained in about half of the modules. This result is representative of each of the above studies, as well as of (Basili, Selby, Phillips 1983), (Briand, Basili, Hetmanski 1993), and (Khoshgoftaar and Allen 1998).

 

      Thus, it is worth the effort to identify the characteristics of error-prone modules in a particular environment. A variety of factors that contribute to error-proneness appear to be context dependent. However, some factors that commonly contribute are the level of data coupling and cohesion, size, complexity, and amount of change to reused code.
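To check how a particular environment compares with the figures above, one can compute the same two statistics from local fault counts. This is a minimal sketch, assuming a hypothetical mapping from module name to the number of faults recorded against it; the module names and counts below are invented and are not data from the cited studies.

```python
def fault_concentration(faults_per_module, top_fraction=0.20):
    """Return (share of faults in the most fault-prone modules,
    share of modules with no recorded faults)."""
    counts = sorted(faults_per_module.values(), reverse=True)
    total_faults = sum(counts)
    if total_faults == 0:
        raise ValueError("no faults recorded")

    # Number of modules in the top slice; at least one module.
    top_n = max(1, round(len(counts) * top_fraction))
    top_share = sum(counts[:top_n]) / total_faults
    fault_free_share = sum(1 for c in counts if c == 0) / len(counts)
    return top_share, fault_free_share


# Hypothetical fault counts for a ten-module system.
sample_faults = {
    "module_a": 30, "module_b": 12, "module_c": 4, "module_d": 2,
    "module_e": 2, "module_f": 0, "module_g": 0, "module_h": 0,
    "module_i": 0, "module_j": 0,
}

top_share, fault_free = fault_concentration(sample_faults)
print(f"top 20% of modules hold {top_share:.0%} of the faults")
print(f"{fault_free:.0%} of modules have no recorded faults")
```

The modules that land in the top slice are the natural candidates for the characterization of error-proneness discussed above.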

 

5.  About 90% of the downtime comes from at most 10% of the defects.

 

      Not all faults are equal in terms of their rate of occurrence: some defects have a disproportionately large effect on the downtime and reliability of a system. In analyzing the software failure history of nine large IBM software products, (Adams 1984) found a wide range of failure rates (measured in usage months) and a high percentage of very low rate errors.  Based upon the data from those projects, about 0.3% of the defects account for about 90% of the downtime. Thus understanding the operational profile of a system and testing according to that profile is clearly cost effective.

 

6.  Peer reviews catch 60% of the defects.

 

      Given that the cost of finding and fixing most defects rises the later we find them in the lifecycle, we are interested in techniques that find defects earlier in the lifecycle.

The early data from Fagan reported that 67% of the faults were found before unit test using inspections (1976), and later that 93% of all faults were found by inspections (1986). Since then, representative results from a number of studies have supported the evidence that a large percentage of the defects can be caught at earlier phases in the lifecycle, before the test process begins. For example:

 

·        (Collofello, Woodfield 1989) reported that 54% of the design defects were caught by design review,

·        (Kusumoto, Matsumoto, Kikuno, Torii 1992) reported that design and code reviews caught 31% to 50% of the defects in the experimental projects at a training course at Nihon Unisys Ltd.,

·        (Tanaka, Sakamoto, Kusumoto, Matsumoto, Kikuno 1995) reported that code review caught 31.7% of the defects in a study at OMRON, and

·        (Conradi, Marjara, Skatevik 1999) showed that 64% of all registered defects were found by design inspections at Ericsson.

 

Thus the 60% number, which comes from the 1987 column, is still a reasonable estimate.

 

Evidence of the long-range benefits of software inspections is best shown in the NASA space shuttle software study, where inspections were performed on requirements, design, code, test plans, specifications, and procedures for avionics software systems from 1982 to 1985. During this time the operational defect rate was reduced from 2.25 to 0.08 defects/KSLOC, a reduction of more than 95%.

 

Other data in the literature, from (Doolan 1992) and (Russell 1991), report returns of 30 and 33 hours, respectively, for every hour devoted to inspection.

 

      There is also evidence that peer reviews, analysis tools, and testing catch different classes of defects at different points in the development cycle (Basili, Selby 1987). Further empirical research is needed to help choose the best mixed strategy for defect reduction investments.

 

7.  Perspective-based reviews catch 35% more defects than non-directed reviews.

 

      A scenario based reading technique (Basili, V. R., Evolving and Packaging Reading Technologies, Journal of Systems and Software, vol. 38, no. 1, pp. 3-12, July 1997) offers a reviewer a set of formal procedures for defect detection based upon varying perspectives. The union of several perspectives into a single inspection offers broad, yet focused coverage of the document being reviewed. The goal is to generate document and notation specific, focused techniques aimed at specific defect detection goals, taking advantage of the existing defect history in an organization.

 

Scenario-based reading techniques have been applied in requirements and object-oriented design inspections, as well as user interface inspections. Improvement results vary from 15% to 50% in fault detection rate (Porter, Votta, Basili 1995), (Basili, Green, Laitenberger, Shull, Sorumgaard, Zelkowitz 1996), (Zhang, Basili, Shneiderman 1999), (Laitenberger 2000).

 

Thus focusing an operational procedure for reading a software artifact on perspectives and defect classifications can increase the defect detection rate. More importantly, it allows the reviewer to focus on particular classes of defects that may be prevalent for that application and project type. Further benefits of explicit reading techniques are that they facilitate training of inexperienced personnel, better communication about the process, and continual improvement over time.

 

8.  Disciplined personal practices can reduce defect introduction rates by up to 75%.

 

      Several disciplined personal processes have been introduced into practice.  These include Harlan Mills’ Cleanroom software development process and Watts Humphrey’s Personal Software Process. Data from both of them support the concept that personal discipline can greatly reduce the introduction of defects into software products. Data from the use of Cleanroom at NASA have shown failure rates during test reduced by 25% to 75%. Use of Cleanroom also showed a reduction in rework effort, i.e., only 5% of the fixes took more than an hour to fix as opposed to the standard of over 60% of the fixes taking over an hour to fix.

 

The Personal Software Process (PSP) emphasizes best practices for sizing, estimating, planning, checking, reviewing, and controlling an individual's software content, budget, schedule, and quality.  Its strong focus on root-cause analysis of defects and overruns, and on developing personal checklists and practices to avoid future reoccurrence, has a significant effect on personal defect rates.  Reductions of 10:1 are common between exercises 1 and 10 of the PSP training course.

 

Effects at the project level are more scattered.  They depend on such factors as the organization's existing software maturity level and the people's and organization's willingness to operate within a highly structured software culture.  When PSP is coupled with the strongly compatible Team Software Process (TSP), defect reduction rates can be factors of 10 or higher for organizations operating at modest maturity levels, but less if organizations already have highly mature processes.  The June 2000 special issue of CrossTalk, "Keeping Time with PSP and TSP," has a good set of relevant discussions, including experience showing that adding PSP and TSP to a CMM Level 5 organization reduced acceptance test defects by about 50% overall, and about 25% for high-priority defects.

 

9.  All other things being equal, it costs 50% more per source instruction to develop high-dependability software products than to develop low-dependability software products.  However, the investment is more than worth it if significant operations and maintenance costs are involved.

 

The analysis of 161 project data points for the COCOMO II model referenced above resulted in an added cost of 53% for its "Required Reliability" factor, while normalizing for the effects of 22 other factors.  Does this mean that Philip Crosby's landmark book, Quality Is Free (Mentor, 1980), had it all wrong?  Maybe for some low-criticality, short-lifetime software but not for the most important cases. 

 

First, in the COCOMO II maintenance model, low-dependability software is about 50% per instruction more expensive to maintain than to develop, while high-dependability software is about 15% less expensive to maintain than to develop.  For a typical life cycle cost distribution of 30% development and 70% maintenance, low-dependability software becomes about the same in cost per instruction as high-dependability software (again, assuming all other factors are equal).
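The arithmetic behind this comparison can be made explicit. The sketch below is one reading of the paragraph, not the COCOMO II model itself: it normalizes the development cost per instruction of low-dependability software to 1.0, takes high-dependability development as 50% more, applies the stated maintenance ratios to each, and uses the 30%/70% split as weights. The function name and the normalization are assumptions made for illustration.

```python
def life_cycle_cost_per_instruction(dev_cost, maint_ratio,
                                    dev_share=0.30, maint_share=0.70):
    """Weighted cost per instruction over the life cycle.

    dev_cost    -- development cost per instruction (normalized units)
    maint_ratio -- maintenance cost per instruction relative to dev_cost
    """
    maint_cost = dev_cost * maint_ratio
    return dev_share * dev_cost + maint_share * maint_cost


# Assumed normalization: low-dependability development cost = 1.0.
low = life_cycle_cost_per_instruction(dev_cost=1.0, maint_ratio=1.50)
# High-dependability: 50% more to develop, 15% less to maintain than to develop.
high = life_cycle_cost_per_instruction(dev_cost=1.5, maint_ratio=0.85)

print(f"low-dependability:  {low:.4f}")
print(f"high-dependability: {high:.4f}")
```

Under these assumptions the two figures come out at roughly 1.35 and 1.34, which is the sense in which the costs are "about the same."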

 

Second, in the COCOMO II-related quality model, high-dependability software removes about 4 times as many defects as average-dependability software, which in turn removes about 4 times as many defects as low-dependability software.   Thus, if the operational cost of software defects (due to lost worker time, lost sales, recalls, added customer service costs, litigation costs, loss of repeat business, etc.) is roughly equal to life-cycle software development and maintenance costs for average-dependability software, the increased defect rate of low-dependability software will make its ownership costs roughly three times higher than the ownership costs of high-dependability software.
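A rough worked version of this comparison between low- and high-dependability software follows. It rests on several simplifying assumptions that are ours, not the quality model's: life-cycle software cost is normalized to 1.0 at every dependability level (following the "about the same" result above), the operational defect cost of average-dependability software is also 1.0, and "removes 4 times as many defects" is read as leaving one quarter as many residual defects, with defect cost proportional to residual defects.

```python
# Assumed normalized life-cycle development and maintenance cost, all levels.
SOFTWARE_COST = 1.0
# Assumed operational defect cost for average-dependability software.
AVERAGE_DEFECT_COST = 1.0
# Residual defects relative to average dependability (our reading of "4 times").
RESIDUAL_DEFECT_FACTOR = {"low": 4.0, "average": 1.0, "high": 0.25}


def ownership_cost(level):
    """Software cost plus operational defect cost, in normalized units."""
    return SOFTWARE_COST + AVERAGE_DEFECT_COST * RESIDUAL_DEFECT_FACTOR[level]


for level in ("low", "average", "high"):
    print(f"{level:<8} ownership cost: {ownership_cost(level):.2f}")

ratio = ownership_cost("low") / ownership_cost("high")
print(f"low / high ratio: {ratio:.1f}")
```

With these inputs the low-dependability ownership cost is four times the high-dependability cost, that is, three times higher.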

 

10.  About 40-50% of user programs enter use with nontrivial defects.

 

A landmark study in this area was P.S. Brown and J.D. Gould's "An Experimental Study of People Creating Spreadsheets" (ACM Trans. Office Info. Sys., July 1987, pp. 258-272).  It found that 44% of 27 spreadsheet programs produced by experienced spreadsheet developers had nontrivial defects, mostly errors in spreadsheet formulas, even though the developers were quite confident that their spreadsheets were accurate.  Subsequent laboratory experiments have reported defective spreadsheet rates between 35% and 90%.  Analyses of operational spreadsheets have reported defectiveness rates between 21% and 26%; the lower rates are probably due to some operational defect elimination (Chan, Ying, Peh, 2000).

 

Nowadays and increasingly in the future, user programs will escalate from spreadsheets to Web/Internet scripting languages capable of sending agents into cyberspace to make deals for you.  And there will be many more "sorcerer's apprentice" user-programmers with tremendous power to create high-risk defects and little training or expertise in how to avoid or detect them.  One of our studies for the COCOMO II book (page 6) estimated that there would be 55 million user-programmers in the U.S. by the year 2005.  Including active Web-page developers as user-programmers, this prediction is basically on-track.

 

Thus, another challenge for the creators of web-programming facilities is to provide user-programmers with the equivalent of seat belts and air bags, plus safe-driving aids and rules of the road.  This is one of several software engineering research challenges identified by a National Science Foundation study, "Gaining Intellectual Control of Software Development," which we recently summarized in Computer (May 2000, pp. 27-33).

 

There is a great need to refine and expand this top-10 list and related empirical research on defect reduction.

 

Clearly, much of the data reported above does not take into account the interaction of many of the variables.  Some further things you would like to know, for example, are, “If I invest in peer reviewing, Cleanroom, and PSP, am I paying for the same defects to be removed three times?  Will this enable me to avoid doing (some) testing?”  Further empirical research in defect reduction is needed to be able to answer questions like these.

 

We hope to involve the software community in a process of expanding the top-10 defect reduction list and other currently-available data into a continually evolving, open-source, Web-accessible handbook of empirical results on software defect reduction strategies.  We also plan to initiate counterpart handbooks for COTS-based systems and other future software areas.  We would welcome your participation in this effort; please see the CeBASE web site (http://www.cebase.org) for further information and ways of participating.


 

Summary of Top Ten List

  1. Finding and fixing a software problem after delivery is often 100 times more expensive than finding and fixing it during the requirements and design phase.
  2. About 40-50% of the effort on current software projects is spent on avoidable rework.
  3. About 80% of the avoidable rework comes from 20% of the defects.
  4. About 80% of the defects come from 20% of the modules and about half the modules are defect free.
  5. About 90% of the downtime comes from at most 10% of the defects.
  6. Peer reviews catch 60% of the defects.
  7. Perspective-based reviews catch 35% more defects than non-directed reviews.
  8. Disciplined personal practices can reduce defect introduction rates by up to 75%.
  9. All other things being equal, it costs 50% more per source instruction to develop high-dependability software products than to develop low-dependability software products. However, the investment is more than worth it if significant operations and maintenance costs are involved.
  10. About 40-50% of user programs enter use with nontrivial defects.

 

 

(Adams 1984)

E. N. Adams, “Minimizing Cost Impact of Software Defects.” IBM Journal of Research and Development, vol. 28 no. 1, January 1984

 

(Endres 1975)

A. Endres, “An Analysis of Errors and Their Causes in System Programs,” Proceedings of the International Conference on Reliable Software, pp. 327 – 336, 1975.

 

(Basili, Perricone 1984)

Victor R. Basili and Barry Perricone, Software Errors and Complexity: An Empirical Investigation, Communications of the ACM, vol. 27, #1, pp 42-52, January 1984.

 

(Khoshgoftaar and Allen 1999)

Taghi M. Khoshgoftaar and Edward B. Allen, A Comparative Study of Ordering and Classification of Fault-Prone Software Modules, Empirical Software Engineering Journal, Volume 4, Number 2, pp. 159-186, June 1999.

 

(Fenton, Ohlsson 2000)

N. E. Fenton and N. Ohlsson, “Quantitative Analysis of Faults and Failures in a Complex Software System,” IEEE Transactions on Software Engineering, Volume 26, Number 8, pp.797 – 814, 2000.

 

(Basili, Selby, Phillips 1983)

Victor R. Basili, Richard Selby and Tsai-Yun Phillips, Metric Analysis and Data Validation Across FORTRAN Projects, IEEE Transactions on Software Engineering, vol. SE-9, #6, pp 652-663, November 1983.

 

(Briand, Basili, Hetmanski 1993)

Lionel C. Briand, Victor R. Basili, and Christopher J. Hetmanski, Developing Interpretable Models for Identifying High Risk Software Components, IEEE Transactions on Software Engineering, Volume 19, Number 11, pp 1028-1044, November 1993.

 

(Khoshgoftaar and Allen 1998)

Taghi M. Khoshgoftaar and Edward B. Allen, Classification of Fault-Prone Software Modules: Prior Probabilities, Costs, and Model Evaluation, Empirical Software Engineering Journal, Volume 3, Number 3, pp. 275-298, September 1998.

 

(Collofello, Woodfield 1989)

J. S. Collofello and S. N. Woodfield, “Evaluating the effectiveness of reliability-assurance techniques,” J. Syst. & Software, 9, 3, pp. 191-195 (1989).

 

(Miller 1990)

(Kusumoto, Matsumoto, Kikuno, Torii 1992)

Shinji Kusumoto, Ken-ichi Matsumoto, Tohru Kikuno and Koji Torii, “A new metric for cost effectiveness of software reviews,” IEICE Transactions on Information and Systems, Vol. E75-D, No. 5, pp. 674-680 (Sept. 1992).

 

(Tanaka, Sakamoto, Kusumoto, Matsumoto, Kikuno 1995)

Toshifumi Tanaka, Keishi Sakamoto, Shinji Kusumoto, Ken-ichi Matsumoto and Tohru Kikuno, “Improvement of software process by process description and benefit estimation,” Proc. of the 17th International Conference on Software Engineering, pp. 123-132 (Seattle, April 1995).

 

(Conradi, Marjara, Skatevik 1999)

A. Marjara, R. Conradi, and B. Skatevik, An Empirical Study of Inspection and Testing Data at Ericsson, Proceedings of the 24th Annual Software Engineering Workshop, December 2, 1999, Greenbelt, Maryland.

 

(Doolan 1992)

E.P. Doolan, “Experiences with Fagan’s Inspection Method,” Software: Practice and Experience, Vol. 22, No. 2, pp. 173-182, February 1992.

 

(Russell 1991)

G. W. Russell, “Experience with Inspections in Ultralarge-Scale Developments,” IEEE Software, Vol. 8, No.1, pp. 25 – 31, January 1991.

 

(Basili 1997)

Basili, V. R., Evolving and Packaging Reading Technologies, Journal of Systems and Software, vol. 38, no. 1, pp. 3-12, July 1997

 

(Porter, Votta, Basili 1995)

A. A. Porter, L. G. Votta and V. R. Basili, Comparing Detection Methods for Software Requirements Inspections: A Replicated Experiment, IEEE Transactions on Software Engineering, Volume 21, Number 6, pp 563-575, June 1995.

 

(Basili, Green, Laitenberger, Shull, Sorumgaard, Zelkowitz 1996)

Victor R. Basili, Scott Green, Oliver Laitenberger, Forrest Shull, Sivert L. Sorumgaard and Marvin V. Zelkowitz, The Empirical Investigation of Perspective-based Reading, Empirical Software Engineering, An International Journal, Volume 1, Number 2, pp 133-164, Kluwer Academic Publishers, October 1996.

 

(Zhang, Basili, Shneiderman 1999)

Z. Zhang, V. Basili, and B. Shneiderman, Perspective-based Usability Inspection: An Empirical Validation of Efficacy, Empirical Software Engineering: An International Journal, Volume 4, No. 1, March 1999.

 

(Laitenberger 2000)

O. Laitenberger, C. Atkinson, M. Schlich, K. El Emam, An experimental comparison of reading techniques for defect detection in UML design documents, Journal of Systems and Software, 53 (2000), 183-204.

 

 

(Basili, Selby 1987)

Victor R. Basili and Richard Selby, Comparing the Effectiveness of Software Testing Strategies, IEEE Transactions on Software Engineering, pp 1278-1296, December 1987.

 

 

(Chan, Ying, Peh, 2000)

H.C. Chan, C. Ying, and C.B. Peh, “Strategies and Visualization Tools for Enhancing User Auditing of Spreadsheet Models,” Information and Software Technology, December 2000, pp. 1037-1043.

