Software Defect Reduction Top-10 List
Barry Boehm, USC and
Victor Basili, U. of Maryland
Recently, a grant from the National Science
Foundation's Information Technology Research program enabled us to establish a
national Center for Empirically-Based Software Engineering (CeBASE). The CeBASE objective is to transform
software engineering as much as possible from a fad-based practice to an
engineering-based practice through derivation, organization, and dissemination
of empirical data on software development and evolution phenomenology.
"As much as possible" reflects the
fact that software development will always remain a people-intensive and
continuously changing field. However,
in roughly 30 years each of empirical study of software phenomenology, we have
found that people in the field have been able to establish objective and quantitative
data, relationships, and predictive models which have helped hundreds of
thousands of software developers avoid predictable pitfalls and improve their
ability to predict and control efficient software projects.
As a way of illustrating this, we are devoting
this column to an update of one of our previous columns ("Industrial
Metrics Top-10 List," by Barry Boehm, IEEE Software September 1987, pp.
84-85) which provided a concise selection of empirical data which many software
practitioners found very helpful. As
CeBASE is focusing on two areas of major concern--software defect reduction and
COTS-based systems--we will provide a recent defect reduction top-10 list in
this column and a COTS-based systems top-10 list in a subsequent column. Here is the defect reduction list, in rough
priority order:
1. Finding and fixing a
software problem after delivery is often 100 times more expensive than finding
and fixing it during the requirements and design phase.
This was also the top-priority item in the 1987
list. As in 1987, "This insight has
been a major driver in focusing industrial software practice on thorough
requirements analysis and design, on early verification and validation, and on
up-front prototyping and simulation to avoid costly downstream fixes."
The only thing we have changed since 1987 is to
add the word "often," to reflect additional insights on the
relationship. For one, the
cost-escalation factor for small, noncritical software systems is more like 5:1
than 100:1, enabling such systems to be developed most efficiently in a less
formal, "continuous prototype" mode -- but still with emphasis on
getting things right early rather than late.
Another is that the cost-escalation factor can be reduced significantly
even for large critical systems via good architectural practices. These reduce the cost of most fixes by
confining them to small, well-encapsulated modules. An excellent example was the million-line TRW CCPDS-R project
described in Appendix D of Walker Royce's Software Project Management: A
Unified Framework, Addison-Wesley, 1998, where the cost-escalation factor
was only about 2:1.
NASA data on inspections and testing supports the premise that finding
defects earlier in the project development cycle is cheaper than finding them
later. At JSC, it was up to two times easier to find a defect during inspection
than to find it during test: finding a defect early in the project averaged
1.2 hours for easy defects and 1.4 hours for hard defects, while finding a
defect late in the project took 1.5 hours for easy defects and 3 hours for hard
defects. At JPL, fixing a defect found in inspections took an average of
0.7 hours, while fixing it during test took 5 to 18 hours of effort.
2. About 40-50% of the effort
on current software projects is spent on avoidable rework.
"Avoidable
rework" is effort spent fixing difficulties with the software that could
have been avoided or discovered earlier and less expensively. This implies that there is such a thing as
"unavoidable rework." This
fact has been increasingly appreciated with the growing realization that better
user-interactive systems result from "emergent" processes (where the
requirements emerge from prototyping and other multi-stakeholder shared
learning activities) than from "reductionist" processes (where the
requirements are stipulated in advance and then reduced to practice via design
and coding).
We
believe that this distinction is essential to a modern theory and practice of
software defect reduction. Changes to
the definition of a system that make it more cost-effective should not be
discouraged by classifying them as defects to be avoided. This kind of discouragement usually results
in belated recognition that the system is unsatisfactory, and in large amounts of
avoidable rework. On the other hand,
projects need to avoid continuous streams of arbitrary changes, which are not
cost-effective either.
Reducing
avoidable rework is thus a major source of software productivity
improvement. In our behavioral analysis
of the effects of software cost drivers on effort for the COCOMO II model (B.
Boehm et al., Software Cost Estimation
with COCOMO II, Prentice Hall, 2000) most of the effort savings from
improving software process maturity, software architectures, and software risk
management came from reductions in avoidable rework.
3. About
80% of the avoidable rework comes from 20% of the defects.
For
smaller systems, the 80% number may be lower; for very large systems, it may be
higher. On one large TRW-government system,
a single avoidable defect in the requirements caused over 100 person-years of
rework (B. Boehm, "A Large Sequential-Engineering Near-Disaster,"
IEEE Computer, March 2000,
p.115). Two major sources of avoidable
rework are hastily-specified requirements and nominal-case design and
development (where late accommodation of off-nominal requirements causes major
architecture, design, and code breakage).
Further, if you have a software problem report tracking system which
records the effort to fix each defect, it is fairly easy for you to analyze the
data to determine and address additional major sources of rework in your
organization.
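As a minimal sketch of that kind of analysis, the Python below ranks defects by
recorded fix effort and reports what share of the total effort the costliest 20%
account for. The function name, the record layout (defect id mapped to hours), and
the sample figures are all hypothetical, invented for illustration; they are not
data from any of the studies cited here.

```python
from typing import Dict, List, Tuple


def rework_concentration(effort_by_defect: Dict[str, float],
                         top_fraction: float = 0.2) -> Tuple[List[str], float]:
    """Return the costliest defects and their share of total fix effort."""
    if not effort_by_defect:
        return [], 0.0
    # Rank defects from most to least expensive to fix.
    ranked = sorted(effort_by_defect.items(), key=lambda kv: kv[1], reverse=True)
    # Always examine at least one defect, even for very small data sets.
    top_n = max(1, round(len(ranked) * top_fraction))
    top = ranked[:top_n]
    total = sum(effort_by_defect.values())
    share = sum(hours for _, hours in top) / total if total else 0.0
    return [defect_id for defect_id, _ in top], share


# Hypothetical fix-effort records (defect id -> hours), invented for illustration.
SAMPLE_EFFORT = {
    "PR-101": 120.0, "PR-102": 3.0, "PR-103": 1.5, "PR-104": 40.0, "PR-105": 2.0,
    "PR-106": 0.5, "PR-107": 4.0, "PR-108": 1.0, "PR-109": 6.0, "PR-110": 2.0,
}

top_ids, top_share = rework_concentration(SAMPLE_EFFORT)
print(top_ids, f"{top_share:.0%}")
```

Reading the problem reports behind the returned ids is then the step that suggests
which sources of rework to address first.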
4. About 80% of the defects come from 20% of
the modules and about half the modules are defect free.
There
have been several studies over the years that have aimed at identifying high
risk components. These studies typically collect defect data from system test
through acceptance test to some period of operation. Representative results are of the form: 76% of the faults come
from 20% of the modules (Endres 1975), 83% of the faults come from 12% of
modules (Basili, Perricone 1984), 82% of the faults come from 20% of the
modules (Khoshgoftaar and Allen 1999), 60% of the faults come from 20% of the
modules (Fenton, Ohlsson 2000). The data from different environments over many
years is amazingly consistent.
What
also appears to be consistent is that all of the defects are contained in about
half of the modules. This finding is representative of each of the above studies,
as well as of (Basili, Selby, Phillips 1983), (Briand, Basili, Hetmanski 1993),
and (Khoshgoftaar and Allen 98).
Thus,
it is worth the effort to identify the characteristics of error-prone modules
in a particular environment. Many of the factors that contribute to
error-proneness appear to be context dependent. However, some factors that
commonly contribute are the level of data coupling and cohesion,
size, complexity, and amount of change to reused code.
5. About
90% of the downtime comes from at most 10% of the defects.
It
is obvious that not all faults are equal in terms of their rate of occurrence.
That is, some defects have a disproportionately greater effect on the downtime and
reliability of a system than others. In analyzing the software failure history
of nine large IBM software products, (Adams 1984) found a wide range of failure
rates (measured in usage months) and a high percentage of very low rate errors.
Based upon the data from those projects,
about 0.3% of the defects account for about 90% of the downtime. Thus
understanding the operational profile of a system and testing according to
that profile is clearly cost effective.
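One way to picture testing according to an operational profile is to draw test
operations at random, weighted by how often each operation occurs in the field.
The sketch below does only that; the operation names and usage fractions are
hypothetical, and a real profile would come from measured usage data.

```python
import random
from collections import Counter
from typing import Dict, List


def sample_operations(profile: Dict[str, float], n_tests: int, seed: int = 0) -> List[str]:
    """Draw n_tests operations with probability proportional to their usage weight."""
    rng = random.Random(seed)  # fixed seed so a test plan can be reproduced
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=n_tests)


# Hypothetical operational profile: fraction of field usage for each operation.
PROFILE = {"lookup": 0.70, "update": 0.25, "bulk_import": 0.05}

plan = sample_operations(PROFILE, n_tests=1000)
counts = Counter(plan)
print(counts.most_common())
```

Most of the test budget then lands on the operations users exercise most, which is
where high-rate defects are most likely to surface.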
6. Peer reviews catch 60% of the defects.
Given
that the cost of finding and fixing most defects rises the later we find them
in the lifecycle, we are interested in techniques that find defects earlier in
the lifecycle.
The early data from Fagan reported that 67% of the faults were
found before unit test using inspections (1976), and later that 93% of all faults
were found by inspections (1986). Since then, representative results from a number of
studies have supported the evidence that a large percentage of the defects can be
caught at earlier phases in the lifecycle, before the test process begins. For
example:
· (Collofello, Woodfield 1989) reported that 54% of the design defects were
caught by design review,
· (Kusumoto, Matsumoto, Kikuno, Torii 1992) reported that design and code
reviews caught 31% to 50% of the defects in the experimental projects at a
training course at Nihon Unisys Ltd.,
· (Tanaka, Sakamoto, Kusumoto, Matsumoto, Kikuno 1995) reported that code
review caught 31.7% of the defects in a study at OMRON, and
· (Conradi, Marjara, Skatevik 1999) showed that 64% of all registered defects
were found by design inspections at Ericsson.
Thus the 60% number, which comes from the 1987 column, is still a
reasonable estimate.
Evidence of the long-range benefits of software
inspections is best shown in the NASA space shuttle software study, where
inspections were performed on requirements, design, code, test plans,
specifications, and procedures for avionics software systems from 1982 to 1985.
During this time the operational defect rate was reduced from 2.25 to 0.08
defects/KSLOC, a reduction of more than 95%.
Other data in the literature from (Doolan 1992)
and (Russell 1991) report returns of 30 and 33 hours, respectively, for every
hour devoted to inspection.
There
is also evidence that peer reviews, analysis tools, and testing catch different
classes of defects at different points in the development cycle (Basili, Selby
1987). Further empirical research is needed to help choose the best mixed
strategy for defect reduction investments.
7. Perspective-based reviews catch 35% more
defects than non-directed reviews.
A
scenario-based reading technique (Basili 1997) offers a
reviewer a set of formal procedures for defect detection based upon varying
perspectives. The union of several perspectives into a single inspection offers
broad, yet focused coverage of the document being reviewed. The goal is to
generate focused techniques that are specific to the document and notation and
aimed at specific defect detection goals, taking advantage of the existing
defect history in an organization.
Scenario-based reading techniques have been
applied in requirements and object-oriented design inspections, as well as user
interface inspections. Improvement results vary from 15% to 50% in fault
detection rate (Porter, Votta, Basili 1995), (Basili, Green,
Laitenberger, Shull, Sorumgaard, Zelkowitz 1996), (Zhang, Basili, Shneiderman
99), (Laitenberger 00).
Thus focusing an operational procedure for
reading a software artifact on perspectives and defect classifications can
increase the defect detection rate. More importantly, it allows the reviewer to
focus on particular classes of defects that may be prevalent for that application
and project type. Further benefits of explicit reading techniques are that they
facilitate training of inexperienced personnel, better communication about the
process, and continual improvement over time.
8. Disciplined personal practices can reduce
defect introduction rates by up to 75%.
Several
disciplined personal processes have been introduced into practice. These include Harlan Mills’ Cleanroom
software development process and Watts Humphrey’s Personal Software Process.
Data from both of them support the concept that personal discipline can greatly
reduce the introduction of defects into software products. Data from the use of
Cleanroom at NASA have shown failure rates during test reduced by 25% to 75%.
Use of Cleanroom also showed a reduction in rework effort: only 5% of the
fixes took more than an hour, as opposed to the standard of over 60% of
the fixes taking over an hour.
The Personal Software Process (PSP) emphasizes
best practices for sizing, estimating, planning, checking, reviewing, and
controlling an individual's software content, budget, schedule, and
quality. Its strong focus on root-cause
analysis of defects and overruns, and on developing personal checklists and
practices to avoid future reoccurrence, has a significant effect on personal
defect rates. Reductions of 10:1 are
common between exercises 1 and 10 of the PSP training course.
Effects at the project level are more
scattered. They depend on such factors
as the organization's existing software maturity level and the willingness of its
people and the organization itself to operate within a highly structured software
culture. When PSP is coupled with the
strongly compatible Team Software Process (TSP), defect reduction rates can be
factors of 10 or higher for organizations operating at modest maturity levels,
but less if organizations already have highly mature processes. The June 2000 special issue of CrossTalk,
"Keeping Time with PSP and TSP," has a good set of relevant
discussions, including experience showing that adding PSP and TSP to a CMM
Level 5 organization reduced acceptance test defects by about 50% overall, and
about 25% for high-priority defects.
9. All other things being equal, it costs 50% more
per source instruction to develop high-dependability software products than to
develop low-dependability software products.
However, the investment is more than worth it if significant operations
and maintenance costs are involved.
The analysis of 161 project data points for the
COCOMO II model referenced above resulted in an added cost of 53% for its
"Required Reliability" factor, while normalizing for the effects of
22 other factors. Does this mean that
Philip Crosby's landmark book, Quality Is Free (Mentor, 1980), had it
all wrong? Maybe for some low-criticality,
short-lifetime software but not for the most important cases.
First, in the COCOMO II maintenance model,
low-dependability software is about 50% more expensive per instruction to
maintain than to develop, while high-dependability software is about 15% less
expensive to maintain than to develop.
For a typical life cycle cost distribution of 30% development and 70%
maintenance, low-dependability software becomes about the same in cost per
instruction as high-dependability software (again, assuming all other factors
are equal).
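The arithmetic behind this comparison can be sketched as follows. This is one
reading of the figures in the text, not the COCOMO II model itself: it assumes
low-dependability development cost is normalized to 1.0 per instruction, and that
the 30%/70% split is applied as weights on the per-instruction development and
maintenance costs. The function and variable names are illustrative.

```python
def life_cycle_cost(dev_cost: float, maint_to_dev_ratio: float,
                    dev_share: float = 0.30, maint_share: float = 0.70) -> float:
    """Weighted per-instruction cost across development and maintenance."""
    maint_cost = dev_cost * maint_to_dev_ratio
    return dev_share * dev_cost + maint_share * maint_cost


# Assumption: low-dependability development cost is normalized to 1.0 per instruction.
LOW_DEV = 1.00
HIGH_DEV = 1.50  # item 9: about 50% more per source instruction to develop

low_total = life_cycle_cost(LOW_DEV, maint_to_dev_ratio=1.50)    # 50% more to maintain
high_total = life_cycle_cost(HIGH_DEV, maint_to_dev_ratio=0.85)  # 15% less to maintain

print(f"low: {low_total:.3f}  high: {high_total:.3f}")
```

Under these assumptions the two totals differ by less than 1%, which is consistent
with the "about the same" conclusion above.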
Second, in the COCOMO II-related quality model,
high-dependability software removes about 4 times as many defects as
average-dependability software, which in turn removes about 4 times as many
defects as low-dependability software.
Thus, if the operational cost of software defects (due to lost worker
time, lost sales, recalls, added customer service costs, litigation costs, loss
of repeat business, etc.) is roughly equal to life-cycle software development
and maintenance costs for average-dependability software, the increased defect
rate of low-dependability software will make its ownership costs roughly three
times higher than the ownership costs of high-dependability software.
10. About 40-50% of user programs enter use with nontrivial
defects.
A landmark study in this area was P.S. Brown
and J.D. Gould's, "An Experimental Study of People Creating
Spreadsheets," (ACM Trans. Office Info. Sys., July 1987, pp.
258-272). It found that 44% of 27 spreadsheet
programs produced by experienced spreadsheet developers had nontrivial defects:
mostly errors in spreadsheet formulas. The developers were quite confident that
their spreadsheets were accurate.
Subsequent laboratory experiments have reported defective spreadsheet
rates between 35% and 90%. Analyses of
operational spreadsheets have reported defectiveness rates between 21% and 26%;
the lower rates are probably due to some operational defect elimination (Chen,
Ying, Peh, 2000).
Nowadays and increasingly in the future, user
programs will escalate from spreadsheets to Web/Internet scripting languages
capable of sending agents into cyberspace to make deals for you. And there will be many more "sorcerer's
apprentice" user-programmers with tremendous power to create high-risk
defects and little training or expertise in how to avoid or detect them. One of our studies for the COCOMO II book
(page 6) estimated that there would be 55 million user-programmers in the U.S.
by the year 2005. Including active
Web-page developers as user-programmers, this prediction is basically on-track.
Thus, another challenge for the creators of
web-programming facilities is to provide user-programmers with the equivalent of seat belts
and air bags, plus safe-driving aids and rules of the road. This is one of several software engineering
research challenges identified by a National Science Foundation study,
"Gaining Intellectual Control of Software Development," which we
recently summarized in Computer (May 2000, pp. 27-33).
There is a great need to refine and expand this top-10 list and related
empirical research on defect reduction.
Clearly, much of the data reported above does not
take into account the interaction of many of the variables. Some further
things you would like to know, for example, are, “If I invest in peer reviewing,
Cleanroom, and PSP, am I paying for the same defects to be removed three
times? Will this enable me to avoid
doing (some) testing?” Further empirical
research in defect reduction is needed to be able to answer questions like
these.
We hope to involve the software community in a
process of expanding the top-10 defect reduction list and other
currently-available data into a continually evolving, open-source,
Web-accessible handbook of empirical results on software defect reduction
strategies. We also plan to initiate
counterpart handbooks for COTS-based systems and other future software
areas. We would welcome your
participation in this effort; please see the CeBASE web site (http://www.cebase.org) for further
information and ways of participating.
(Adams 1984)
E. N. Adams, “Minimizing Cost Impact of Software Defects.” IBM Journal
of Research and Development, vol. 28 no. 1, January 1984
(Endres 1975)
A. Endres, “An Analysis of Errors and Their
Causes in System Programs,” Proceedings of the International Conference on
Reliable Software, pp. 327 – 336, 1975.
(Basili, Perricone 1984)
Victor R. Basili and Barry Perricone, Software
Errors and Complexity: An Empirical Investigation, Communications of the
ACM, vol. 27, #1, pp 42-52, January 1984.
(Khoshgoftaar and Allen 1999)
Taghi M. Khoshgoftaar and Edward B. Allen, A
Comparative Study of Ordering and Classification of Fault-Prone Software
Modules, Empirical Software Engineering Journal, Volume 4, Number 2, pp.
159-186, June 1999.
(Fenton, Ohlsson 2000)
N. E. Fenton and N. Ohlsson, “Quantitative
Analysis of Faults and Failures in a Complex Software System,” IEEE
Transactions on Software Engineering, Volume 26, Number 8, pp.797 – 814, 2000.
(Basili, Selby, Phillips 1983)
Victor R. Basili, Richard Selby and
Tsai-Yun Phillips, Metric Analysis and Data Validation Across FORTRAN Projects,
IEEE Transactions on Software Engineering, vol. SE-9, #6, pp 652-663, November
1983.
(Briand, Basili, Hetmanski 1993)
Lionel C. Briand, Victor R. Basili, and
Christopher J. Hetmanski, Developing
Interpretable Models for Identifying High Risk Software Components, IEEE
Transactions on Software Engineering, Volume 19, Number 11, pp 1028-1044,
November 1993.
(Khoshgoftaar and Allen 98)
Taghi M. Khoshgoftaar and Edward B. Allen,
Classification of Fault-Prone Software Modules: Prior Probabilities, Costs, and
Model Evaluation, Empirical Software Engineering Journal, Volume 3, Number 3,
pp. 275-298, September 1998.
(Collofello, Woodfield 1989)
J. S. Collofello and S. N. Woodfield, “Evaluating the
effectiveness of reliability-assurance techniques,” J. Syst. & Software,
9, 3, pp. 191-195 (1989).
(Kusumoto, Matsumoto, Kikuno, Torii 1992)
Shinji Kusumoto, Ken-ichi Matsumoto, Tohru Kikuno
and Koji Torii, “A new metric for cost effectiveness of software reviews,”
IEICE Transactions on Information and Systems, Vol. E75-D, No. 5,
pp. 674-680 (Sept. 1992).
(Tanaka, Sakamoto, Kusumoto, Matsumoto, Kikuno 1995)
Toshifumi Tanaka, Keishi Sakamoto, Shinji
Kusumoto, Ken-ichi Matsumoto and Tohru Kikuno, “Improvement of software process
by process description and benefit estimation,” Proc. of the 17th International
Conference on Software Engineering, pp. 123-132 (Seattle, April 1995).
(Conradi, Marjara, Skatevik 1999)
A. Marjara, R. Conradi, and B. Skatevik, An Empirical Study of
Inspection and Testing Data at Ericsson, Proceedings of the 24th
Annual Software Engineering Workshop, December 2, 1999, Greenbelt, Maryland.
(Doolan 1992)
E.P. Doolan, “Experiences with Fagan’s
Inspection Method,” Software – Practice and Experience, Vol. 22, No. 2, pp.
173-182, February 1992.
(Russell 1991)
G. W. Russell, “Experience with Inspections in
Ultralarge-Scale Developments,” IEEE Software, Vol. 8, No.1, pp. 25 – 31,
January 1991.
(Basili 1997)
Basili, V. R., Evolving and Packaging Reading Technologies,
Journal of Systems and Software, vol. 38, no. 1, pp. 3-12, July 1997
(Porter,Votta, Basili 1995)
A. A. Porter, L. G. Votta and V. R.
Basili, Comparing
Detection Methods for Software Requirements Inspections: A Replicated
Experiment, IEEE Transactions on Software Engineering, Volume 21, Number 6,
pp 563-575, June 1995.
(Basili, Green, Laitenberger, Shull, Sorumgaard, Zelkowitz 1996)
Victor R. Basili, Scott Green, Oliver Laitenberger, Forrest Shull,
Sivert L. Sorumgaard and Marvin V. Zelkowitz, The
Empirical Investigation of Perspective-based Reading, Empirical Software
Engineering, An International Journal, Volume 1, Number 2, pp 133-164, Kluwer
Academic Publishers, October 1996.
(Zhang, Basili, Shneiderman 99)
Z. Zhang, V. Basili, and B.
Shneiderman, Perspective-based
Usability Inspection: An Empirical Validation of Efficacy, Empirical
Software Engineering: An International Journal, Volume 4, No. 1, March 1999.
(Laitenberger 00)
O. Laitenberger, C. Atkinson, M. Schlich, K. El Emam, An experimental
comparison of reading techniques for defect detection in UML design documents,
Journal of Systems and Software, 53 (2000), 183-204.
(Basili, Selby 1987)
Victor R. Basili and Richard Selby, Comparing
the Effectiveness of Software Testing Strategies, IEEE Transactions on
Software Engineering, pp 1278-1296, December 1987.
(Chen, Ying, Peh, 2000)
H.C. Chen, C. Ying, and C.B. Peh, “Strategies and Visualization Tools for
Enhancing User Auditing of Spreadsheet Models,” Information and Software
Technology, December 2000, pp. 1037-1043.