eWorkshop on Software Inspections and Pair Programming

CeBASE and Visek conducted a joint eWorkshop on December 16, 2003, to discuss the practices of pair programming and software inspections. Although in many ways dissimilar, both practices share the aim of supporting the development of quality software, with minimal defects, through structured collaboration among developers/reviewers. In fact, one of the motivating factors behind the development of pair programming was to increase the effectiveness of code inspections by moving them earlier in the development lifecycle and doing them “all the time[1].” We were very interested in investigating whether pair programming (PP) succeeds in its goal of providing the same or improved benefits as inspections, at what cost, and more generally whether the two practices are complementary and under what circumstances each makes the most sense.

 

Tom Gilb may have summed up the goals of the discussion best when he said, “My position is that they are two different and complementary techniques. We need to understand their costs and benefits quantitatively, and their best practice modes” [78, agreed to by Denger, Basili, Arisholm, Wiegers]. To achieve this, we were happy to have the input of a very lively set of over 20 participants from 5 different countries and 6 different time zones.

 

Summary: There was a lot of good discussion but little consensus among participants. Although many suggestions were made as to features of both practices, few participants supported or refuted each other’s statements. Our discussion seemed very effective at raising the important points of comparison between the two practices, but there may not yet be enough data or experiences to permit a useful evaluation along all these dimensions. There was at least an informal consensus that the two practices should be complementary rather than exclusionary, under the right circumstances, but more experience is necessary to refine their exact contributions.

 

Details: During the discussion, the following important points were raised as areas of comparison:

 

Effects on quality (defect slippage)

One of the main claims of both PP and inspections is that they raise the quality of the product. One direct measure of quality is the number of defects in a product.

  • Participants agreed that both practices focus on improving quality, which in this discussion seemed to be measured by the number of software defects that slip through the practice.
  • Premeeting feedback summarized by Dieter Rombach seemed to indicate that inspection allowed higher quality (lower defect slippage) to be achieved, but at a higher cost than PP. Conversely, PP could achieve some level of software quality very easily, but it was very hard to get very high quality using this practice.
  • This was one question for which there was a large body of data, at least regarding inspections. Inspection benefits are well documented [Gilb 197] especially with respect to reduced defect slippage [confirmed by vote 260]. There is even data regarding which types of defects inspections are better at detecting [confirmed by vote 269], for example: Inspections are effective for finding defects such as programming blunders, logic errors, interface errors, and omissions, but not so effective for errors of timing, program dynamics, and numerical approximation [Boehm 242].
  • In contrast, while PP is commonly hypothesized to lead to lower defect injection rates and reduced defect slippage [confirmed by vote: 295], PP benefits do not seem to be well documented, especially with respect to reduced defect slippage [confirmed by vote: 273].
  • There was some evidence that PP has a positive effect on types of quality other than correctness, in particular that it leads to more maintainable code. According to a small pilot experiment conducted in industry, pairs produced solutions that were assessed to have better maintainability than those of solo developers [Arisholm 192].

 

Feedback cycle

The feedback cycle is the amount of time between committing a defect in software development, and detecting and removing that defect.

·        Participants argued that PP has a shorter feedback cycle.

o       This is a strength because the feedback has a greater ‘present value,’ i.e. it’s more cost effective overall to correct something as soon as it enters the system, than to correct it once it has spent some time in the system and possibly led to further problems [Krebs 111, Gilb 139, Ambler 164, McConnell 314]

o       As a result, the feedback cycle is “much more personal and individual” [Manzo 166]. [However, it wasn’t clear if this is a strength or a weakness: People may learn better when learning from their own personal mistakes, or it may slow down team learning by relying on personal feedback.]

·        Inspections have a longer feedback cycle than PP, which is a weakness because:

o       Inspection might wait until “too much damage is done,” while PP gives immediate feedback during development [Gilb 139].

o       Moreover, “if people think the work product is done, they can be psychologically resistant to making changes suggested by inspection.” Therefore it’s better to remove defects quickly, when it’s clear the product is still under construction, either by PP or by early incremental peer review [Wiegers 399].

·        The implication seemed to be that this weakness of inspections was due to the fact that inspections have an associated significant cost and time requirement, and hence have to wait until a product is stable [Rumpe 378], while PP can be applied earlier in the development process.

·        There are ways to mitigate the long feedback cycle in an inspection: A good heuristic is to start inspections as soon as 10% of the document is available, rather than waiting until the whole document is done, at which point it may be more costly to repair all the defects [Wiegers 159]. [However, this carries the risk that the review might address defects that are no longer important in a further iteration of the product (when another 10% of the document is done, the defect might be resolved anyway).]

·        Because the PP feedback cycle is so short, there was a side discussion about whether to describe the PP contribution as very fast defect detection and removal, or as defect prevention.

·         Some participants focused on one person in the pair detecting his/her partner’s defects: “PP… shortens the cycle time in detecting the defect to practically zero because the other person in the pair sees the error quickly” [McConnell 314].

·         Other participants focused on PP as a collaborative effort: “When a pair works together, they brainstorm and negotiate quite a bit (for a few seconds/a few minutes) before the keyboard is touched. I feel that is defect prevention” [Williams 321].

 

Third party perspective

The third party perspective is an additional view on the document under inspection; that is, people who are not directly related to the document under inspection provide another view on the product’s quality.

·        The premeeting feedback indicated that a perceived strength of inspections is that they can be more objective because they provide a third-party perspective, i.e. they provide feedback regarding system quality from technical personnel who may not have been responsible for its construction.

·        A related point is that the third-party inspectors can be chosen so as to maximize quality checking on certain attributes. In this way, inspections also allow incorporating multiple quality foci or perspectives. “One advantage of inspections is that you can work on multiple qualities. Perspective-based inspections enable artifacts to be reviewed by experts in safety, usability, performance, etc.” [Boehm 213, agreement from: Denger, Rumpe, Wiegers, Ambler].

·        However, an associated danger is that developers are afraid of being embarrassed in front of outside inspectors, and so expend too much energy perfecting the product before asking for an outside perspective [Wiegers 421].

·        PP, conversely, lacks the external, 3rd party perspective of a reviewer who isn’t “immersed in the product, and [hasn’t] absorbed all of its assumptions.” An outside perspective, although slower, often reveals insights that people too close to the work didn’t spot [Wiegers 172]

·        Some of the benefits of outside, objective, or focused reviewers can be achieved via pair rotation and collective ownership [Maurer 351, Ambler 170, 180], or by augmenting PP with other agile methods like “proving it with code[2]” [Ambler 356, 360].

·        There was some discussion over how serious this lack is in PP. One participant felt that without allowing the involvement of multiple, important quality viewpoints, converging on system requirements is likely to be problematic: “From a stakeholder win-win perspective, just getting two people to determine the correctness of requirements is very risky, as it excludes success-critical stakeholders from the process” [Boehm 523, agreed by Lanubile, Wiegers, Basili, Rombach.] Other participants agreed but said that this is why PP should never be implemented without more than one pair on the team, collective ownership of code, and the rotation of people through work pairs/groups [Ambler].

 

Learning/sharing knowledge

Another subject that participants concentrated on was the contribution made by the practices to increasing the skills and knowledge of people on the team.

  • Participants felt that both practices are good for mentoring junior people, who can quickly learn what other team members classify as good and bad practices. [Inspections: Denger 165, Lanubile 155, Basili 157; PP: Boehm 152]
  • Participants also felt that both practices help disseminate tacit knowledge among the team members [PP: Boehm 117, Inspection: Basili 133] and therefore support learning [Lanubile 140].
  • There was some disagreement among participants as to whether one or the other practice was better suited for achieving learning. Some felt that PP’s shorter feedback loop made it more effective for learning (“the learning effect is better when discussing decisions when made instead of seeing results,” Rumpe 88) while others strongly disagreed (Wiegers 82: “You can learn how to do better work any time you look over someone else's shoulder, via PP or peer review,” also Basili 77).
  • The only data on the subject was for inspections: “Our industrial data is that, at the individual person level, once the defect-found feedback has worked… the individual can systematically inject two orders of magnitude fewer defects in their daily work” [Gilb 336].

 

Repeatability

The lead discussant, Dieter Rombach, asked which of the practices was more repeatable. That is, assuming that a particular development team performed a similar task again, what is the likelihood that they deliver the same quality in the same timeframe/effort using solo programming and inspections, versus using pair programming?

·        Vic Basili noted that this question investigates the likelihood that there would ever be sufficient predictive capability for the practices to be able to make accurate estimates about cost, quality, schedule, etc., based on past history.

·        Ambler felt that PP is more repeatable, since if you keep many of the team members together, then they will have built a common culture and a way of working that will be much more effective. [Ambler 305].

·        Other participants felt that inspections are more repeatable, and that the use of PP made it harder to build predictive models [Rumpe, Williams, Basili 297, Gilb 304, confirmed by vote 323 on “PP more repeatable than CI:” 1 yes, 6 no, 6 not sure]. One reason is that the effects of PP are harder to quantify [Williams 297]. Another reason might be that data collection is often not performed in Agile projects (see next discussion point). However, no further evidence regarding repeatability was offered.

 

Measurement

Participants discussed which of the two practices was more amenable to measurement which could quantify the effect on the development process.

  • Participants agreed that inspections do leave a clear audit trail describing their results [Krebs 148].
  • Tom Gilb felt that this audit trail for inspections gives a statistical basis for managing the whole software engineering process: “I now believe that Inspection should be used to sample and measure, not to try to clean up.” The measurement of major defects found during inspection should be used to decide on appropriate next steps for the development of the product, and to motivate people to follow the standards used to judge the specification [Gilb 72].

 

Acceptance

One particularly important aspect of a technology is how well the developers accept it; that is, how likely it is that the developers adopt the technology and keep it in place over time.

  • Although nobody knew of any systematic studies, there was anecdotal evidence that developers seem more accepting (i.e. find it easier to keep the practice going on its own merits) of PP than inspections.
    • “I rarely talked to a developer who was keen on doing an inspection. I met many developers who love pair programming. So, why not simply accept these preferences?” [Maurer 95]
    • “When I was in industry, I found that people did not prepare as well as they should prior to an inspection. The inspection seemed more of a technicality -- something that needed to be ‘checked off.’ I'm sure our results wouldn't have matched most of the research studies.” [Williams 76]
    • Pairing may motivate people to tackle more difficult challenges [Krebs 386], as they do not feel that they need to come up with a solution for a difficult problem on their own.

·        As pointed out by Basili, however, both practices can surely be found being done badly in industry [Basili 87] – so basing conclusions on anecdotal evidence is dangerous.

Cost

“Cost” here is mainly a function of developer effort: the number of hours required for developing the system, finding and reworking defects, etc.

·        Outside of defect reduction, most data so far indicate that pair programming is also a way to trade extra cost for reduced schedule, which is often valuable in itself. [Boehm 99, 179]

·        Inspections can be a bottleneck in the development process: “Also, I’ve found Inspections to be too inefficient (defect yield per staff hour) and perhaps even worse, they slow the project’s natural rhythm requiring much staff energy to regain momentum.” [Manzo 103]

 

(Formality of) the process

Process formality refers to the level of specificity at which developer activities are defined; a more formal practice is expected to have more process steps that are described in greater detail, while a less formal practice would have fewer steps and rely more on developers’ own expertise to fill in the gaps.

·        The premeeting consensus was that PP is less formal than inspections, but it would be a mistake to assume that there was no formality to PP. As Bill Krebs said: “There is some formality to pairing in that there is a set 'algorithm' for doing it per Laurie Williams's book. Also, we rotate pairs, iterate, and refactor to address the risk that the first two folks will miss a bug.”

 

Comparing the practices

The above topics raised some important points about the dimensions where it may be useful to think about comparing the practices.

  • Arisholm reminded the participants that a comparison of inspections versus pair programming based on only some of those attributes could easily be seen as biased, because it might not account for the areas of greatest potential benefit of one practice or the other [Arisholm 587].
  • Some suggested comparisons:

o       Boehm suggested that each practice has tradeoffs between cost and schedule.

o       Gilb suggested using inspection as a way to measure the difference between groups using PP and control groups [Gilb]

 

Existing Data

Relatively little data on the use of PP has been published; that is, only a few publications document the benefits of PP regarding the issues discussed above. It seems that most of the data regarding PP is anecdotal [Rumpe 238, Manzo 229].

·         In design [Williams; Arisholm 230, 257; Gilb 233]:

o        “In my initial PP study, I have a break out of use by phase. Use of PP was highest in design. However, I don't have any evidence of design isolated -- just by phase for all of development.” [Williams]

o        We have some evidence on the perception of developers/students that they create better designs using pair programming [Maurer]

o        For other evidence see: Flor & Hutchins: "Analysing Distributed Cognition in Software Teams: A Case Study of Team Programming During Perfective Software Maintenance", proc. 4th workshop on Empirical Studies of Programmers, pp. 36-64, 1991 [Arisholm]

·         General evidence:

o        “We told our small team you must either pair, or use inspections, or justify why you did solo programming. We got 48% of the changes paired, 50% solo, 2% with informal multi-person review. Little unit test. 2x improvement in quality as compared to earlier days of lower pairing frequency.” [Krebs 199]

o        “The first PP experiment I did showed statistically significant higher quality resulting from PP than from solo desk checking (not inspections).” [Williams 209]

·         Concerning the use of inspections most of the participants agreed that the benefits (the value) of inspections are well documented in a set of publications. In addition, some participants mentioned projects where they documented the benefits of inspections.

o        “One of my clients did requirements inspections for 5 years and measured a sustained ROI of 10:1.” [Wiegers]

o        “A good paper in latest issue of Software Quality Professional showing inspections reducing defect leakage to customers from 10.6/KLOC to 0.9/KLOC.” [Wiegers 203]

o        “…in rough terms the effectiveness of Inspections (if properly done at proper rates of checking like one page/hour) are about 75-90% effective for requirements and design but are in the 15-60% range for code. I have seen higher numbers from Capers Jones than 60% but I am not sure I trust them. (People don't seem to be good at understanding remaining defects downstream).” [Gilb 472]

o        “A recent Banking client using requirements inspections has about 88 majors/page before motivation and measurement and after a few months is at about 11 majors/page, and we expect this to drop to less than 1 major/page, within months.” [Gilb 341]

o        “Our results for inspections at TRW were about the same: in the 60% effectiveness range for both design and code.” [Boehm 482]
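
The before/after figures quoted above (defect leakage falling from 10.6/KLOC to 0.9/KLOC, and majors per page falling from 88 to 11) are easier to compare as relative reductions. The following is a minimal sketch of that arithmetic only; the function name `relative_reduction` is hypothetical and is not something the participants used.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Figures quoted in the workshop discussion.
leakage_drop = relative_reduction(10.6, 0.9)  # defects/KLOC reaching customers [Wiegers 203]
majors_drop = relative_reduction(88, 11)      # major defects per page [Gilb 341]

print(f"Defect leakage reduced by {leakage_drop:.1%}")  # about 91.5%
print(f"Majors per page reduced by {majors_drop:.1%}")  # 87.5%
```

Note that the two figures measure different things (defects escaping to customers versus defects found per inspected page), so the percentages are not directly comparable with each other.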

Another issue that was briefly discussed in the eWorkshop was the use of one practice in the homeground of the other. The participants discussed the circumstances and conditions under which it might be valuable to apply inspections in an agile project in addition to PP, and to apply PP in a CMMI context instead of code inspections.

 

PP/Inspection “homegrounds”

The participants agreed that in some cases it is valuable to perform extra inspections after PP. However, more research is needed to define a process that gives explicit guidance on the circumstances under which it is valuable to do so.

 

  • PP usually resides in a process with a rather high level of evolution (refactoring, rework). Only when a module becomes stable, can we do inspection. [Rumpe, 378, Rombach+, Ambler+, Wiegers+, Krebs+]
  • I see this as an economic issue. If the risk exposure due to unfound defects is high, doing the inspection is worthwhile. If the risk exposure is low, then it's not worthwhile. [Boehm 385]

-        Ray Madachy's calibrated system dynamics model indicated that the net payoff for inspections went negative as the defect density for the inspected artifact decreased [Boehm 607]

  • I would like to see a process, where the team (leader) is able to decide rather dynamically, where to use extra inspections, based on complexity, criticality, degree of innovation. [Rumpe 412]
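
The economic argument above [Boehm 385, 607] can be illustrated with a small sketch. It assumes a deliberately simple payoff model (expected defects caught times the cost of an escaped defect, minus the cost of the inspection); this is not Madachy's system dynamics model, and the function name and every number below are hypothetical, chosen only to show the payoff turning negative as defect density falls.

```python
def inspection_net_payoff(defect_density: float,
                          size_kloc: float,
                          effectiveness: float,
                          cost_per_escaped_defect: float,
                          inspection_cost: float) -> float:
    """Expected savings from defects caught by an inspection, minus its cost.

    All inputs are illustrative assumptions:
      defect_density           defects per KLOC remaining after PP
      effectiveness            fraction of remaining defects the inspection finds
      cost_per_escaped_defect  staff hours lost per defect that slips through
      inspection_cost          staff hours spent on the inspection
    """
    expected_defects = defect_density * size_kloc
    defects_caught = expected_defects * effectiveness
    return defects_caught * cost_per_escaped_defect - inspection_cost


# Hypothetical module: 5 KLOC, 60% inspection effectiveness,
# 10 staff hours per escaped defect, 40 staff hours per inspection.
for density in (10.0, 2.0, 0.5):
    payoff = inspection_net_payoff(density, 5, 0.6, 10, 40)
    verdict = "worthwhile" if payoff > 0 else "not worthwhile"
    print(f"{density:>4} defects/KLOC: net payoff {payoff:+.0f} hours ({verdict})")
```

Under these assumed numbers the extra inspection pays off at high defect densities and stops paying off once the artifact is already fairly clean, which is the kind of dynamic decision the participants suggested a team leader might make case by case.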

 

PP in a CMM(I) environment

The participants also discussed the potential application of PP in a CMM or CMMI context. Although there were no specific experiences discussed, most comments seemed to indicate that there was no reason the two approaches would not be compatible:

  • There isn’t any overt connection between PP and plan-driven or CMM approaches. PP is simply a "good practice" that could and should be selectively applied in an appropriate environment after some thought. [Wiegers 528]
  • CMMI generalized the Peer Reviews process area to "Verification," (CMMI level 3) so theoretically PP can help address this key area [Barry Boehm 584]
  • Pairing improves an organization’s learning culture and may thus be an interesting element of an optimizing organization (CMMI level 5) [Rumpe 591]
  • We have a theory-based study that says XP meets the CMM’s level 2 requirements. But that one is rather conservative; Paul himself is more optimistic [Rumpe 602]
  • Applying PP does not mean that process-driven practices like inspections wouldn’t also be useful: “The point here is I think that if people pair 100% pairing is at its limit. The resulting defect rate cannot be reduced through further pairing (alone) -- but through additional inspections, as they are done post-construction.” [Rumpe 582]

 

Topics for future eWorkshops or Studies

One main result of the eWorkshop is the identification of potential research fields where a more detailed analysis of the practices should be performed. Based on the discussion during the eWorkshop and the pre-meeting feedback, the following ideas were generated:
 

·        When comparing the techniques, one should also consider the complexity of the module being developed or inspected. If there is any truth in existing results from group dynamics, complex tasks are best performed solo (i.e., inspections) whereas simpler tasks are performed efficiently in groups (i.e., PP) [Arisholm 127].

·        In future eWorkshops we should consider feasible strategies for how the two practices can be usefully combined.

o       Start with pairing, then formally inspect key artifacts [Krebs 100]

o       Use PP in general and inspection in critical cases (complex modules, high quality necessary etc.) [Rumpe 102]

o       PP is better for "tactical" quality improvement (how do I most efficiently use Eclipse to debug this particular NPE), while inspection is better for "strategic" quality improvement (look for concurrency/synchronization errors in this package). [Johnson 169]

§         Hypothesis Rumpe (Pre-Meeting Feedback): With PP you can reach a certain level of quality more efficiently, but you cannot go beyond that level. With inspections it’s more tedious to reach that level, but possible to go beyond. (You cannot add a third person to PP, but you can have more inspections with regard to more viewpoints.) This might mean it’s interesting to combine both techniques: PP for quality level one and beyond that inspections (when necessary).

·        Have an experiment on the effect of swapping pairs [Williams 189, Ambler 174]. The concrete research question still needs to be defined. One interesting question raised by the eWorkshop discussion is whether swapping pairs can replace a third person’s perspective.

·        Future eWorkshops or studies should try to refine the defect types that can be more easily addressed with PP versus those that are better suited to inspections [Boehm 242, Denger 247, Wiegers 249, Williams 262].

o       Hypothesis Arisholm (Pre-Meeting Feedback): I suspect that the two alternative techniques may be useful for detecting different kinds of defects. For example, formal inspections might detect defects caused by integration issues better than pair programming. Clearly, such claims need to be investigated empirically.

·        We need more information on the effects of pairs on other development activities besides programming [Ambler 285]. Do we have evidence that pairing is useful for requirements and design as well? (Vote results [499] show no consensus among the participants; most of them were not sure.)

o       Hypothesis by Wiegers (Pre-Meeting Feedback): PP works well for developing code but not for developing other types of software work products, which also need to be reviewed.

 



[1] Kent Beck in Extreme Programming Explained – Embrace Change: “If code reviews are good, we’ll review code all the time (pair programming).”

[2] “Proving with code” is one of the core practices of Agile Modeling. It says “Prove your models with code to see if they work”, for example, if you produced a UML-sequence diagram you should write the corresponding code, test it, and show the results to the customer to get feedback before you continue modeling. Ambler has said: “Proving it with code is a critical practice that supports evolutionary development because it provides the link from modeling to implementation, once again helping you to break out of a BDUF mindset.”