In any field, certain controversies have a tendency to re-emerge periodically. Sometimes they cause us to revisit issues in ways that stimulate interesting and challenging discussions. Sometimes they direct our attention to simple but overlooked choices that make a critical difference. Other times these controversies are more like zombies; they come back from the dead to devour the brains of the living by misdirecting the otherwise intelligent to irrelevant and trivial questions. A recent controversy to re-emerge in the field of user-centered design is the question of how many users we need to study in order to detect an appropriate number of actual usability problems. This has been a classic controversy, with many valuable contributions to our understanding of how to generalize from a small lab study to a larger population of users [5,10]. However, sometimes it has generated more heat than light.
The re-emergence of the "how many users" debate offers us an opportunity to look at our field more deeply than we have before. When we examine the premise of this debate, it becomes clear that this controversy raises questions much broader than simply how many participants we need to test. It is really an opportunity for us to consider what the goals of our methods are and how we should determine the quality of our methods.
The question of the effectiveness of methods has been with us for a while. The seminal paper by Jeffries, Miller, Wharton, and Uyeda (1991) compared four methods: heuristic evaluation, cognitive walkthroughs, use of guidelines, and user testing. The criteria used to evaluate the methods were (1) their ability to detect interface problems, (2) who could use them, and (3) how much effort was involved in using them.
In parallel, a literature on discount methods was emerging, which suggested that practitioners run a small number of users in quickly executed studies. In this literature, the primary criterion for evaluating methods and choosing a sample size was efficiency. Efficiency in turn was defined as number of problems detected as a function of effort (number of participants used). A series of statistical papers and practical studies showed that relatively few participants could detect most problems that would be found when the product was deployed, if the problems were relatively common.
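The statistical argument in this literature rests on a simple discovery model: if each participant independently detects a given problem with probability p, then n participants detect it with probability 1 - (1 - p)^n. A minimal sketch of that model (the detection rate p = 0.31 is the often-cited average from this literature, used here purely for illustration):

```python
def proportion_found(p: float, n: int) -> float:
    """Expected proportion of problems detected, assuming each participant
    independently finds a given problem with probability p."""
    return 1.0 - (1.0 - p) ** n

# For a "relatively common" problem with p = 0.31, a handful of
# participants already finds most instances: five participants
# detect roughly 84% of such problems.
for n in (1, 3, 5, 10):
    print(n, round(proportion_found(0.31, n), 2))
```

The model also shows why rare problems (small p) resist this logic: driving detection toward certainty requires many more participants, which is part of why the sample-size debate keeps recurring.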
The discussion of the evaluation of methods took a dramatic turn with a provocative paper by Gray and Salzman. The paper used a classical experimental design perspective to generate a trenchant critique of the work that had been done to date. The article was accompanied by a number of scathing rejoinders from prominent researchers and practitioners.
This literature has come to dominate almost all discussion of the evaluation of methods in our field. Indeed, it often seems that we are preoccupied with the question of how many users it takes to test a design. Although certainly many controversies exist, all of this literature embodies a shared set of premises:
- Number of problems detected is the most appropriate criterion for evaluating a method.
- Methods can be evaluated in relative isolation from the practical goals of the method and the context in which the method is used.
- A quasi-scientific framework is the most effective approach to resolve disputes about the best method.
All three of these premises render most of this literature irrelevant to applied usability work, by which I mean the application of usability work to the development of products in real commercial enterprises.
Applied usability work is just like any other part of product development: its practices derive not from the scientific method, but from engineering. In essence, engineering is simply a disciplined effort to build a system to meet a set of defined goals or criteria, while optimizing trade-offs in the most efficient way possible, within a given context and given constraints. In practice, the goal is to produce, in the quickest time, a successful product that meets specifications with the fewest resources, while minimizing risk. In the real world of product development, schedule and resource issues often understandably dominate all discussions of design and trade-offs, both because of how crucial these issues are, and because other product goals are less clear or perceived to be more speculative.
From this applied perspective, the three shared premises identified previously severely limit the usefulness of the entire body of literature about methods and sample size. First, problem detection is simply the first step toward improving the product. It may be necessary, but it is not sufficient for product improvement or for method evaluation.
Second, isolating a method from the broader context in which it is used renders any purported evaluation of little practical significance, because it eliminates important considerations such as team buy-in, resources available, relative ease of making a change, and numerous other practical considerations. A method that detected 100 percent of problems with one user would still fail if it did not work within the development process of an organization.
Finally, a scientific approach to validating methods is inconsistent with the underlying philosophy and context of product development, which takes an engineering approach. It is not surprising that the challenge laid down by Gray and Salzman has not been followed up in any applied setting. It is not simply that corporations cannot evaluate their designs by following rigorous experimental procedure. In vitro experimental comparisons of methods do not capture crucial aspects of the in vivo context of product development, such as the credibility and timeliness of usability data and the degree of integration of usability with the overall development process and dynamics, to mention just a few factors. Therefore, the methodological prescriptions of this literature are likely to be inapplicable to the applied context.
Consider the debate about how many participants are needed. It focuses exclusively on problem detection, that is, how likely it is that we will find a given subset of the problems. In a practical world, applying such a criterion to evaluating methods is at best shortsighted and at worst destructive.
It is shortsighted in that it ignores that problems should be fixed, not just found. If we considered instead a more relevant criterion, namely, how much we can improve the product in the shortest time with the least effort, we probably would not have asked this question, or we would have asked it in a different form, such as, "What is the best way of deploying the usability resources we have available for this development cycle in order to maximize our beneficial impact on the product?"
In addition, it is destructive because it implies an approach to applied usability that requires the repeated testing of an identical system (so we can continue to uncover problems) even though easily fixable problems may have been uncovered in the first few sessions. On its face such an approach and the thinking it reflects are wasteful and damaging to the practice of usability work in corporations.
In the world of usability work on real products embedded in a corporate and business framework, we must focus on factors of success, such as how effectively the method introduces usability improvements into the product. This depends on things like whether the method, in its very practice, encourages participation, buy-in, and collaboration by the development team. In practice, most methods that have real impact do this. It's no accident that most usability testing involves encouraging entire design teams to watch the test, and it is well known that much of the effectiveness of the test comes from this active participation. Yet, this important element in making methods effective is by and large ignored in the formal literature, which treats usability studies as if they were experiments, when in reality they are more like organizational interventions.
In my opinion, a case study approach is both the only practical way to produce a body of knowledge for applied usability and the most effective one. First, products get made all the time, and much usability work is applied to them. It makes sense to learn all we can from our own practice. Second, the development of real products is the only context sufficiently rich to produce the kind of nuanced examples needed to develop a differentiated and contextualized understanding of the methods and techniques practitioners require. An accumulation of case studies might even permit meta-analyses that would help in suggesting patterns that generalize across cases. For example, we might be able to extract principles that determine the relative impacts of various design elements and evaluation techniques on a product's success.
To summarize, first, developing a sophisticated understanding of methodology for applied usability will require us to adopt business- and engineering-relevant criteria, such as how much we were able to improve the product and how the product fared in the marketplace. Second, we need to evaluate methods in vivo, that is, by applying them to real products embedded in real engineering, corporate, and political environments and not on simulated systems or hypothetical models. Third, we need to adopt a case study rather than an experimental approach.
I would like to provide two case examples, not to demonstrate the full-fledged "case study" approach I advocate, but simply to show the inapplicability of the research literature on evaluating methodology to the real context of product development. The first is the application of the rapid iterative testing and evaluation (RITE) method to Microsoft's highly successful game Age of Empires II. The second has to do with discovering usability problems in the field while working on a new version of MSN.
Age of Empires II
The business goal of increasing the market segment for Age of Empires II required us to increase its appeal to users who had never played such a game before. This goal dictated that an engaging and effective tutorial be created so that new users could both learn the basic tactical elements of the game and experience the pleasure of building a simulated civilization and winning battles. The design team believed that an integral part of achieving this goal was to create a tutorial that could be quickly and easily modified. This combination of a modifiable architecture and limited time gave birth to the process of rapid iterative testing and evaluation, or RITE. Overall the process we used was straightforward and effective. The behavioral and cognitive objectives for the tutorial were defined, an early prototype was developed, and testing began. In each iteration, we tested a few users (one to three) with the entire team watching. As a team, everyone assessed the test in terms of the following questions:
- What problems did we see?
- Can we explain them?
- Can we fix them before the next test?
If the team could not come to consensus on what the problem was, how to explain it, or what the fix should be, then the problem stayed in the system for the next test. Those that we felt we could understand and that could be rapidly repaired were acted on immediately.
As the team continues using RITE, fixes are produced quickly, because the entire team is involved in analysis and interpretation. Using the method also gives the team a chance to verify whether a fix worked and to iterate on it again if it did not. It improves user performance quickly, and ironically that allows for the discovery of more problems, since users can go further into the system. Practically, you can stop at any time with a high level of certainty that the system has improved. Or you can continue the process until a defined goal is met. In this case, our approach produced a highly effective tutorial that was praised by reviewers and users, and the resulting game won several awards and achieved significant commercial success.
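The triage rule at the heart of the RITE loop can be sketched in a few lines. This is an illustrative reconstruction, not code from the published RITE work; the class and field names are my own assumptions:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """A problem observed in a RITE session, with the team's consensus."""
    description: str
    consensus_on_cause: bool   # can the team explain the problem?
    fixable_before_next: bool  # can it be repaired before the next test?

def triage(observations):
    """Split observations into fixes to make immediately and problems
    left in the system for the next iteration."""
    fix_now, retest = [], []
    for obs in observations:
        if obs.consensus_on_cause and obs.fixable_before_next:
            fix_now.append(obs.description)
        else:
            retest.append(obs.description)
    return fix_now, retest

fixes, keep = triage([
    Observation("label unclear", True, True),
    Observation("users skip step 3", False, False),
])
```

In this sketch, "label unclear" is repaired immediately, while "users skip step 3" stays in the system until the team can explain it, mirroring the consensus rule described above.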
MSN 8.5
A second example comes from our work on MSN 8.5. In this case, we carried out opportunistic usability evaluations during a program of ethnographic field visits. "Opportunistic" means that we observed and analyzed difficulties that users encountered spontaneously as they demonstrated their usage of the system to us. Task scenarios were not standardized, because users were doing whatever tasks mattered to them, in the context of their own systems, with their own file structures (a virtue of this approach compared with laboratory testing). In this context, it becomes almost irrelevant to ask "how many users" are involved in each test. Nevertheless, these observations had a powerful impact on team observers, and as a result, problems that had not emerged in laboratory testing were detected.
The key point I want to make about both of these cases is that the approach was necessitated not by purely methodological considerations, but by organizational considerations and the need to integrate with other development activities. Also, the factors that determined the success of both cases in positively influencing the development of their respective products, factors that must be considered in order to evaluate methods for use in the real world, are completely ignored by the formal literature on the comparison of methods.
As I stated, these case examples are not representative of the types of case studies I am advocating, which would have to provide a much deeper analysis of the case. Something that comes closer is the more extensive documentation of the Age of Empires II work that is available at www.microsoft.com/playtest/publications.htm. It includes discussion of things that did not work well and what we would do differently next time. A deeper discussion such as that found on the Web site is a better example of the fullness of report that would be required for the case study literature I am advocating. Other effective case examples are available in the literature. (See, for example, [3, 6, 7, 8, 9].)
The literature evaluating usability methods is fundamentally flawed by its lack of relevance to applied usability work. These failures are particularly ironic because much of this literature takes a critical perspective on applied usability methods. If our discipline is serious about public discussion of usability methods as they are applied in industry, we will move beyond these lines of inquiry and take a broad-based case study approach, examining outcomes that are relevant to both practice and business. Our relevance as a discipline and our career success as practitioners depend on such a change.
4. Jeffries, R., Miller, J.R., Wharton, C., and Uyeda, K.M. User interface evaluation in the real world: A comparison of four techniques. Proceedings of ACM CHI '91 Conference on Human Factors in Computing Systems (New York, 1991), pp. 119-124.
9. Spencer, R. The streamlined cognitive walkthrough method: Working around social constraints encountered in a software development company. Proceedings of ACM CHI 2000 Conference on Human Factors in Computing Systems (New York, 2000), pp. 353-359.
11. Whiteside, J., Bennett, J., and Holtzblatt, K. Usability engineering: Our experience and evolution. In M. Helander (ed.), Handbook of Human-Computer Interaction. North-Holland, New York, 1988, pp. 791-817.
Susan Dray & David A. Siegel
Dray & Associates, Inc.
2007 Kenwood Parkway
Minneapolis, MN 55405, USA
612-377-1980 fax: 617-377-0363
©2003 ACM 1072-5220/03/0700 $5.00