Waits & measures

XIII.6 November + December 2006
Page: 29

Sample sizes for usability tests


Authors:
James Lewis

Why do we keep talking about appropriate sample sizes for usability tests?

Perhaps the most important factor is the economics of usability testing. For many practitioners, usability tests are fairly expensive events, with much of the expense in the variable cost of the number of participants observed (which includes cost of participants, cost of observers, cost of lab, and limited time to obtain data to provide to developers in a timely fashion). Excessive sampling is always wasteful of resources [9], but when the cost of an additional sample (in usability testing, an additional participant) is high, it is very important that the benefit of additional sampling outweighs the cost.

Another factor is the wide range of test and evaluation situations that fall under the umbrella of usability testing. Usability testing includes three key components: representative participants, representative tasks, and representative environments, with participants’ activities monitored by one or more observers [2]. Within this framework, however, usability tests have wide variation in method and motivation. They can be formal or informal, think-aloud or not, use low-fidelity prototypes or working systems. They can have a primary focus on task-level measurements (summative testing) or problem discovery (formative testing). This latter distinction is very important, as it determines the appropriate general approach to sample-size estimation for usability tests.

When the focus is on task-level measurements, sample-size estimation is relatively straightforward, using mainstream statistical techniques that have been available since the early 20th century (in some cases, even earlier). Basically, you need an estimate of the variance of the dependent measure(s) of interest (typically obtained from previous, similar studies or pilot data) and an idea of how precise the measurement must be (which is a function of the magnitude of the desired minimum critical difference and statistical confidence level); once you have that, the rest is arithmetic. There are numerous sources for information on standard sample-size estimation [6, 23]. For this reason, I’m not going to describe them in any additional detail here (but for a detailed discussion of this type of sample size estimation in the context of usability testing, see Lewis [14]). The less-well-understood problem is sample-size estimation for problem-discovery (formative) testing.

A Little History. I first encountered this problem when I started working for IBM in 1981, fresh from graduate school. The IBM practice at that time, based on papers published by Alphonse Chapanis and colleagues [1, 5], was to observe about five to six participants per iteration for problem discovery. Chapanis had asserted that after you’d observed six participants, you would have seen about all of the problems you were going to see. Based on graduate statistics classes I’d had with James Bradley [3, 4], I thought that there must be a way to more precisely estimate sample sizes for these types of tests. Specifically, it seemed like you should be able to use the binomial probability formula for this purpose, and I mentioned this briefly in my first publication [10]:

The binomial probability theorem can be used to determine the probability that a problem of probability p will occur r times during a study with n subjects. For example, if an instruction will be confusing to 50 percent of the user population, the probability that one subject will be confused is 0.5. If two subjects are observed, then the probability that either one or both subjects will be confused is 0.75; and if three subjects are observed, the probability that at least one of them will be confused is 0.875.

 

I didn’t mention the now-famous formula $1-(1-p)^n$ in that paper, but that’s the formula I used for the computations. Bradley taught his students that this was a very useful formula for many situations, derived from the binomial probability formula as P(At least once) = 1 - P(0) (in other words, the probability of something happening at least once is 1 minus the probability of its not happening at all). When r = 0 in the binomial probability formula, P(0) is $(1-p)^n$, so P(At least once) is $1-(1-p)^n$.
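The figures in the quoted passage are easy to check numerically. The following is a minimal sketch in Python; the function name `p_at_least_once` is an illustrative choice and does not come from the 1982 paper.

```python
def p_at_least_once(p: float, n: int) -> float:
    """Probability that a problem with per-participant probability p
    shows up at least once among n participants: 1 - (1 - p)^n."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# The confusing-instruction example: p = 0.5 with one, two, and three subjects.
for n in (1, 2, 3):
    print(n, p_at_least_once(0.5, n))  # 0.5, then 0.75, then 0.875
```

The three printed values match the 0.5, 0.75, and 0.875 given in the quotation.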

The years 1990 through 1994 saw a series of publications investigating the use of the formula to model usability problem discovery, including empirical verification of its accuracy for problem discovery studies, in which sample size refers to the number of participants, and heuristic evaluations, in which sample size refers to the number of independent observers [21, 22, 25, 15, 12]. These studies provided quite a bit of evidence that $1-(1-p)^n$ is a good model of problem discovery. For problem-discovery tests, this literature contains several large-sample examples that showed p ranging from 0.16 to 0.42 [12]. For several large-sample heuristic evaluations, the reported value of p ranged from 0.22 to 0.60 [16].

So, what does $1-(1-p)^n$ suggest about usability-problem discovery? Note that there are only two variables: p and n. The most direct interpretation of this is that many other variables that we might assume would affect problem discovery (such as the cost of fixing a problem or the severity of the problem from the user’s perspective) don’t. For example, Virzi [22] reported earlier discovery of more-serious problems, but I failed to replicate that finding [12]. Also, a return-on-investment (ROI) model in the same paper showed that as the magnitude of the savings associated with early discovery versus late discovery increased, the ROI of a usability study also increased, but this factor had no appreciable effect on the sample size at maximum ROI [12].

An additional outcome of the ROI study was that the appropriate problem discovery goal depended on the value of p. The model indicated that if the expected value of p was small (say, around 0.10), practitioners should plan to discover about 86 percent of the problems. If the expected value of p was larger (say, around 0.25 or 0.50), practitioners should plan to discover about 98 percent of the problems. For expected values of p between 0.10 and 0.25, practitioners should interpolate between 86 and 98 percent to determine an appropriate goal for the percentage of problems to discover. The analysis did not address values of p smaller than 0.10, but, presumably, the appropriate goal would be something less than 86 percent.

If you know or can estimate the expected value of p for a study and know the desired problem discovery goal, you can compute n with the following formula (derived algebraically from Goal = $1-(1-p)^n$, solving for n):

$$n = \frac{\log(1 - Goal)}{\log(1 - p)}$$
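Solving Goal = 1 - (1 - p)^n for n gives n = log(1 - Goal) / log(1 - p). A minimal Python sketch of that calculation follows; the function name `sample_size` is hypothetical, and the inputs (p = 0.5, goal of 90 percent) are the values used in the article's own example.

```python
import math


def sample_size(p: float, goal: float) -> float:
    """Unrounded n that solves goal = 1 - (1 - p)^n."""
    if not 0.0 < p < 1.0 or not 0.0 < goal < 1.0:
        raise ValueError("p and goal must lie strictly between 0 and 1")
    return math.log(1.0 - goal) / math.log(1.0 - p)


n_raw = sample_size(p=0.5, goal=0.90)
# Round up, because a fraction of a participant cannot be observed.
print(round(n_raw, 2), math.ceil(n_raw))  # 3.32 4
```

The base of the logarithm does not matter, because it cancels in the ratio; natural logs and base-10 logs give the same n.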

But getting an estimate of p can be tricky if you’re working with small samples. For many years, I’d assumed that small-sample estimates of p would behave like small-sample estimates of the arithmetic mean—that they would have more variability than large-sample estimates, but would be unbiased (tending to have the same value as large-sample estimates in the long run). In 2001 I found out that this assumption was completely wrong. I was editing a special issue of the International Journal of Human-Computer Interaction on Usability Evaluation (Vol. 13, No. 4), and received a manuscript from Morten Hertzum and Niels Jacobsen in which they proved that small-sample estimates of p were necessarily biased to be higher than the actual population problem discovery rate [7]!

In response to this, I investigated a number of methods for adjusting problem-discovery rates estimated from small samples [13]. The best method for compensating for the bias was to average two methods—one method based on Good-Turing discounting and a normalization method based on the work of Hertzum and Jacobsen. The resulting adjustment looks complicated, but it won’t seem quite so bad after going through a worked-out example (in the next section):

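$$p_{adj} = \frac{1}{2}\left[\left(p_{est} - \frac{1}{n}\right)\left(1 - \frac{1}{n}\right)\right] + \frac{1}{2}\left[\frac{p_{est}}{1 + GT_{adj}}\right]$$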

$GT_{adj}$ is the Good-Turing adjustment to probability space to account for unseen events (which is the proportion of the number of problems that occurred once divided by the total number of different problems). The $p_{est}/(1 + GT_{adj})$ component in the equation produces the Good-Turing-adjusted estimate of p by dividing the observed, unadjusted estimate of p ($p_{est}$) by the Good-Turing adjustment to probability space. The $(p_{est} - 1/n)(1 - 1/n)$ component in the equation produces the normalized estimate of p from the observed, unadjusted estimate of p and n (the sample size used to estimate p). The adjustment uses the average of these two estimates, because the Good-Turing estimator tends to overestimate the true value of p, but normalization tends to underestimate it [13]. Note that the Good-Turing adjustment is a function of the number of infrequently occurring problems, whereas normalization is a function of the estimate’s sample size. The Monte Carlo experiments of Lewis [13] demonstrated that this adjustment works very well, even with initial sample sizes as small as two to four participants.
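The two components described above, and their average, can be sketched in a few lines of Python. This is an illustration of the description in the text, not code from Lewis [13]; the function and argument names are hypothetical, and the inputs are the four-participant counts from the article's hypothetical example.

```python
def adjusted_p(p_est: float, n: int, n_once: int, n_problems: int) -> dict:
    """Average of the Good-Turing and normalization adjustments.

    p_est      -- observed (unadjusted) problem-discovery rate
    n          -- number of participants behind p_est
    n_once     -- number of problems seen exactly once
    n_problems -- number of different problems seen
    """
    if n < 1 or n_problems < 1:
        raise ValueError("need at least one participant and one problem")
    gt_adj = n_once / n_problems                       # Good-Turing adjustment
    good_turing = p_est / (1.0 + gt_adj)               # tends to overestimate p
    normalized = (p_est - 1.0 / n) * (1.0 - 1.0 / n)   # tends to underestimate p
    return {
        "good_turing": good_turing,
        "normalized": normalized,
        "adjusted": (good_turing + normalized) / 2.0,
    }


# Four participants, p_est = 0.5, three problems seen, one of them only once.
result = adjusted_p(p_est=0.5, n=4, n_once=1, n_problems=3)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Returning the two components alongside their average makes it easy to see how far apart the estimators are for a given data set.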

A Hypothetical Example. The best way to work with these formulas is to create a participant-by-problem matrix, as shown in Table 1.

One of several ways to compute p is to divide the number of problem occurrences by the number of participants times the number of problems. After running eight participants, the estimate of p is 0.375 (12/(8*4)). But what did things look like after having run the first four? At that time there was no evidence that Problem 2 existed, so the estimate of p was 6/(3*4), or 0.500 (an example of the bias described by Hertzum and Jacobsen [7]). Furthermore, suppose you had established a goal of 90 percent problem discovery.
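The calculation can be sketched directly from a participant-by-problem matrix. The matrix below is hypothetical: it was chosen only to agree with the counts quoted in the text (12 occurrences across eight participants and four problems, with Problem 2 unseen among the first four participants) and is not a copy of Table 1. The function name `summarize` is likewise an illustrative choice.

```python
# Rows are participants, columns are problems 1-4; 1 = problem observed.
# Hypothetical data chosen to match the counts quoted in the text,
# NOT a copy of the article's Table 1.
MATRIX = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]


def summarize(rows):
    """Return (p_est, problems seen, problems seen exactly once)."""
    per_problem = [sum(col) for col in zip(*rows)]
    seen = [c for c in per_problem if c > 0]  # undiscovered problems are invisible
    p_est = sum(seen) / (len(rows) * len(seen))
    return p_est, len(seen), sum(1 for c in seen if c == 1)


print(summarize(MATRIX))      # (0.375, 4, 2)
print(summarize(MATRIX[:4]))  # (0.5, 3, 1)
```

Dropping the undiscovered column from the denominator is what inflates the four-participant estimate: a problem that has not yet been seen cannot be counted.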

If you were to estimate the sample-size requirement using the unadjusted value of p, you’d get n = log(1-.90)/log(1-0.5) = log(0.1)/log(.5) = (-1)/(-0.3) = 3.3, which rounds up to 4.

How much would this change using the adjusted value of p? First, let’s do the Good-Turing adjustment. We need to know the total number of discovered problems (three after having observed four participants), and how many of those had occurred just once (one). For this example, the adjustment is 0.5/(1 + 1/3), which equals 0.375. Next is the normalization procedure, which is (0.5 - 1/4)(1 - 1/4) = 0.188. The average of these two adjustments is 0.28—almost half the unadjusted value. The correspondingly adjusted estimate of n is log(1-0.90)/log(1-0.28) = log(0.1)/log(0.72) = (-1)/(-0.143) = 7—almost double the original estimate (but still not terribly large).

As an exercise for the reader, what are the adjusted values for p and n if you use the data from all eight participants in Table 1? If you don’t want to drag out the calculator with the log functions, try the sample-size calculator at the Measuring Usability Web site (http://www.measuringusability.com/samplesize/problem_discovery.php) [18].

The "Eight Is Not Enough" Example. In 2001, Spool and Schroeder [19] published the results of a large-scale usability evaluation in which they concluded that five users were "nowhere near enough" to find all (or even 85 percent) of the usability problems in the Web sites they were studying. Perfetti and Landesman [17], discussing related research, stated:

When we tested the site with 18 users, we identified 247 total obstacles-to-purchase. Contrary to our expectations, we saw new usability problems throughout the testing sessions. In fact, we saw more than five new obstacles for each user we tested. Equally important, we found many serious problems for the first time with some of our later users. What was even more surprising to us was that repeat usability problems did not increase as testing progressed. These findings clearly undermine the belief that five users will be enough to catch nearly 85 percent of the usability problems on a Web site. In our tests, we found only 35 percent of all usability problems after the first five users. We estimated over 600 total problems on this particular online music site. Based on this estimate, it would have taken us 90 tests to discover them all!

 

The information provided in this paragraph shows that the value of p in this study was very small. If there were 600 usability problems available for discovery given the study’s method, then 247 problems are 41 percent of the total available for discovery. Taking $1-(1-p)^{18} = 0.41$ and solving for p gives p = 0.029.

Given p = 0.029, the percentage of discovery expected when n = 5 is 13.7 percent. In accordance with the data reported by Perfetti and Landesman, 13.7 percent of 600 is 82 problems, which is about 35 percent of the total number of problems they discovered with 18 participants (35 percent of 247 is 86).
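Both steps, backing p out of the reported totals and then projecting discovery for five participants, can be checked with a short Python sketch. The function names are hypothetical; the inputs (247 problems found, an estimated 600 in total, 18 participants) come from the quoted passage.

```python
def solve_p(discovered_fraction: float, n: int) -> float:
    """Invert 1 - (1 - p)^n = discovered_fraction to recover p."""
    return 1.0 - (1.0 - discovered_fraction) ** (1.0 / n)


def expected_discovery(p: float, n: int) -> float:
    """Expected proportion of problems found after n participants."""
    return 1.0 - (1.0 - p) ** n


p = solve_p(247 / 600, n=18)                    # 247 of an estimated 600 problems
print(round(p, 3))                              # 0.029
print(round(expected_discovery(0.029, 5), 3))   # 0.137
```

Multiplying the second result by 600 gives roughly 82 problems, the figure used in the comparison above.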

For the conditions present in their study, it is not surprising that they continued to see more than five new problems with each participant. In fact, you wouldn’t expect the number of new problems per participant to fall below five until around the 45th participant. This is what you’d generally expect with a low problem discovery rate and a large number of problems available for discovery.
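One way to see where that crossover falls is to take the model at face value. If all 600 problems share the same p (an assumption of the simple model, and of this sketch), the expected number of problems first seen with participant k is 600 × p × (1 - p)^(k-1). The function names below are hypothetical.

```python
def expected_new_problems(total: int, p: float, k: int) -> float:
    """Expected number of problems first seen with participant k (1-based),
    assuming every problem shares the same discovery probability p."""
    return total * p * (1.0 - p) ** (k - 1)


def first_participant_below(total: int, p: float, threshold: float) -> int:
    """Smallest k whose expected count of new problems is under threshold."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    k = 1
    while expected_new_problems(total, p, k) >= threshold:
        k += 1
    return k


k = first_participant_below(total=600, p=0.029, threshold=5.0)
print(k, round(expected_new_problems(600, 0.029, k), 2))
```

With these inputs the crossover lands in the mid-40s, in line with the estimate in the text; the first participant is expected to reveal about 17 problems.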

Their discovery of serious problems with later users is consistent with Lewis [12], which failed to replicate the early discovery of serious problems reported by Virzi [22].

The low incidence of repeat usability problems is also consistent with low values of p. A high incidence of repeat usability problems is more likely with evaluations of early designs than evaluations of more mature designs. Usability testing of designs that have already had common usability problems removed is likely to uncover problems that are relatively idiosyncratic, which seems to have been the case with this study. Also, as the authors report, the tasks given to participants were somewhat unstructured, which could have expanded the space of problems available for discovery.

Their primary conclusion, that five or eight users aren’t enough to discover 85 percent of the problems available for discovery when p = 0.029, is well founded. On the other hand, even with this extremely low value of p, the expected percentage discovered with eight participants is about 21 percent, which is certainly better than not running any participants at all. When p is this small, if the goal is to discover 85 percent of the problems available for discovery, then the required sample size is 65. If the goal is to discover 99 percent ("all") of the 600 problems, then the required sample size is 157.

What we don’t know from this study is how likely it is to have such a low value of p. The authors surmised that this might be a characteristic of usability studies of Web sites, but it could also be a function of the testing method or the level of description of usability problems. Regardless, this example illustrates the importance of computing an early estimate of p and making an explicit decision about the desired percentage of problem discovery as integral steps for rationally determining the required sample size.

Discussion. We know a lot more about how to estimate required sample sizes for usability problem-discovery tests than we did 25 years ago, but I don’t believe that this knowledge is very prevalent throughout the usability testing community, nor is it widely taught to graduate students. I hope that recent publications [14, 20] will change the current situation.

There will, of course, continue to be discussions about sample sizes for problem-discovery usability tests, but I hope they will be informed discussions. If a practitioner says that five participants are all you need to discover most of the problems that will occur in a usability test, it’s likely that this practitioner is typically working in contexts that have a fairly high value of p and fairly low problem discovery goals. If another practitioner says that he’s been running a study for three months, has observed 50 participants, and is continuing to discover new problems every few participants, then it’s likely that he has a somewhat lower value of p, a higher problem discovery goal, and lots of cash (or a low-cost audience of participants). Neither practitioner is necessarily wrong—they’re just working in different usability testing spaces. The formulas developed over the past 25 years provide a principled way to understand the relationship between those spaces, and a better way for practitioners to routinely estimate sample-size requirements for these types of tests.

References

1. Al-Awar, J., Chapanis, A., & Ford, R. (1981). Tutorials for the first-time computer user. IEEE Transactions on Professional Communication, 24, 30-37.

2. ANSI. (2001). Common industry format for usability test reports (ANSI-NCITS 354-2001). Washington, DC: American National Standards Institute.

3. Bradley, J. V. (1968). Distribution-free statistical tests. Englewood Cliffs, NJ: Prentice-Hall.

4. Bradley, J. V. (1976). Probability; decision; statistics. Englewood Cliffs, NJ: Prentice-Hall.

5. Chapanis, A. (1981). Evaluating ease of use. Unpublished manuscript prepared for IBM, available on request from J. R. Lewis.

6. Diamond, W. J. (1981). Practical experiment designs for engineers and scientists. Belmont, CA: Lifetime Learning Publications.

7. Hertzum, M., & Jacobsen, N. J. (2003). The evaluator effect: A chilling fact about usability evaluation methods. International Journal of Human-Computer Interaction, 15, 183-204.

8. ISO. (1998). Ergonomic requirements for office work with visual display terminals (VDTs) - Part 11: Guidance on usability (ISO 9241-11:1998(E)). Geneva, Switzerland: Author.

9. Kraemer, H. C., & Thiemann, S. (1987). How many subjects? Statistical power analysis in research. Newbury Park, CA: Sage.

10. Lewis, J. R. (1982). Testing small system customer set-up. In Proceedings of the Human Factors Society 26th Annual Meeting (pp. 718-720). Santa Monica, CA: Human Factors Society.

11. Lewis, J. R. (1993). Problem discovery in usability studies: A model based on the binomial probability formula. In Proceedings of the Fifth International Conference on Human-Computer Interaction (pp. 666-671). Orlando, FL: Elsevier.

12. Lewis, J. R. (1994). Sample sizes for usability studies: Additional considerations. Human Factors, 36, 368-378.

13. Lewis, J. R. (2001). Evaluation of procedures for adjusting problem-discovery rates estimated from small samples. International Journal of Human-Computer Interaction, 13, 445-479.

14. Lewis, J. R. (2006). Usability testing. In G. Salvendy (ed.), Handbook of Human Factors and Ergonomics (pp. 1275-1316). New York, NY: John Wiley.

15. Nielsen, J., & Landauer, T.K. (1993). A mathematical model of the finding of usability problems. In Proceedings of ACM INTERCHI’93 Conference (pp. 206-213). Amsterdam, Netherlands: ACM Press.

16. Nielsen, J., & Molich, R. (1990). Heuristic evaluation of user interfaces. In Conference Proceedings on Human Factors in Computing Systems - CHI90 (pp. 249-256). New York, NY: ACM.

17. Perfetti, C., & Landesman, L. (2001). Eight is not enough. Retrieved July 4, 2006 from http://www.uie.com/articles/eight_is_not_enough/

18. Sauro, J. (2006). UI problem discovery sample size. Retrieved July 20, 2006 from http://www.measuringusability.com/samplesize/problem_discovery.php

19. Spool, J., & Schroeder, W. (2001). Testing web sites: Five users is nowhere near enough. In CHI 2001 Extended Abstracts (pp. 285-286). New York, NY: ACM Press.

20. Turner, C. W., Lewis, J. R., & Nielsen, J. (2006). Determining usability test sample size. In W. Karwowski (ed.), International Encyclopedia of Ergonomics and Human Factors (pp. 3084-3088). Boca Raton, FL: CRC Press.

21. Virzi, R. A. (1990). Streamlining the design process: Running fewer subjects. In Proceedings of the Human Factors Society 34th Annual Meeting (pp. 291-294). Santa Monica, CA: Human Factors Society.

22. Virzi, R.A. (1992). Refining the test phase of usability evaluation: How many subjects is enough? Human Factors, 34, 457-468.

23. Walpole, R. E. (1976). Elementary statistical concepts. New York, NY: Macmillan.

24. Wixon, D. (2003). Evaluating usability methods: Why the current literature fails the practitioner. interactions, 10(4), 28-34.

25. Wright, P. C., & Monk, A. F. (1991). A cost-effective evaluation method for use by designers. International Journal of Man-Machine Studies, 35, 891-912.

Author

James R. Lewis
IBM Corp.
jimlewis@us.ibm.com

About the Author:

Jim Lewis has been a usability practitioner at IBM since 1981, working primarily on input methods (especially speech input) and usability evaluation. He studied engineering psychology and applied statistics at New Mexico State University (MA, 1982) and psycholinguistics at Florida Atlantic University (PhD, 1996). He has written several papers on standardized usability questionnaires and sample-size determination and recently wrote the usability testing chapter for the third edition of the Handbook of Human Factors and Ergonomics.

Tables

Table 1. Data from a Hypothetical Usability Test with Eight Subjects, $p_{est}$ = 0.375

Sidebar: The Goal: Problem Discovery

You can’t really talk about discovering 90 percent of all possible usability problems across all possible users, tasks, and environments. You can establish a problem discovery goal given a sampled population of users, a defined set of tasks, and a defined set of environments. Change the population of users, tasks, or environments, and all bets are off. But this is better than nothing. If your problem discovery rate is starting to go down, then change one or all of these elements of usability. Test from a different population of users, using different tasks, in different environments. You’ll discover different problems.

Sidebar: Solution to the Exercise

The Good-Turing adjustment is 0.375/(1 + 2/4) = 0.25. The normalization adjustment is (0.375 - 1/8)(1 - 1/8) = 0.22. Their average, the adjusted estimate of p, is 0.235, a little smaller than the adjusted value at n = 4. The corresponding adjusted estimate for n is log(1-0.90)/log(1-0.235) = log(0.1)/log(0.765) = (-1)/(-0.116) = 8.6, which rounds up to 9. The hypothetical practitioner might consider running one more participant, given the resources to do so. If not, the practitioner can assess the adequacy of the sample size by using the basic formula $1-(1-p)^n$. The estimated proportion of problems discovered is $1-(1-0.235)^8$, which is 0.88 (88 percent), only a little short of the goal of 90 percent.

©2006 ACM  1072-5220/06/1100  $5.00

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2006 ACM, Inc.

 
