**Authors:**

James Lewis

Why do we keep talking about appropriate sample sizes for usability tests?

Perhaps the most important factor is the economics of usability testing. For many practitioners, usability tests are fairly expensive events, with much of the expense in the variable cost of the number of participants observed (which includes cost of participants, cost of observers, cost of lab, and limited time to obtain data to provide to developers in a timely fashion). Excessive sampling is always wasteful of resources [9], but when the cost of an additional sample (in usability testing, an additional participant) is high, it is very important that the benefit of additional sampling outweighs the cost.

Another factor is the wide range of test and evaluation situations that fall under the umbrella of usability testing. Usability testing includes three key components: representative participants, representative tasks, and representative environments, with participants’ activities monitored by one or more observers [2]. Within this framework, however, usability tests have wide variation in method and motivation. They can be formal or informal, think-aloud or not, use low-fidelity prototypes or working systems. They can have a primary focus on task-level measurements (summative testing) or problem discovery (formative testing). This latter distinction is very important, as it determines the appropriate general approach to sample-size estimation for usability tests.

When the focus is on task-level measurements, sample-size estimation is relatively straightforward, using mainstream statistical techniques that have been available since the early 20th century (in some cases, even earlier). Basically, you need an estimate of the variance of the dependent measure(s) of interest (typically obtained from previous, similar studies or pilot data) and an idea of how precise the measurement must be (which is a function of the magnitude of the desired minimum critical difference and statistical confidence level); once you have that, the rest is arithmetic. There are numerous sources for information on standard sample-size estimation [6, 23]. For this reason, I’m not going to describe them in any additional detail here (but for a detailed discussion of this type of sample size estimation in the context of usability testing, see Lewis [14]). The less-well-understood problem is sample-size estimation for problem-discovery (formative) testing.

**A Little History.** I first encountered this problem when
I started working for IBM in 1981, fresh from graduate school.
The IBM practice at that time, based on papers published by
Alphonse Chapanis and colleagues [1, 5], was to observe about five
to six participants
per iteration for problem discovery. Chapanis had asserted that
after you’d observed six participants, you would have seen about
all of the problems you were going to see. Based on graduate
statistics classes I’d had with James Bradley [3, 4], I thought
that there must be a way to more
precisely estimate sample sizes for these types of tests.
Specifically, it seemed like you should be able to use the
binomial probability formula for this purpose, and I mentioned
this briefly in my first publication [10]:

The binomial probability theorem can be used to determine the probability that a problem of probability *p* will occur *r* times during a study with *n* subjects. For example, if an instruction will be confusing to 50 percent of the user population, the probability that one subject will be confused is 0.5. If two subjects are observed, then the probability that either one or both subjects will be confused is 0.75; and if three subjects are observed, the probability that at least one of them will be confused is 0.875.

I didn’t mention the now-famous formula 1-(1-*p*)^{n}
in that paper, but that’s the formula I used for the
computations. Bradley taught his students that this was a very
useful formula for many situations, derived from the binomial
probability formula as P(At least once) = 1 - P(0) (in other
words, the probability of something happening at least once is 1
minus the probability of its not happening at all). When *r*
= 0 in the binomial probability formula, P(0) is
(1-*p*)^{n}, so P(At least once) is
1-(1-*p*)^{n}.
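The quoted figures (0.5, 0.75, 0.875) can be reproduced directly from P(At least once) = 1 - P(0). The following is a minimal sketch, assuming independent participants who all share the same probability *p*; the function name is illustrative and not taken from the article.

```python
def p_at_least_once(p: float, n: int) -> float:
    """Probability that a problem with per-participant probability p
    occurs at least once among n independent participants."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # P(0 occurrences) is (1 - p)**n, so P(at least once) is its complement.
    return 1.0 - (1.0 - p) ** n


# The example from the 1982 quotation: an instruction that confuses 50 percent of users.
for n in (1, 2, 3):
    print(n, p_at_least_once(0.5, n))  # 0.5, 0.75, 0.875
```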

The years 1990 through 1994 saw a series of publications
investigating the use of the formula to model usability problem
discovery, including empirical verification of its accuracy for
problem discovery studies, in which sample size refers to the
number of participants, and heuristic evaluations, in which
sample size refers to the number of independent observers
[21, 22, 25, 15, 12]. These studies provided
quite a bit of evidence that 1-(1-*p*)^{n} is
a good model of problem discovery. For problem-discovery tests,
this literature contains several large-sample examples that
showed *p* ranging from 0.16 to 0.42 [12]. For
several large-sample heuristic evaluations, the reported value of
*p* ranged from 0.22 to 0.60 [16].

So, what does 1-(1-*p*)^{n} suggest about
usability-problem discovery? Note that there are only two
variables—*p* and *n*. The most direct
interpretation of this is that many other variables that we might
assume would affect problem discovery—such as the cost of
fixing a problem or the severity of the problem from the user’s
perspective—don’t. For example, Virzi [22]
reported earlier discovery of more-serious problems, but I failed
to replicate that finding [12]. Also, a
return-on-investment (ROI) model in the same paper showed that as
the magnitude of the savings associated with early discovery
versus late discovery increased, the ROI of a usability study
also increased, but this factor had no appreciable effect on the
sample size at maximum ROI [12].

An additional outcome of the ROI study was that the
appropriate problem discovery goal depended on the value of
*p*. The model indicated that if the expected value of
*p* was small (say, around 0.10), practitioners should plan
to discover about 86 percent of the problems. If the expected
value of *p* was larger (say, around 0.25 or 0.50),
practitioners should plan to discover about 98 percent of the
problems. For expected values of *p* between 0.10 and 0.25,
practitioners should interpolate between 86 and 98 percent to
determine an appropriate goal for the percentage of problems to
discover. The analysis did not address values of *p* smaller
than 0.10, but, presumably, the appropriate goal would be
something less than 86 percent.

If you know or can estimate the expected value of *p* for
a study and know the desired problem discovery goal, you can
compute *n* with the following formula (derived
algebraically from *Goal* =
1-(1-*p*)^{n}, solving for *n*):

$$n = \frac{\log(1 - \mathit{Goal})}{\log(1 - p)}$$
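As a rough sketch of that rearrangement in code: the function below divides log(1 - *Goal*) by log(1 - *p*) and rounds up to a whole participant. The name `required_sample_size` and the choice to always round up are assumptions made for illustration.

```python
import math


def required_sample_size(p: float, goal: float) -> int:
    """Smallest whole n for which 1 - (1 - p)**n reaches the discovery goal.

    Solves Goal = 1 - (1 - p)**n for n, i.e. n = log(1 - Goal) / log(1 - p).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < goal < 1.0:
        raise ValueError("goal must be strictly between 0 and 1")
    n = math.log(1.0 - goal) / math.log(1.0 - p)
    # A fractional participant cannot be run, so round up.
    return math.ceil(n)


print(required_sample_size(0.5, 0.90))  # 3.3 rounds up to 4
```

The base of the logarithm does not matter, because it cancels in the ratio.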

But getting an estimate of *p* can be tricky if you’re
working with small samples. For many years, I’d assumed that
small-sample estimates of *p* would behave like small-sample
estimates of the arithmetic mean—that they would have more
variability than large-sample estimates, but would be unbiased
(tending to have the same value as large-sample estimates in the
long run). In 2001 I found out that this assumption was
completely wrong. I was editing a special issue of the
*International Journal of Human-Computer Interaction* on
Usability Evaluation (Vol. 13, No. 4), and received a manuscript
from Morten Hertzum and Niels Jacobsen in which they proved that
small-sample estimates of *p* were necessarily biased to be
higher than the actual population problem discovery rate
[7]!

In response to this, I investigated a number of methods for adjusting problem-discovery rates estimated from small samples [13]. The best method for compensating for the bias was to average two methods: one based on Good-Turing discounting and a normalization method based on the work of Hertzum and Jacobsen. The resulting adjustment looks complicated, but it won’t seem quite so bad after going through a worked-out example (in the next section):

$$p_{adj} = \frac{1}{2}\left[\left(p_{est} - \frac{1}{n}\right)\left(1 - \frac{1}{n}\right)\right] + \frac{1}{2}\left[\frac{p_{est}}{1 + GT_{adj}}\right]$$

*GT*_{adj} is the Good-Turing adjustment to probability space to account for unseen events (which is the proportion of the number of problems that occurred once divided by the total number of different problems). The *p*_{est}/(1 + *GT*_{adj}) component in the equation produces the Good-Turing-adjusted estimate of *p* by dividing the observed, unadjusted estimate of *p* (*p*_{est}) by the Good-Turing adjustment to probability space. The (*p*_{est} - 1/*n*)(1 - 1/*n*) component in the equation produces the normalized estimate of *p* from the observed, unadjusted estimate of *p* and *n* (the sample size used to estimate *p*). The adjustment uses the average of these two estimates, because the Good-Turing estimator tends to overestimate the true value of *p*, but normalization tends to underestimate it [13]. Note that the Good-Turing adjustment is a function of the number of infrequently occurring problems, whereas normalization is a function of the estimate’s sample size. The Monte Carlo experiments of Lewis [13] demonstrated that this adjustment works very well, even with initial sample sizes as small as two to four participants.

**A Hypothetical Example.** The best way to work with these
formulas is to create a participant-by-problem matrix, as shown
in Table 1.

One of several ways to compute *p* is to divide the
number of problem occurrences by the number of participants times
the number of problems. After running eight participants, the
estimate of *p* is 0.375 (12/(8*4)). But what did things
look like after having run the first four? At that time there was
no evidence that Problem 2 existed, so the estimate of *p*
was 6/(3*4), or 0.500 (an example of the bias described by
Hertzum and Jacobsen [7]). Furthermore, suppose
you had established a goal of 90 percent problem discovery.
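The computation of the unadjusted *p* from a participant-by-problem matrix can be sketched as follows. The matrix here is not the article's Table 1: it is a hypothetical stand-in, built only so that its totals match the figures quoted above (12 occurrences across 8 participants and 4 problems, with Problem 2 unseen among the first four participants). The name `estimate_p` is likewise illustrative.

```python
# Hypothetical participant-by-problem matrix (NOT the article's Table 1).
# Rows are participants, columns are Problems 1-4; 1 = problem observed.
MATRIX = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
]


def estimate_p(rows):
    """Unadjusted p: occurrences / (participants * problems discovered so far)."""
    if not rows:
        raise ValueError("need at least one participant")
    # Occurrences per problem; problems not yet seen cannot be counted,
    # because at that point there is no evidence that they exist.
    counts = [sum(column) for column in zip(*rows)]
    discovered = [c for c in counts if c > 0]
    if not discovered:
        raise ValueError("no problems observed")
    return sum(discovered) / (len(rows) * len(discovered))


print(estimate_p(MATRIX))      # all eight participants: 12 / (8 * 4) = 0.375
print(estimate_p(MATRIX[:4]))  # first four participants: 6 / (4 * 3) = 0.5
```

Dropping undiscovered problems from the denominator is what pushes the small-sample estimate upward in this sketch.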

If you were to estimate the sample-size requirement using the
unadjusted value of *p*, you’d get *n* =
log(1-0.90)/log(1-0.5) = log(0.1)/log(0.5) = (-1)/(-0.301) = 3.3,
which rounds up to 4.

How much would this change using the adjusted value of
*p*? First, let’s do the Good-Turing adjustment. We need to
know the total number of discovered problems (three after having
observed four participants), and how many of those had occurred
just once (one). For this example, the adjustment is 0.5/(1 +
1/3), which equals 0.375. Next is the normalization procedure,
which is (0.5 - 1/4)(1 - 1/4) = 0.188. The average of these two
adjustments is 0.28—almost half the unadjusted value. The
correspondingly adjusted estimate of *n* is
log(1-0.90)/log(1-0.28) = log(0.1)/log(0.72) = (-1)/(-0.143) =
7—almost double the original estimate (but still not
terribly large).
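The arithmetic in this worked example can be checked with a short script. This is a sketch under the assumptions stated in the text (Good-Turing and normalization estimates averaged with equal weight); the function and argument names are illustrative.

```python
import math


def adjusted_p(p_est, n, n_problems, n_once):
    """Average of the Good-Turing and normalization adjustments to p.

    p_est      -- unadjusted estimate of p
    n          -- number of participants used to estimate p
    n_problems -- number of different problems discovered so far
    n_once     -- number of those problems that occurred exactly once
    """
    gt_adj = n_once / n_problems                      # Good-Turing adjustment
    good_turing = p_est / (1.0 + gt_adj)              # tends to overestimate p
    normalized = (p_est - 1.0 / n) * (1.0 - 1.0 / n)  # tends to underestimate p
    return (good_turing + normalized) / 2.0


# After four participants: p_est = 0.5, three problems found, one seen only once.
p_adj = adjusted_p(0.5, n=4, n_problems=3, n_once=1)
n_needed = math.ceil(math.log(1.0 - 0.90) / math.log(1.0 - p_adj))

print(round(p_adj, 2))  # 0.28
print(n_needed)         # 7
```

The script carries the unrounded average (0.28125) into the sample-size step, which gives about 6.97 before rounding up.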

As an exercise to the reader, what are the adjusted values for
*p* and *n* if you use the data from all eight
participants in Table 1? If you don’t want to
drag out the calculator with the log functions, try the
sample-size calculator at the Measuring Usability Web site
(http://www.measuringusability.com/samplesize/problem_discovery.php) [18].

**The "Eight Is Not Enough" Example.** In 2001,
Spool and Schroeder [19] published the results of a large-scale
usability evaluation in which they concluded that five users were
"nowhere near enough" to find all (or even 85 percent)
of the usability problems in the Web sites they were studying.
Perfetti and Landesman [17], discussing related
research, stated:

When we tested the site with 18 users, we identified 247 total obstacles-to-purchase. Contrary to our expectations, we saw new usability problems throughout the testing sessions. In fact, we saw more than five new obstacles for each user we tested. Equally important, we found many serious problems for the first time with some of our later users. What was even more surprising to us was that repeat usability problems did not increase as testing progressed. These findings clearly undermine the belief that five users will be enough to catch nearly 85 percent of the usability problems on a Web site. In our tests, we found only 35 percent of all usability problems after the first five users. We estimated over 600 total problems on this particular online music site.

Based on this estimate, it would have taken us 90 tests to discover them all!

The information provided in this paragraph shows that the
value of *p* in this study was very small. If there were 600
usability problems available for discovery given the study’s
method, then 247 problems are 41 percent of the total available
for discovery. Taking 1-(1-*p*)^{18} = 0.41 and
solving for *p* gives *p* = 0.029.

Given *p* = 0.029, the percentage of discovery expected
when *n* = 5 is 13.7 percent. In accordance with the data
reported by Perfetti and Landesman, 13.7 percent of 600 is 82
problems, which is about 35 percent of the total number of
problems they discovered with 18 participants (35 percent of 247
is 86).
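The two steps above (backing out *p* from the proportion discovered, then projecting discovery at *n* = 5) can be sketched as below. The function name is illustrative, and the script uses the exact ratio 247/600 rather than the rounded 0.41, so intermediate digits may differ slightly from the text.

```python
def p_from_discovery(fraction_found: float, n: int) -> float:
    """Invert fraction_found = 1 - (1 - p)**n to recover p."""
    if not 0.0 <= fraction_found < 1.0:
        raise ValueError("fraction_found must be in [0, 1)")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - fraction_found) ** (1.0 / n)


# 247 problems found out of an estimated 600, using 18 participants.
p = p_from_discovery(247 / 600, 18)

# Expected proportion (and count) of problems discovered by the first five users.
found_at_5 = 1.0 - (1.0 - p) ** 5

print(round(p, 3))              # 0.029
print(round(found_at_5, 3))     # 0.137
print(round(600 * found_at_5))  # 82
```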

For the conditions present in their study, it is not surprising that they continued to see more than five new problems with each participant. In fact, you wouldn’t expect the number of new problems per participant to fall below five until around the 45th participant. This is what you’d generally expect with a low problem discovery rate and a large number of problems available for discovery.
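One way to check this expectation is to note that, if every problem shares the same *p*, the expected number of new problems contributed by the *k*-th participant is *N* × *p* × (1-*p*)^{k-1}, where *N* is the number of problems available for discovery. The sketch below relies on that equal-*p* simplification, and its function names are hypothetical.

```python
def expected_new_problems(total_problems: int, p: float, k: int) -> float:
    """Expected number of problems first seen with participant k (1-based),
    assuming every problem has the same discovery probability p."""
    return total_problems * p * (1.0 - p) ** (k - 1)


def first_participant_below(total_problems: int, p: float, threshold: float) -> int:
    """First participant whose expected count of new problems is under threshold."""
    if not 0.0 < p < 1.0 or threshold <= 0.0:
        raise ValueError("need 0 < p < 1 and a positive threshold")
    k = 1
    while expected_new_problems(total_problems, p, k) >= threshold:
        k += 1
    return k


# Values from the text: about 600 problems available, p = 0.029.
print(expected_new_problems(600, 0.029, 18))   # still well above five at user 18
print(first_participant_below(600, 0.029, 5))  # in the mid-40s
```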

Their discovery of serious problems with later users is consistent with Lewis [12], which failed to replicate the early discovery of serious problems reported by Virzi [22].

The low incidence of repeat usability problems is also
consistent with low values of *p*. A high incidence of
repeat usability problems is more likely with evaluations of
early designs than evaluations of more mature designs. Usability
testing of designs that have already had common usability
problems removed is likely to uncover problems that are
relatively idiosyncratic, which seems to have been the case with
this study. Also, as the authors report, the tasks given to
participants were somewhat unstructured, which could have
expanded the space of problems available for discovery.

Their primary conclusion—that five or eight users aren’t
enough to discover 85 percent of the problems available for
discovery when *p* = 0.029—is well founded. On the
other hand, even with this extremely low value of *p*, the
expected percentage discovered with eight participants is about
21 percent, which is certainly better than not running any
participants at all. When *p* is this small, if the goal is
to discover 85 percent of the problems available for discovery,
then the required sample size is 65. If the goal is to discover
99 percent ("all") of the 600 problems, then the
required sample size is 157.

What we don’t know from this study is how likely it is to have
such a low value of *p*. The authors surmised that this
might be a characteristic of usability studies of Web sites, but
it could also be a function of the testing method or the level of
description of usability problems. Regardless, this example
illustrates the importance of computing an early estimate of
*p* and making an explicit decision about the desired
percentage of problem discovery as integral steps for rationally
determining the required sample size.

**Discussion.** We know a lot more about how to estimate
required sample sizes for usability problem-discovery tests than
we did 25 years ago, but I don’t believe that this knowledge is
very prevalent throughout the usability testing community, nor is
it widely taught to graduate students. I hope that recent
publications [14, 20] will
change the current situation.

There will, of course, continue to be discussions about sample
sizes for problem-discovery usability tests, but I hope they will
be informed discussions. If a practitioner says that five
participants are all you need to discover most of the problems
that will occur in a usability test, it’s likely that this
practitioner is typically working in contexts that have a fairly
high value of *p* and fairly low problem discovery goals. If
another practitioner says that he’s been running a study for
three months, has observed 50 participants, and is continuing to
discover new problems every few participants, then it’s likely
that he has a somewhat lower value of *p*, a higher problem
discovery goal, and lots of cash (or a low-cost audience of
participants). Neither practitioner is necessarily
wrong—they’re just working in different usability testing
spaces. The formulas developed over the past 25 years provide a
principled way to understand the relationship between those
spaces, and a better way for practitioners to routinely estimate
sample-size requirements for these types of tests.

**References**

1. Al-Awar, J., Chapanis, A., & Ford, R.
(1981). Tutorials for the first-time computer user. *IEEE
Transactions on Professional Communication, 24*, 30-37.

2. ANSI. (2001). *Common industry format
for usability test reports* (ANSI-NCITS 354-2001). Washington,
DC: American National Standards Institute.

3. Bradley, J. V. (1968).
*Distribution-free statistical tests*. Englewood Cliffs, NJ:
Prentice-Hall.

4. Bradley, J. V. (1976). *Probability;
decision; statistics*. Englewood Cliffs, NJ:
Prentice-Hall.

5. Chapanis, A. (1981). *Evaluating ease of
use*. Unpublished manuscript prepared for IBM, available on
request from J. R. Lewis.

6. Diamond, W. J. (1981). *Practical
experiment designs for engineers and scientists*. Belmont, CA:
Lifetime Learning Publications.

7. Hertzum, M., & Jacobsen, N. J. (2003).
The evaluator effect: A chilling fact about usability evaluation
methods. *International Journal of Human-Computer
Interaction, 15*, 183-204.

8. ISO. (1998). *Ergonomic requirements for
office work with visual display terminals (VDTs) - Part 11:
Guidance on usability* (ISO 9241-11:1998(E)). Geneva,
Switzerland: Author.

9. Kraemer, H. C., & Thiemann, S. (1987).
*How many subjects? Statistical power analysis in research*.
Newbury Park, CA: Sage.

10. Lewis, J. R. (1982). Testing small
system customer set-up. In *Proceedings of the Human Factors
Society 26th Annual Meeting* (pp. 718-720). Santa Monica, CA:
Human Factors Society.

11. Lewis, J. R. (1993). Problem discovery
in usability studies: A model based on the binomial probability
formula. In *Proceedings of the Fifth International Conference
on Human-Computer Interaction* (pp. 666-671). Orlando, FL:
Elsevier.

12. Lewis, J. R. (1994). Sample sizes for
usability studies: Additional considerations. *Human Factors,
36*, 368-378.

13. Lewis, J. R. (2001). Evaluation of
procedures for adjusting problem-discovery rates estimated from
small samples. *International Journal of Human-Computer
Interaction, 13*, 445-479.

14. Lewis, J. R. (2006). Usability testing.
In G. Salvendy (ed.), *Handbook of Human Factors and
Ergonomics* (pp. 1275-1316). New York, NY: John Wiley.

15. Nielsen, J., & Landauer, T. K.
(1993). A mathematical model of the finding of usability
problems. In *Proceedings of ACM INTERCHI’93 Conference*
(pp. 206-213). Amsterdam, Netherlands: ACM Press.

16. Nielsen, J., & Molich, R. (1990).
Heuristic evaluation of user interfaces. In *Conference
Proceedings on Human Factors in Computing Systems - CHI90*
(pp. 249-256). New York, NY: ACM.

17. Perfetti, C., & Landesman, L.
(2001). *Eight is not enough*. Retrieved July 4, 2006 from
http://www.uie.com/articles/eight_is_not_enough/

18. Sauro, J. (2006). *UI problem discovery sample size*. Retrieved July 20, 2006 from the Measuring Usability Web site: http://www.measuringusability.com/samplesize/problem_discovery.php

19. Spool, J., & Schroeder, W. (2001).
Testing web sites: Five users is nowhere near enough. In *CHI
2001 Extended Abstracts* (pp. 285-286). New York: ACM
Press.

20. Turner, C. W., Lewis, J. R., &
Nielsen, J. (2006). Determining usability test sample size. In W.
Karwowski (ed.), *International Encyclopedia of Ergonomics and
Human Factors* (pp. 3084-3088). Boca Raton, FL: CRC Press.

21. Virzi, R. A. (1990). Streamlining the
design process: Running fewer subjects. In *Proceedings of the
Human Factors Society 34th Annual Meeting* (pp. 291-294).
Santa Monica, CA: Human Factors Society.

22. Virzi, R. A. (1992). Refining the test
phase of usability evaluation: How many subjects is enough?
*Human Factors, 34*, 457-468.

23. Walpole, R. E. (1976). *Elementary
statistical concepts*. New York, NY: Macmillan.

24. Wixon, D. (2003). Evaluating usability
methods: Why the current literature fails the practitioner.
*interactions, 10(4)*, 28-34.

25. Wright, P. C., & Monk, A. F. (1991).
A cost-effective evaluation method for use by designers.
*International Journal of Man-Machine Studies, 35*,
891-912.

**Author**

James R. Lewis

IBM Corp.

jimlewis@us.ibm.com

**About the Author:**

*Jim Lewis has been a usability practitioner at IBM since
1981, working primarily on input methods (especially speech
input) and usability evaluation. He studied engineering
psychology and applied statistics at New Mexico State University
(MA, 1982) and psycholinguistics at Florida Atlantic University
(PhD, 1996). He has written several papers on standardized
usability questionnaires and sample-size determination and
recently wrote the usability testing chapter for the third
edition of the* Handbook of Human Factors and Ergonomics.

**Tables**

Table 1. Data from a Hypothetical Usability Test with
Eight Subjects, *p*_{est} = 0.375

**Sidebar: The Goal: Problem Discovery**

You can’t really talk about discovering 90 percent of all possible usability problems across all possible users, tasks, and environments. You can establish a problem discovery goal given a sampled population of users, a defined set of tasks, and a defined set of environments. Change the population of users, tasks, or environments, and all bets are off. But this is better than nothing. If your problem discovery rate is starting to go down, then change one or all of these elements of usability. Test from a different population of users, using different tasks, in different environments. You’ll discover different problems.

**Sidebar: Solution to the Exercise**

The Good-Turing adjustment is 0.375/(1 + 2/4) = 0.25. The
normalization adjustment is (0.375-1/8)(1-1/8) = 0.22. Their
average, the adjusted estimate of *p*, is 0.235, a little
smaller than the adjusted value at *n* = 4. The
corresponding adjusted estimate for *n* is
log(1-0.90)/log(1-0.235) = log(0.1)/log(0.765) = (-1)/(-0.116) =
8.6, which rounds up to 9. The hypothetical practitioner might
consider running one more participant, given the resources to do
so. If not, the practitioner can assess the adequacy of the
sample size by using the basic formula
1-(1-*p*)^{n}. The estimated proportion of
problems discovered is 1-(1-0.235)^{8}, which is 0.88 (88 percent), only a little short of the goal of 90 percent.

**©2006
ACM 1072-5220/06/1100 $5.00**

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2006 ACM, Inc.
