Forums

XVIII.6 November + December 2011

Measuring fun, trust, confidence, and other ethereal constructs


Authors:
Dennis Wixon

It has become fashionable to promote the importance of creating interfaces that provide positive affective experiences. Fashionable trends often involve controversy. Self-appointed arbiters of fashion divide objects into the fashionable and the unfashionable. However, if creating positive experiences is so important, then figuring out how to do so should not be based solely on the opinions of a set of experts, regardless of how well qualified they may appear to be.

In this article, I focus on the challenge of creating interfaces that are fun. Fun is just one example of a large category of goals that go beyond the issue of how well an interface supports productivity. These goals involve users’ emotional reactions to an application, a tool, or a game with which they interact. In fact, these reactions relate to every experience a user has with a product. Did interacting with the product leave you feeling satisfied? Did it leave you feeling reassured? Did interaction impart a feeling of confidence? Did it leave you wanting to use the product again? Would you recommend the product to someone else? If so, whom? If it was a tool, did the tool feel solid and sturdy, like one you could trust not to let you down? Did the interaction leave you feeling that the tool is elegant and smooth in its operation? Did your interaction impart the feeling that this is a quality product? Or does it seem disposable?

These examples of possible user reactions to interactions with products represent the broad category of “user experience goals.” Product designers, user researchers, and business decision makers must be concerned with these user experiences. It’s critical for the product team to decide what user experience the product should engender. Once that decision has been made, it’s equally critical to know how well the product achieves that goal and whether it is progressing toward it over the course of development. The more effectively and efficiently the team can determine how they are doing, the better.

There are some relatively straightforward, if imperfect, methods of achieving this in a more reliable way than simply relying on expert opinion or polemics. They do involve making assumptions at the outset of the process, and they also involve making decisions in the face of incomplete information. These methods and techniques are not the exclusive province of any single discipline; success requires a skilled cross-functional team working together. In this article, I propose a framework for this process and offer some practical suggestions. I also comment on some of the more polemical positions that have been taken in this field. The framework and suggestions are rooted both in theory and the practical experience of successful product teams. They also use some common sense. Finally, I do not claim that my suggestions represent the definitive approach. However, they represent an approach that has worked well for dozens of products over the past 10 years.

A Few Common-Sense Assumptions

Some people maintain that fun or any other subjective experience cannot be measured, as if they were ethereal states outside the realm of science or engineering. In a sense, that is correct, but only in the sense that measuring anything involves some loss of information. Thoughtful consideration of even such well-established measurements as distance demonstrates this loss of information. We have straightforward ways to measure distance using standard tools and well-defined procedures. These measurements meet their primary goal, which is to produce agreement among people. We all agree that a marathon is 26 miles and 385 yards long. That is a common definition, made possible by a set of standardized tools and an agreed-upon process of measurement. The practical result is that anyone can lay out a marathon course. It also means that marathon participants can track their progress and share experiences. But while distance is the definition of a marathon, runners want to know more—for example, altitude change. Marathons on flat ground differ from ones with many hills. Indeed, courses are often described in terms of altitude change.

In this respect, measuring user responses to a product is similar to measuring distance, in that we can propose tools and procedures for measuring those reactions, even though they do not capture everything. The critical element is that the development team agrees on the measures at the outset of the project. That agreement will allow the entire team to have a common definition from the beginning of their quest. The definition may omit elements of the user’s experience, but that is not a critical flaw. In fact, any measurement can be supplemented with additional measurements to enhance the usefulness of the entire measurement framework. Like a runner contemplating a marathon course that has been measured using certain procedures, a product team can know the challenge from the start and plan accordingly. If we accept the argument that experiences can’t be measured, we condemn teams either to chaos, because everyone will have a private definition of fun, or to an arbitrary, authority-based approach, or to decisions driven by functionality and/or schedule. It would be like letting a judge declare who won a marathon, regardless of how fast or far each runner actually ran, simply because time ran out.

Some maintain that adopting a definition of fun or any other user evaluation will stifle the creativity of the team. Again, such an argument is an overgeneralization. A team with a shared goal in mind and a way of assessing progress is constrained, but not completely. Let’s consider the example of a game. In a game the goal is defined. Does that constrain the players? Of course it does. But does it completely constrain them? Of course not. Every game of chess is not completely determined just because the goal is the same. It just means that the creativity of the players is channeled and directed. In fact, it could be argued that having a clear goal and some rules facilitates creativity. In practice, the team will be constrained in any case by resources, time, and ultimately by the response of users to their creation. Having a shared knowledge of the experience goal they are aiming for and a way to assess their progress will help the team, not only by bringing them together, but also by precluding the common error of confusing ends and means. For example, in the world of game development, many products fail because their creators confuse a means, such as including really cool monsters, with an end, such as users having fun and feeling challenged. Great development teams care about user experience and channel their creativity toward creating that experience. Having a measurable goal can help them achieve that result. It’s not the only way, but it’s a good way, and it has worked for many teams.

Finally, it can be argued that the act of measurement creates its own response. Some folks carry this to its extreme, arguing for a kind of Heisenberg Uncertainty Principle of user evaluation: You can’t measure the phenomenon without changing it, so you should not try. Again, this argument contains a germ of truth but is radically overgeneralized. In physics, the Heisenberg Uncertainty Principle states that you cannot simultaneously know both the position and the momentum of a particle with arbitrary precision. It does not say that measurement is impossible or useless. Physicists did not stop measuring phenomena; they found other ways to make measurements. Certainly, researchers need to be concerned about the possibility of biasing an observation through the way they measure it. This is particularly true of evaluations. But let’s assume for a moment that researchers are honest and that it is in their best interest to create unbiased measures. I think we can also assume that users make evaluations of products they use. The evidence suggests that they do so relatively quickly and that these evaluations are based on their experience with and expectations of the product. Thus it can be argued that a measurement approach, such as a research questionnaire, simply elicits a process that occurs anyway. Admittedly, this measurement may be more in-depth and structured than the one the user would typically make, but it is not different in principle. And we must standardize this measurement so that we can agree on what we are looking for.

In summary, I have advanced the following arguments or assumptions:

  • User reaction is like any other quality: It can be measured.
  • Having a measurable goal for a product will be beneficial to a team.
  • Measurement will not stifle the creativity of a team but rather will enhance it.
  • Evaluation is part of using products anyway, so we might as well systematize it.

Some Process Suggestions

If the reader accepts these arguments, the next logical question is, “What’s a reasonable way for a team to effectively agree on an experience goal and a way to measure it?”

At the outset, the team can agree on what experience they want the user to have when they use the product. This is not as straightforward as it seems. For example, many people would argue that the goal of all games is fun. It is true that what counts as fun depends on the person. However, that does not make the concept of fun meaningless or necessitate a descent into solipsism. Again, take the example of a marathon and think about measuring how challenging the course is. The fact that the degree of challenge differs for different people under different conditions does not make the concept of challenge unknowable or unmeasurable. And even though the measure is imperfect, courses can be compared in terms of how challenging they are. The same is true of evaluating experiences. If a team wants to create a feeling that this is a quality product, they would do well to define their target customer and then propose a measure of quality that these users can assess.

Researchers may well ask about the reliability and validity of such measures. For practical research projects intended to serve the needs of product development, these are not simple questions; the answers depend on the goals of the research and the constraints on the team. In commercial development, “reliable” and “valid” often relate to predicting user response to the product when it is released. They may also relate to comparing the product under development with other products in terms of user response. Such prediction is analogous to predictive validity. “Validity” can also refer to credibility with the product team, which is closer to face validity. Because we are dealing with evaluation of products under development (which are therefore constantly changing), concepts like test-retest reliability may be less applicable.
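To make the notion of validity-as-prediction concrete, here is a minimal sketch, with entirely invented numbers, of how a team might check whether fun ratings gathered during development track user response after release. The titles, scores, scales, and the use of Python are assumptions for illustration only, not a prescribed method.

from statistics import correlation  # available in Python 3.10+

# Invented example data: mean in-development fun rating (1-5) and a
# post-release user response score (0-100) for five hypothetical shipped titles.
dev_fun = [3.1, 3.8, 2.6, 4.2, 3.5]
release_score = [68, 81, 55, 90, 74]

# A strong positive correlation would suggest the in-development measure has
# some predictive validity for post-release user response.
r = correlation(dev_fun, release_score)
print(f"Pearson r between in-development fun and post-release response: {r:.2f}")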

It is important to point out that these metrics become useful only in context. For example, if a game scored 3.5 on a fun scale, how do we know if that is good or bad? If we were doing benchmark tests, it would depend on how it compared to other games of a similar type. If we were doing iterative development, how a game scored would depend on how the previous build of the game tested. This dependence on context is obscured for many familiar measures because we already know the context. For example, we know that a marathon is a long distance to run. In our hypothetical example, a product team would benefit from collecting data on any existing competitors to provide context for their measure.
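As a minimal, purely hypothetical sketch of this point about context, the snippet below compares a build’s mean fun rating with the previous build and with a small competitor benchmark; the game names, ratings, and scale are invented.

from statistics import mean

# Invented ratings on a 1-5 fun scale for the current and previous builds.
current_build = [3.5, 3.8, 3.2, 3.6, 3.4, 3.7]
previous_build = [3.0, 3.3, 2.9, 3.1, 3.2, 3.0]

# Invented benchmark scores for comparable games of the same genre.
benchmark = {"Competitor A": 4.1, "Competitor B": 3.9, "Competitor C": 3.4}

cur, prev = mean(current_build), mean(previous_build)
bench = mean(benchmark.values())

print(f"Current build:  {cur:.2f}")
print(f"Previous build: {prev:.2f}  (change: {cur - prev:+.2f})")
print(f"Benchmark mean: {bench:.2f}  (gap to benchmark: {cur - bench:+.2f})")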

Does this mean that evaluative measures like fun or trustworthiness are meaningful or sufficient in and of themselves? Not really. They gain meaning and credibility in combination with behavioral measures and direct observations. When we combine a number of indicators, such as behavioral measures (for example, the mission takes twice as long as the designers intended), direct observation (for example, users consistently backtrack and report that they are lost), and an evaluative rating (for example, users rated their fun in using the application as low), then we can be much more confident that together these measures constitute a true picture of the user experience with the product.
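A rough sketch of that kind of triangulation appears below; the thresholds, field names, and numbers are invented purely to illustrate how several independent indicators, none conclusive on its own, can be combined into a more confident picture.

# Hypothetical results for one game mission, combining three kinds of evidence.
mission = {
    "completion_minutes": 42,         # behavioral: observed completion time
    "intended_minutes": 20,           # behavioral: designers' intent
    "sessions_with_backtracking": 9,  # observational: sessions where users got lost
    "sessions_observed": 12,
    "mean_fun_rating": 2.1,           # evaluative: 1-5 scale
}

signals = {
    "takes far longer than designers intended":
        mission["completion_minutes"] > 1.5 * mission["intended_minutes"],
    "most users backtrack and report being lost":
        mission["sessions_with_backtracking"] / mission["sessions_observed"] > 0.5,
    "fun rated low":
        mission["mean_fun_rating"] < 3.0,
}

# When independent indicators agree, confidence in the overall picture rises.
print(f"{sum(signals.values())} of {len(signals)} indicators point to a problem")
for name, flagged in signals.items():
    print(f"  {'!' if flagged else ' '} {name}")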

Iteration also helps validate our measures. As we make changes to the product based on our observations and measurements and see improvements, we are not only in the process of meaningfully achieving a useful business goal (making a good product), but are also confirming our theories about how various aspects of the interaction influence people’s feelings about the product. In fact, characteristics of the product itself (its visual and interactive design and its functionality) constitute the independent variables, and the measures we have created constitute dependent variables. The assumed (and hopefully verified) relationship among these is analogous to a theory.

One key difference between traditional measurement and measurement for products is that measurement for products needs to be actionable. It’s not enough to tell a team their design is doing well or poorly. The measurement process should help them know what to change—and what not to change—to improve the product. Fortunately, getting an inkling of what to change is not difficult. I have found the following approaches to be highly productive.

First, as part of a testing suite, don’t focus just on a global assessment score; also include an assessment of factors that may drive or contribute to it. For example, if you are testing games using a questionnaire and asking about fun, also ask about potential drivers of fun, such as predictability of controls, effectiveness of feedback, and the like. Also pair a qualitative question with every quantitative question, for example, by asking the respondents to describe in their own words what made this game fun or not fun. That way you get users’ perceptions of what drives their evaluations. These are not definitive, but they still provide important clues to understanding the determinants of the experience. Of course, observing the users that you test will also help you interpret the drivers of their evaluations.
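As a concrete but purely hypothetical sketch, a questionnaire built this way might pair every rating item with an open-ended follow-up; the item wording and the choice of drivers below are invented examples, not a validated instrument.

# Each entry pairs a quantitative rating item with a qualitative follow-up.
questionnaire = [
    {"id": "fun_overall",
     "rating": "Overall, how fun was this session? (1-5)",
     "open_ended": "In your own words, what made it fun or not fun?"},
    {"id": "controls_predictable",
     "rating": "How predictable were the controls? (1-5)",
     "open_ended": "Describe any moment when the controls surprised you."},
    {"id": "feedback_effective",
     "rating": "How clear was the game's feedback on your actions? (1-5)",
     "open_ended": "Was there a point where you didn't know what had happened?"},
]

for item in questionnaire:
    print(f"[{item['id']}]")
    print(f"  Rating question: {item['rating']}")
    print(f"  Follow-up:       {item['open_ended']}")

In practice, the driver items would come from the team’s own hypotheses about what makes their particular product fun, trustworthy, or satisfying.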

While it may seem that such measurement approaches are particularly relevant to games, these suggestions apply to any product. Knowing the primary experience goal, assessing possible drivers, combining quantitative measures with qualitative questions, direct observation, video recording, and logging are all broadly applicable practices.

It may also seem that by advocating for relatively simple evaluative and behavioral measures, I am implicitly arguing against more physiological measures, such as galvanic skin response, pulse, or brainwaves. I am not arguing against such measures in principle. Rather, I am arguing against their presumed superiority. The idea that these measures gauge “true” emotions is absurd. They are just like any other measure, in that they measure something we hope will be related to the thing that really interests us. They need to meet the same criteria as evaluative measures. That is, the team must agree on their relevance, and they then need to be useful in comparing or improving products. They are as vulnerable as any other measurement type to the objection that collecting them may change the experience. Also, collecting them may result in the whole test process having less ecological validity. Few people use products wearing a skull cap that reads their brain waves. They also have a critical limitation, in that their relationship to product goals (for example, fun, satisfaction, and confidence) cannot be determined without additional data. Does an increase in heart rate mean the game is more exciting or more frustrating? Does a lower heart rate mean I trust the software more? The physiological measure may not add incremental value to the additional data we need to gather anyway to make these interpretations.

Finally, the application of these measures in the context of the development of real products is limited. In contrast, evaluative measures and the approach outlined here have a long, proven track record of improving products and leading to the changes that bring the experience closer to the designers’ intended outcome and contribute to the product’s commercial success. In fact, these recommendations are distilled from observing product teams that have built successful products. They are methods that have been honed over the years in the highly competitive space of computer gaming. And though they are probably not the only way to create commercially successful and emotionally compelling products, they are an effective way. Specific examples have been published in a number of venues [1,2,3].

An Appeal to the Community

Beyond advocating for some common-sense assumptions and procedures, one of my goals is to elevate the discussion of affective design beyond its current absolutist and polemical impasse, in which some seem to assert that we must rely on the subjective judgment of the designers or other “experts” such as senior managers. To say you can’t measure fun or any other emotion or complex construct defies logic and contradicts history. It’s equally wrong to reject evaluation of affect simply because no perfect, standard measurement scale exists. Simple, pragmatic measures, especially combinations of types of measures iterated over time, are quite useful, and far from squelching a team’s creativity, they can help focus and inspire it.

It is time to move the discussion of measuring affective user response beyond platitudes and anecdotes. Providing well-documented case studies would move the discussion forward, and it would also be helpful for research institutions to partner with design studios and publishers on applied research efforts. Fortunately, this field is relatively new, so positions have hopefully not yet hardened. In addition, the pursuit of understanding people’s subjective and affective experiences with a broad spectrum of computer products is being recognized as increasingly important. For example, people talk about the “gamification” of products. People also speak of applying the approaches of computer games to education and health problems. There is a clear need for a productive approach that brings all disciplines into a meaningful conversation. If we adopt a stance of dialog along with an empirical and case-study-oriented approach, I believe we can make significant progress. It’s up to us.

References

1. Kim, J., Gunn, D., Schuh, E., Phillips, B., Pagulayan, R., Wixon, D. Tracking real-time user experience (TRUE): A comprehensive instrumentation solution for complex systems. Proc. of the 26th Conference on Human Factors in Computing Systems (Florence, Italy, April 5–8). ACM, New York, 2008, 443–451.

2. Romero, R. Successful instrumentation: Tracking attitudes and behaviors to improve games. GDC 2008; http://download.microsoft.com/download/7/6/a/76ad49f8-cf47-47d3-8894-df30c67e5c1f/GDC_2008_RamonRomero_SuccessfulInstrumentation.pps

3. Thompson, C. Halo 3: How Microsoft Labs invented a new science of play. WIRED (Aug. 2007); http://www.wired.com/gaming/virtualworlds/magazine/15-09/ff_halo

Author

For the past 12 years, Dennis Wixon has managed research for a number of user research teams at Microsoft. He previously worked as a usability manager at Digital Equipment Corporation, where methods such as usability engineering and contextual inquiry were developed. He and Daniel Wigdor are the authors of the book Brave NUI World, about designing touch and gestural interfaces.

©2011 ACM  1072-5220/11/11  $10.00

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.


 
