A failed launch. If you work in human-computer interaction (HCI) long enough, you'll experience one. It's inevitable that all of us, at one time or another, wind up working on a team that doesn't quite hit the mark. Design projects can fail for many reasons—some within the HCI professional's control and some not. The case study described here covers only a few, but it also brings up questions about larger themes that have been discussed in our industry for many years.
This case study concerns the public-facing website of a multi-state healthcare system headquartered in the Midwestern U.S. Although the majority of its more than 40,000 employees are clinical practitioners and hospital and practice administrators, the corporate office is substantial and well staffed.
In December 2015, the organization hired me as a consultant to conduct a lab-based, moderated, qualitative usability study to ensure a planned redesign was on the right track. This study consisted of 11 one-hour, one-on-one sessions in which participants were asked to complete four tasks with each of two clickable prototypes (one representing the desktop experience and one representing the mobile experience) created in InVision.
The results of this qualitative study were mixed. Participants had positive reactions to the visual design of both the desktop and mobile home pages, but they often didn't notice a key piece of information: which geographic area the site assumed they were in. Regarding the content of pages such as physician detail and facility detail, participants again had largely favorable opinions. For the screen showing the results of their searches (for a physician, hospital, or service), participants experienced a substantial amount of friction, but much of the difficulty was determined to be associated with confusing terminology and a few specific interaction design patterns in the filtering mechanism. Overall, we found that some parts of the prototype tested really well but that the issues of geographic area, filter presentation, and terminology should be subject to additional research.
Approximately seven months later, in July 2016, the redesign was just about ready to launch. To gather baseline metrics on the current live site, we set up a remote, unmoderated usability study using Loop11. In the baseline study:
- There were 100 participants from the four states in which this organization primarily operates, recruited by a third party.
- Each participant completed four tasks: 1) find a doctor, 2) find an urgent care facility, 3) find information about a specialty service (e.g., cancer care), and 4) determine how to pay a bill.
- There were four follow-up questions to help validate that the participants found the correct information.
- Participants completed a System Usability Scale (SUS) questionnaire.
- Participants answered a Net Promoter Score (NPS) question.
The baseline study results were disheartening (Figure 1). Overall task success was only 40 percent. While a "good" SUS is subjective, one published average is 68 [2], and the baseline SUS was 61.2. A "good" NPS is also subjective. The scale ranges from −100 to 100, and published averages by industry range from 2 to 58 [3]. The NPS for this baseline study was −7, far below any of those published averages.
Figure 1. Results of the baseline quantitative usability study conducted in July 2016.
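For readers less familiar with the two questionnaire metrics, the following is a minimal sketch of how SUS and NPS are conventionally scored. It is not taken from the study or from Loop11; the function names and the sample ratings are hypothetical and serve only to illustrate the arithmetic. It assumes the standard 10-item SUS with 1 to 5 ratings and the standard 0 to 10 likelihood-to-recommend question.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Score one participant's 10-item SUS questionnaire (ratings 1-5)."""
    if len(responses) != 10 or any(r < 1 or r > 5 for r in responses):
        raise ValueError("SUS needs ten ratings, each from 1 to 5")
    total = 0
    for index, rating in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded: contribute rating - 1.
            total += rating - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded: contribute 5 - rating.
            total += 5 - rating
    # The raw total runs from 0 to 40; multiplying by 2.5 rescales it to 0-100.
    return total * 2.5


def net_promoter_score(ratings: Sequence[int]) -> float:
    """NPS = percent promoters (9-10) minus percent detractors (0-6)."""
    if not ratings or any(r < 0 or r > 10 for r in ratings):
        raise ValueError("NPS needs at least one rating, each from 0 to 10")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


# Hypothetical data for illustration only (not from the study).
neutral_participant = [3] * 10
hypothetical_ratings = [10, 9, 8, 7, 6, 5, 3, 0, 2, 9]

print(sus_score(neutral_participant))
print(net_promoter_score(hypothetical_ratings))
```

A study-level SUS is typically reported as the mean of the per-participant scores. Because NPS subtracts one percentage from another, a result of −100 means every respondent was a detractor and 100 means every respondent was a promoter, which is why the scale has that range.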
While the team didn't consider these figures desirable, they did make us hopeful that the comparison numbers we would collect after the launch of the redesign would show some improvement.
That comparison study was conducted in December 2016, approximately five months after the redesign was launched. This delay between the launch and the study allowed the development teams to work out any technical issues that surfaced and helped ensure that quality assurance issues weren't going to affect our comparison figures. As in the July 2016 baseline study, we used Loop11 to collect data. In the comparison study, there were 106 participants, recruited through a different provider using the same criteria as in the baseline study. These participants were given the same tasks, follow-up questions, SUS questionnaire, and NPS question as in the baseline study.
If the baseline results were disheartening, the comparison study results were downright depressing (Figure 2). Task completion dropped from 40 percent to 34 percent, SUS dropped from 61.2 to 48, and NPS dropped from −7 to −75.
Figure 2. Results of the comparison usability study conducted in December 2016.
So what happened? This is an organization that claims to adhere to a user-centered approach to design and development, and that validated its design direction with plenty of time to pivot before development and launch. How could such a company end up with a digital product that end users, by all objective measures, found more difficult and less satisfactory than the one it replaced?
In the end, what we really had was the illusion of a user-centered process. My brilliant colleague Joe Grant calls this Shadow UX.
Yes, the team had done several things right, and they used the right vocabulary. There were several known shortcomings in the process, but at the time they were considered inconsequential. And perhaps several of them were, when considered individually. The cumulative effect of these shortcomings, however, was quite serious. We actually had a sizable gap in understanding. And even worse, we weren't fully aware of the magnitude of that gap or of its importance.
So what is Shadow UX? Essentially, it's anything that presents the illusion of a user-centered design process without actually providing the benefits of that process. Usually, it involves some combination of three issues. As it turns out, this case includes flavors of all three:
- You're not doing enough user-centered design activities.
- You're not doing the right user-centered design activities.
- You're doing the right user-centered design activities, but you're doing them poorly.
Let's assume that for a given project, [A B C D E F G H I J K L M N O P] represents an ideal set of user-centered design activities, such as contextual inquiry, card sorting, tree testing, 3x3 [5], and the like (Figure 3). But then we all agree that ideal doesn't exist, so we can settle on [A C D E G H J L M N P] as still pretty good. But maybe our budget or timeline is really constrained, and we need to settle for [A C D G H L M N]. Here we're pushing the boundaries of Shadow UX. But what about [A D H M]? Or [A D M]? Or [A D]? What if you do [A C D H L N], but you do A and D without proper rigor (say, testing with "near users" or with highly biased task phrasing)? Where does that fall on the Shadow UX spectrum? At what point are you no longer doing "real" or "good" UX? At what point do you think you are but really aren't, which is where some real danger can lie, as it did in this case?
Figure 3. Conceptual rendering of a quality spectrum of UX processes.
Did we do enough user-centered design activities? In this case, of course, we did only [M]. One qualitative study certainly isn't enough. With just 11 exposure hours [6] to end users, the team never really stood a chance of creating a useful, usable, satisfying product. Certainly, we were not doing enough user-centered design activities.
But were we doing the right activities? Yes and no. If you could do only one activity, a qualitative study, not unlike the one we did, would usually be my recommendation. But in this case, we should have done it with a prototype of a lower level of fidelity. The InVision prototype used for the qualitative study had a high level of visual fidelity but a low level of actual interaction fidelity—it was a clickable stream of static screens.
The high level of visual fidelity both distracted from more important workflow issues and contributed to our false sense of security, because we heard positive feedback about the visual design. The low interaction fidelity meant we weren't able to test all of the right things. It was important to understand whether and how people could parse the content on the physician detail page, but it was equally important to understand how they would use the filter mechanisms in the search results (which "worked" only one way, so participants essentially couldn't make mistakes). And it turned out to be tremendously important for us to understand the magnitude of that geographical area issue. We were able to note that there was confusion around it, but the very limited prototype didn't facilitate an understanding of the consequences of that confusion, whether users would notice their mistake, and what they would do (if anything) to correct their errors.
Did we do the activity poorly? Although we followed generally accepted best practices in managing the recruiting and in minimizing bias in conducting the sessions, we made some mistakes. Certainly, in addition to the wrong level of fidelity, the prototype was also very limited in scope, such that there was often only one path through the screens: the correct path. So even if participants fumbled, they typically made their way to the right answer. Further, in retrospect, the interpretation of the findings did not draw sufficient attention to the risks posed by the limits of the prototype. The response "We couldn't determine X, so we should do more research" was often given. Instead, "We couldn't determine X, so we should do more research. If that doesn't happen, the project risks Y by not truly understanding X" would have been more appropriate.
Beyond the three basic sources of Shadow UX, this project also had some compounding factors. One was this organization's UX maturity. Another was this consultant's lack of understanding of that maturity.
As mentioned earlier, the organization has some dedicated practitioners on staff, and they have run a number of quantitative and qualitative tests in the past. They've asked me to consult on several occasions, and I know they've worked with other consultants and agencies as well. Yet they still didn't have enough UX maturity to know how detrimental this one-big-study approach could be.
Further, I overestimated their level of knowledge and understanding. Because they had done so much UX work in the past (especially compared with a lot of other non-tech companies in the Midwest), I made an assumption (ugh!) that they fully understood the limitations of doing only one study, and that they were likely doing other UX activities during this effort aside from the one they asked me to handle.
Finally, I was frankly concerned about the appearance of recommending further testing or pushing too hard for it. What client reads "additional testing is needed" and doesn't immediately think "Sure it is, consultant who wants more billable projects"? Because I was trying to avoid sounding greedy or slimy, I left the additional-testing recommendation as a single sentence instead of emphasizing the risk of our poor understanding.
So, is some UX better than none? The debate continues. While it's true that "even the worst test with the wrong user will show you important things you can do to improve your site," the people conducting and acting on that test must also understand the limitations along with the advantages. They need to know they're seeing only some things, and that the risk of skipping additional testing is that there may still be important things missed. Those of us who do understand need to do everything we can to ensure that project stakeholders don't get a false sense of security when practicing Shadow UX.
2. Sauro, J. The system usability scale; https://measuringu.com/sus/
3. Comparing your net promoter score. Net Promoter Network; https://www.netpromoter.com/compare/
5. Righi, C. Building the conceptual model and metaphor: The "3X3". In Design for People by People: Essays on Usability. R. Branaghan, ed. Usability Professionals' Association, Chicago, IL, 2001, 213–219.
6. Spool, J.M. Fast path to a great UX – increased exposure hours; https://articles.uie.com/user_exposure_hours/
Danielle Cooley has been working in design research and strategy for more than 18 years with such companies as Hyundai, Pfizer, Graco, Enterprise Rent-a-Car, Fidelity, and MasterCard. She runs a bespoke consulting practice and speaks frequently at local, national, and international conferences.
Copyright held by author. Publication rights licensed to ACM.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2017 ACM, Inc.