When A/B testing gets an F

Authors: Jonathan Grudin
Posted: Tue, July 02, 2013 - 11:33:18

A relationship is like a shark, it has to constantly move forward or it dies. And I think what we got on our hands is a dead shark. —Woody Allen, Annie Hall

Like sharks in search of their next meal, living websites constantly move forward. How do they decide where to go? Many popular sites rely on A/B testing. Different versions of a feature, layout design, or advertisement are presented to thousands of users. What people try, how long they remain, and whether they click through are logged. Although called A/B testing, more than two alternatives can be compared.

This is a modern variant of a familiar controlled usability experiment. In the 1980s, my first usability studies required weeks to recruit 10 or 20 participants and get them into the lab. Today, it can take just minutes to identify preferences with statistical reliability. The website is the laboratory and the participants are unpaid users. The winning design can be rolled out to everyone. Refinement and evolution can proceed as rapidly as designers can generate ideas. Life is good.
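To make "statistical reliability" concrete, here is a minimal sketch of one common way to compare two variants: a two-proportion z-test on click-through counts. The function name and the traffic numbers are hypothetical, chosen only for illustration, and this is not a description of how any particular site analyzes its tests.

```python
import math

def two_proportion_z_test(clicks_a, users_a, clicks_b, users_b):
    """Return (z, two_sided_p) for the difference in click-through rates."""
    rate_a = clicks_a / users_a
    rate_b = clicks_b / users_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (users_a + users_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical traffic: 10,000 users per variant.
z, p = two_proportion_z_test(clicks_a=500, users_a=10_000,
                             clicks_b=580, users_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With thousands of users per variant, even a difference of less than one percentage point can clear a conventional significance threshold, which is why a busy site can reach a verdict so quickly.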

Continuous product release

Turn and face the strain —David Bowie, “Changes”

Adam Pisoni, co-founder of the enterprise social networking service Yammer, lists companies that did not move forward and ended up dead sharks: Blockbuster, Bethlehem Steel, Tower Records, and so on. Pisoni is a forceful advocate of A/B testing. He notes that constant change conflicts with predictability, which companies traditionally relied on, but argues that today there is “this larger thing going on in business, this issue of predictability versus adaptability… As the world changes really fast, and things are changing, predictability becomes counterproductive.”

However, we’re creatures of habit. Habits can engender efficiency. A habit relies on the world having some predictability. Software designers fear the wrath of the “installed base”—a ball and chain for legacy software. A new company can have an advantage when introducing a novel interface: It won’t earn a reputation for abandoning customers.

Pisoni acknowledges the problem of a rapid pace of change: “The way we built started impacting customers in weird ways. We release constantly. We’ve always released at least once a week if not more. Customers started coming to us saying ‘man, we love how easy it is to use. However, we want you to build it differently, we want you to build it the traditional way, give us 3-year timelines and all that.’”

No can do. Pisoni’s goal is to go from weekly releases to continual release. Yammer is not alone in turning to face the strain. Facebook has used A/B testing and weekly pushes as it adds, removes, or changes features. Its 2006 introduction of the News Feed dissatisfied many existing users who saw clutter, yet Facebook weathered the storm and prevailed. A/B testing also drives the evolution of Google, Bing, Amazon, and other Web products.

A/B testing works best if you know what to measure. Who knows better than advertisers? They can determine which design increases click-through or purchases. For them, A/B testing gets an A. Well, it may not reveal how often products are returned for refunds, or the likelihood of repeat business. Let’s make it an A-.

Hey?

The Obama team used A/B testing extensively in 2012. The subject header of one solicitation read simply, “Hey”. It surprised many people. It annoyed some, but A/B testing showed it was remarkably effective at drawing contributions. Money talks. I received nine “Hey” emails, from Barack Obama, Joe Biden, and other close friends.

The Obama campaign had one and only one goal: A majority of the electoral votes on November 6. The money contributed to reaching that goal. Of course, for most people Obama’s election was a means to another end: strengthening the Democratic Party, a progressive agenda, or something else. What was the effect of “Hey” on these goals? A/B testing optimizes for the here and now; when will local optimization end up hurting in the long term?

I was annoyed and burned out by the extraordinary barrage of email marketing late in the campaign. On November 7, I began removing myself from scores of Democratic and progressive distribution lists. I welcomed the election outcome but wanted a year or two to recover. A classic tragedy of the commons; I was the overgrazed commons. Did the pitcher go to the well once too often? I was not alone in my flight to solitude. Check with us in a couple years. A/B testing here earns an Incomplete.

When A/B testing can get an F

In the May 14, 2001 issue of The New Yorker, a perceptive article by Tad Friend, “The Next Big Bet,” discussed the HBO television series Six Feet Under and contrasted the radically different business models of commercial television networks and HBO. His analysis provides insight into where A/B testing of a general population can go wrong—and how A/B testing could benefit from supplemental techniques.

The networks sell advertisements. How much they can charge depends primarily on how many viewers a show attracts. Every show strives to attract the largest possible audience, with perhaps some attention to age or income. A new show that draws precisely the same audience as a very successful existing show will be very successful. Hence, there were five different Law and Order series, three CSI series, and an unending progression of reality TV shows and sports coverage.

HBO relies on subscriptions, not ads. They want shows that appeal to the greatest common denominator—sports, crime series, reality shows, whatever. But once they have a strong success in a genre, an imitation may not attract new subscribers. More valuable is a novel show that appeals to a niche market that has not yet subscribed. Consider a very popular show that appeals to 30% of the potential audience. If HBO creates another show that appeals to the same 30%, it may get no new subscribers. A new show that appeals to a different 10% of the potential audience may attract millions of new subscribers. Hypothetically, if HBO had 10 shows with a powerful magnetic appeal to a different 10% of viewers, they might get everyone subscribing, even if no one show appealed to 30%.

This is where A/B testing alone can flounder. A/B testing on the existing user base may not detect something that will appeal to a niche that has not yet subscribed, and testing that identifies popular choices could provide six reality shows that appeal to the same 30% of the market. Each 10% niche show will lose against a 30% show, when cumulatively they would attract more subscribers.
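The arithmetic can be sketched with a toy model. Everything here is an assumption made for illustration: an audience of 100 viewers identified by integer IDs, six mainstream shows that all appeal to the same 30 viewers, and seven niche shows that each appeal to a different 10. The "greedy" selection is a simple stand-in for choosing shows by the new viewers they add; it is not a method attributed to HBO or to anyone quoted here.

```python
def reach(shows):
    """Number of distinct viewers attracted by at least one show."""
    return len(set().union(*shows)) if shows else 0

def pick_by_marginal_reach(candidates, k):
    """Greedily pick k shows, each time taking the one that adds the most new viewers."""
    chosen, covered = [], set()
    for _ in range(k):
        best = max(candidates, key=lambda show: len(show - covered))
        chosen.append(best)
        covered |= best
    return chosen

# Hypothetical audience of 100 viewers, IDs 0..99.
mainstream = [set(range(0, 30)) for _ in range(6)]                # same 30% each
niche = [set(range(30 + 10 * i, 40 + 10 * i)) for i in range(7)]  # disjoint 10% each
candidates = mainstream + niche

# Selection by head-to-head popularity: the six largest audiences win.
by_popularity = sorted(candidates, key=len, reverse=True)[:6]

print(reach(by_popularity))                          # six shows, same viewers
print(reach(pick_by_marginal_reach(candidates, 6)))  # one hit plus five niches
```

In this toy setup the popularity-ranked slate reaches 30 viewers, while the slate chosen for new viewers reaches 80, even though every niche show would lose a head-to-head comparison against any mainstream show.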

This is not necessarily constrained to the television world. Let’s consider Facebook and Yammer. Are they more like the commercial television model or the HBO model today? Tomorrow?

Facebook has constantly moved forward. It swims in a sea on which many dead sharks float. Supported by A/B testing, Facebook made solid decisions. Few abandoned it and more flocked to join. Like the television networks, Facebook relies on advertising revenue. It wants eyeballs. A one-size-that-fits-the-greatest-common-denominator strategy may work. If Facebook adopts design A and leaves behind the minority who preferred B, it may be OK to lose some niche participation.

Facebook does lose niche participation. I’m in one of those niches. Facebook took away the two features I liked most, so I use it less. One was a presentation feature, one was a view. My original Facebook profile listed my favorite books, in three categories; my favorite films, also in three categories; music; a set of my favorite quotations; and so on. It was a personal statement that some people noticed. Facebook removed most of it entirely. Some could be partly reconstructed in a less compact, less easily scanned format. The once-prominent quotations exist but you probably can’t find them.

My favorite view listed in reverse chronological order the most recent post by each of my friends. This was a wonderful way to catch up quickly on everyone without being bogged down by those who post minute-by-minute accounts of their trips to get a latte. For whatever reasons, this view disappeared.

A/B testing must have shown Facebook would prosper without my pride in profile and my attention. It has. My niche was small.

How will this strategy fare in the long run? Will maximizing the eyeballs delivered to advertisers succeed? Might they create opportunities for HBO-like sites that appeal to niches such as mine? I consider the possible evolution of online sites in the next section, after a look at Yammer.

Yammer links employees within an organization. It wants to attract new organizations, and is thus more like HBO. However, with A/B testing across its customer base and frequent interface changes, it is betting on a greatest-common-denominator strategy. This could be a problem if different interfaces would appeal to different companies or industries; for example, if markedly different feature sets would appeal to financial companies, medical companies, and tech companies, or if cultural or regulatory differences would affect feature preferences. Within a company, A/B testing could miss major differences: Perhaps marketing and sales groups would flock to something very different from design and engineering. A/B testing could favor the preferences of the more numerous young, adaptive individual contributors, but the niche comprising executives and managers who desire slower, more predictable change could be significant for an enterprise service.

We don’t know—it is early days. But assume that Pisoni’s broad A/B testing delivers changes that appeal to 10% of every company. Customized interfaces would be more complex to design and manage, but they might appeal to 50% of each company. This is a classic market segmentation tradeoff. Perhaps 10% per organization is enough to sustain use and deliver on enterprise goals. But if 10% is below the critical mass to sustain use or if the goals require higher participation rates, the outcome is not so great. And even in the former case, niches might be created for competitors who provide features that appeal to more than 10% of the employees.

This is of course speculative. But the analysis suggests techniques that could supplement A/B testing to provide a more versatile process. Before concluding by discussing these techniques, let’s briefly consider the history of mass market versus niche solutions.

Market segmentation and a vulnerability of A/B testing

When a desirable product is first widely available, having it is a pleasure and owning it is status enough. Interface details are secondary. Henry Ford famously wrote of the Model T, “Any customer can have a car painted any color that he wants so long as it is black.” Ford focused on reliability and efficiency, but he was also a fanatic A/B tester, in a slower pre-Internet era. One size fits all worked well for a time, but eventually General Motors catered to the niches—those who wanted luxury, something sportier, or just a different color. It is more expensive to produce multiple brands, but General Motors became the larger company. Similarly, indistinguishable Timex watches and black telephones were immensely popular, but eventually Swatches and a competitive phone market thrived on personalization.

Differentiation and personalization are in our nature. Our prehistoric ancestors developed different cultures and languages. They ornamented themselves. For a time, having a Facebook profile was a personal statement. When everyone is a Facebook member, more complex market segmentation will inevitably become important.

A/B testing will not necessarily mislead or cease to contribute, but it won’t be enough to earn an A and its affordances could be unfortunate. Rapid change works best if users do little customization. The more variation, the more a product becomes a platform, the messier change can be. My highly customized profile was blown away by Facebook changes. When individual contributors and managers prefer different interfaces, as they often do, a change can disrupt one or both. A/B testing in practice pushes gently toward “any color as long as it is black.” But cultures, organizations, and individuals like to customize.

Supplementing experimental approaches

I recently visited a school in which students used a particular device. They told me what they would like changed. Weeks later, I was having dinner with an employee of the company that made the device and suggested they visit the school. “We won’t do that,” she said with a rueful laugh, “we just do A/B testing.”

At the point where market segmentation becomes significant, you have to get out in the field to identify the segments and learn how they work. Today, with technology supporting our lives in ever finer detail, understanding the subtle effects requires getting out and looking closely. This is a golden age for quantitative exploration, for big data, and it is also a golden age for qualitative exploration. Qual and quant enthusiasts sometimes regard one another with suspicion, but individuals or companies that learn to use them together will win. Quantitative data can provide suggestions about where to look in depth; qualitative data will provide hypotheses about what is happening that quantitative data can then confirm, refute, or refine. A/B testing applied within market segments can deliver the power to determine whether different interfaces are needed or whether one, and which one, will suffice. As a partner, A/B testing could be back on track to getting an A. A/B testing that is not informed by the big picture, that is not supplemented with strong qualitative research, could get an F.
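A small sketch shows what A/B testing within segments can reveal that a pooled test hides. The segment names echo the individual contributor and manager example earlier in this post, but the counts are invented, and the comparison looks only at raw rates; a real analysis would also check whether each difference is statistically reliable.

```python
from collections import defaultdict

# Hypothetical results: (segment, variant, users, conversions).
observations = [
    ("individual contributors", "A", 8000, 960),
    ("individual contributors", "B", 8000, 800),
    ("managers", "A", 1000, 50),
    ("managers", "B", 1000, 120),
]

def pooled_winner(rows):
    """Variant with the highest conversion rate when segments are ignored."""
    users, conversions = defaultdict(int), defaultdict(int)
    for _, variant, n, converted in rows:
        users[variant] += n
        conversions[variant] += converted
    return max(users, key=lambda v: conversions[v] / users[v])

def winners_by_segment(rows):
    """Variant with the highest conversion rate inside each segment."""
    rates = defaultdict(dict)
    for segment, variant, n, converted in rows:
        rates[segment][variant] = converted / n
    return {segment: max(r, key=r.get) for segment, r in rates.items()}

print(pooled_winner(observations))
print(winners_by_segment(observations))
```

Pooled, variant A wins because the larger segment dominates the totals. Split by segment, the managers clearly prefer B, which is the signal that would justify going to talk to them.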

You better start swimmin' / Or you'll sink like a stone / For the times they are a-changin' —Bob Dylan

This post benefited from discussions with and ideas from Gayna Williams, and from an exchange with Michael Bernstein. Adam Pisoni material is from the cited link and a keynote that is not available online, used with permission.

Jonathan Grudin

Jonathan Grudin has been active in CHI and CSCW since each was founded. He has written about the history of HCI and challenges inherent in the field’s trajectory, the focus of a course given at CHI 2022. He is a member of the CHI Academy and an ACM Fellow.

@Jonathan Grudin (2013 07 02)

The Wikipedia article on A/B testing identifies major companies known to use the technique, ranging from Amazon & BBC to Walmart & Zynga.

ACM Interactions
