**Authors:**

Maximilian Speicher

In my previous job, I was responsible for all A/B testing in an e-commerce department. We worked with an external partner and took care of the whole package—from generating hypotheses to developing the tests to providing a report of the results. Those reports were presented to a variety of stakeholders, including product management, merchandise, and the director of e-commerce. To conduct and analyze the tests, we worked with a dedicated experimentation platform that calculated all the numbers per treatment in the traditional, frequentist way. We had the number and share of visitors, values for the different metrics that were measured, a single number for the uplift (or effect size) of the primary metric, and a percentage denoting "confidence." Our reports displayed more or less the same numbers, but wrapped in a bit of story, with a recap of the test hypothesis and design and recommendations for action. This practice, however, unveiled a number of issues and misunderstandings.

First, it was problematic that we communicated a single number for effect size. This was misunderstood by many stakeholders (and I can't blame them for that) as the uplift we would *definitely* make, given that the test was "significant" (see below). But that number denoted only the maximum likelihood estimate. The uplift expected by stakeholders deviated from the true expected value. Sometimes, after implementing a successful test, this manifested in questions such as, "Why don't we see the 5 percent uplift that we measured during the test in our analytics numbers?"

Second, in tandem with the above estimate, confidence was also often misinterpreted. In our case, it was *1–p*, with *p* being the probability of results at least as extreme as the observed one, if there were no difference between control and treatment. And while this cannot even be interpreted as the chance that there's any uplift at all [1], many stakeholders mistook it for the actual chance to achieve the reported maximum likelihood estimate ("The test has shown a 96 percent chance of 5 percent uplift"). Truth be told, we didn't do a very good job of explaining all the numbers—p-values in particular are just so far from being intuitively understandable [1]—so part of the blame was on us. Yet, reporting something along the lines of "We're confident there's a chance of uplift, but we can't exactly tell how much" is also not an option in fast-paced business environments with short attention spans. It's a catch-22.
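To make the definition above concrete, here is a minimal sketch of how such a "confidence" figure can be derived. It assumes a pooled two-proportion z-test with a two-sided p-value; the test our platform actually ran is not specified here, and the function name and visitor counts are made up for illustration.

```python
import math

def two_proportion_confidence(conv_a, n_a, conv_b, n_b):
    """Return (relative uplift, two-sided p-value, 'confidence' = 1 - p).

    Assumption: pooled two-proportion z-test; a real platform may use another test.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a result at least this extreme if there were no difference.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    uplift = rate_b / rate_a - 1  # the single point estimate that got reported
    return uplift, p_value, 1 - p_value

# Hypothetical data: 20,000 visitors per arm, 1,000 vs. 1,050 conversions.
uplift, p_value, confidence = two_proportion_confidence(1000, 20000, 1050, 20000)
print(f"uplift = {uplift:.1%}, p = {p_value:.3f}, confidence = {confidence:.1%}")
```

Note that nothing in this calculation yields "the chance of achieving the reported uplift"; the confidence value is only a restatement of the p-value.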

Third, and most problematically, there was always the magical and impeccable threshold of "95 percent confidence" or "95 percent significance" (i.e., a significance level of α=.05) in stakeholders' minds. If a treatment had a confidence of ≥95 percent, it was considered good; otherwise it was considered bad, no further questions asked.

All of the above led to the following decision model when a test was completed:

`if (confidence ≥ .95 && uplift > 0) {`

`implement treatment;`

`} else {`

`keep control;`

`}`

And this is just plain wrong.

First of all, one can argue that a significance level of α=.05 is rather arbitrary. After all, economists usually work with α=.1 in their experiments, so why not use that? Still, "95 percent significance" was etched into many of our stakeholders' minds simply because it's the most widely used threshold. Working with such a relatively tight threshold might rob us of a lot of information. For a result that falls short of the magical 95 percent confidence, the only thing we can really deduce is that we can't reject the null hypothesis at the given significance level. We especially cannot deduce that the control should be kept! And if we're above 95 percent, we still can't reliably communicate what uplift one is probably going to make. In the else part of the above decision model, we've probably discarded many a treatment that would've proven good if we hadn't relied on p-values so much.

To make things worse, at a more fundamental level, a p-value from a *single* experiment is very much meaningless. Steven Goodman writes, "[T]he operational meaning of a *P* value less than .05 was merely that one should repeat the experiment. If subsequent studies also yielded significant *P* values, one could conclude that the observed effects were unlikely to be the result of chance alone. So 'significance' is merely that: worthy of attention in the form of meriting more experimentation, but not proof in itself" [2]. Additionally, p-values can vary greatly between repetitions of the same experiment [3,4].

In other words, to a certain degree it's *simply random* if a single result lies on this or that side of the magical threshold.
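A small simulation illustrates how much p-values can scatter between repetitions. This is only a sketch under assumed numbers: the true conversion rates (5.0 and 5.5 percent), the sample size, and the pooled two-proportion z-test are all hypothetical choices, not figures from our tests.

```python
import math
import random

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test (assumed test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_p_values(rate_a, rate_b, visitors, repetitions, seed=0):
    """Repeat the same experiment many times and collect each p-value."""
    rng = random.Random(seed)
    results = []
    for _ in range(repetitions):
        conv_a = sum(rng.random() < rate_a for _ in range(visitors))
        conv_b = sum(rng.random() < rate_b for _ in range(visitors))
        results.append(p_value(conv_a, visitors, conv_b, visitors))
    return results

# The treatment is truly better in every repetition; only the samples differ.
p_values = simulate_p_values(0.05, 0.055, visitors=5000, repetitions=200)
significant = sum(p < 0.05 for p in p_values)
print(f"min p = {min(p_values):.4f}, max p = {max(p_values):.4f}")
print(f"{significant} of {len(p_values)} repetitions fall below 0.05")
```

Although the underlying effect never changes, some repetitions land below the threshold and others far above it.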

That being said, as Jakob Nielsen explains, there are sometimes very good reasons to implement a treatment, even if it did not achieve a "significant result" [5]. After all, significant or not, there's always a certain probability for a certain uplift, which has to be weighed against implementation and opportunity costs [5]. But if maximum likelihood estimates and p-values are not suited to communicate that and facilitate informed decisions, what could be a better way? One potential remedy that's noted by both Steven Goodman [2] and Nassim Taleb [3] is to rely on a Bayesian approach, which has a number of advantages, including interpretability [2,4] and the possibility to report probabilities for minimum uplifts (based on a complementary cumulative distribution function).
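As a rough sketch of what "probabilities for minimum uplifts" can look like, the following assumes a simple Beta-Binomial model with uniform Beta(1, 1) priors and estimates the complementary cumulative distribution function by Monte Carlo sampling. The model, the function names, and the data are assumptions for illustration; commercial tools may use different priors and methods.

```python
import random

def posterior_uplift_samples(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Draw relative uplifts from independent Beta posteriors.

    Assumption: uniform Beta(1, 1) prior on each arm's conversion rate.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        samples.append(rate_b / rate_a - 1)
    return samples

def prob_uplift_at_least(samples, minimum):
    """Empirical complementary CDF: share of draws with uplift >= minimum."""
    return sum(s >= minimum for s in samples) / len(samples)

# Hypothetical data: 20,000 visitors per arm, 1,000 vs. 1,050 conversions.
samples = posterior_uplift_samples(1000, 20000, 1050, 20000)
for minimum in (0.0, 0.02, 0.05, 0.10):
    share = prob_uplift_at_least(samples, minimum)
    print(f"P(uplift >= {minimum:.0%}) = {share:.1%}")
```

Each printed line is a direct statement about the chance of reaching at least a given uplift, which is the kind of statement stakeholders tend to read into a p-value.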

So, what did we do? We did some good ole user research and talked to stakeholders about our reports and what they needed to make informed decisions. For all of the reasons stated above, we got rid of the notion of significance altogether. Instead, together with our partner, we started using Bayesian inference to calculate minimum uplifts for certain probabilities. Additionally, using a formula from our controlling department, we translated all the rather abstract conversion rates and average order values into expected added revenue per month. That is, at its core, our reports read, for example, "With a probability of 95 percent, we'll make ≥10,000 euros a month; with a probability of 90 percent, we'll make ≥15,000 euros a month," and so on. Now, without some magical threshold to rely on, our stakeholders had to actively decide whether they deemed the probability of a certain minimum additional revenue high enough to justify implementation of the corresponding treatment. They could calculate an expected ROI and make an informed decision based on that.
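The reporting style described above can be sketched as follows. The revenue formula here (difference in conversion rate × monthly visitors × average order value) is a hypothetical stand-in, not the controlling department's actual formula, and the Beta(1, 1) priors, function name, and all numbers are likewise assumptions.

```python
import random

def min_added_revenue(conv_a, n_a, conv_b, n_b, probability,
                      monthly_visitors, average_order_value,
                      draws=20000, seed=0):
    """Monthly added revenue that is reached with at least `probability`."""
    rng = random.Random(seed)
    revenue = []
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        # Hypothetical revenue formula, for illustration only.
        revenue.append((rate_b - rate_a) * monthly_visitors * average_order_value)
    revenue.sort()
    # The (1 - probability) quantile: at least `probability` of draws lie at or above it.
    index = min(int((1 - probability) * draws), draws - 1)
    return revenue[index]

# Hypothetical test result and business figures.
for probability in (0.95, 0.90, 0.80):
    euros = min_added_revenue(1000, 20000, 1150, 20000, probability,
                              monthly_visitors=500_000, average_order_value=80.0)
    print(f"With a probability of {probability:.0%}: >= {euros:,.0f} euros a month")
```

The lower the probability a stakeholder is willing to accept, the higher the minimum revenue they can plan with, which makes the trade-off an explicit business decision rather than a threshold check.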

I don't mean to say that we invented something new. There are already plenty of A/B testing tools going the Bayesian way, including Google Optimize, VWO [6], iridion, and Dynamic Yield, to name just a few. And yet, there are some tools, and many experimenters, that *still* blindly rely on notions of significance, which just doesn't make sense. There is no sudden change in the quality of a result solely because it passes an arbitrary threshold [3], and it is perfectly fine to conclude that in a given situation, 80 percent is a high enough chance to make an additional 500,000 euros a month.

The more I think and read about this topic, the more I'm convinced that p-values and significance are utterly ill-suited tools for A/B testing. They're difficult to understand, interpret, and communicate; they're unreliable; they prevent perfectly good treatments from being implemented; and there's a better alternative. Therefore, we should get rid of them.

1. Google. Optimize Resource Hub—for those used to frequentist approaches. 2021; https://support.google.com/optimize/answer/7404625#dont-significance-and-p-values-tell-me&zippy=%2Cin-this-article

2. Goodman, S. A dirty dozen: twelve p-value misconceptions. *Seminars in Hematology 45*, 3 (2008), 135–140; http://www.ohri.ca/newsroom/seminars/SeminarUploads/1829%5CSuggested%20Reading%20-%20Nov%203,%202014.pdf

3. Taleb, N.N. A short note on p-value hacking. *arXiv preprint arXiv:1603.07532*, 2016; https://arxiv.org/pdf/1603.07532.pdf

4. Amrhein, V., Korner-Nievergelt, F., and Roth, T. The earth is flat (*p* > 0.05): Significance thresholds and the crisis of unreplicable research. *PeerJ 5* (2017), e3544; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5502092/

5. Nielsen, J. Handling insignificance in UX data [Video]. YouTube, 2021; https://www.youtube.com/watch?v=QkY8bM5bAOA

6. Stucchio, C. Bayesian A/B testing at VWO. Whitepaper. Visual Website Optimizer, 2015; https://www.chrisstucchio.com/pubs/VWO_SmartStats_technical_whitepaper.pdf

**Maximilian Speicher** is a computer scientist, designer, researcher, cofounder of UX consulting firm Jagow Speicher, and a ring tennis player. Currently, he is director of product design at BestSecret. His research interests lie primarily with novel ways to do digital design, usability evaluation, augmented and virtual reality, and sustainable design.

Copyright held by author

The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.
