The human understanding when it has once adopted an opinion (either as being the received opinion or as being agreeable to itself) draws all things else to support and agree with it. And though there be a greater number and weight of instances to be found on the other side, yet these it either neglects and despises, or else by some distinction sets aside and rejects; in order that by this great and pernicious predetermination the authority of its former conclusions may remain inviolate. And therefore it was a good answer that was made by one who when they showed him hanging in a temple a picture of those who had paid their vows as having escaped shipwreck, and would have him say whether he did not now acknowledge the power of the gods, “Aye,” asked he again, “but where are they painted that were drowned, after their vows?” And such is the way of all superstition, whether in astrology, dreams, omens, divine judgments, or the like; wherein men, having a delight in such vanities, mark the events where they are fulfilled, but where they fail, though this happen much oftener, neglect and pass them by. But with far more subtlety does this mischief insinuate itself into philosophy and the sciences; in which the first conclusion colours and brings into conformity with itself all that come after, though far sounder and better... —Francis Bacon, First Book of Aphorisms, 1620
Confirmation bias is built into us. Ask me to guess what a blurry image is, then bring it slowly into focus. When it has become clear enough to be recognizable by someone seeing it this way for the first time, I will still not recognize it. My initial hypothesis blinds me.
Confirmation bias and its underlying mechanisms helped us survive. A rough pattern of colors that correlated with past sightings of saber-tooth tigers was a good reason to run. Sticking around to obtain statistically reliable proof did not aid survival. Eating something and becoming ill was a good enough reason to avoid it despite the occasional false conclusions, such as "tomatoes are poisonous." And belief in omens and divine judgments probably helped people endure lives that Bacon’s contemporary Thomas Hobbes described as “nasty, brutish and short.”
To get through life efficiently, we infer causality from correlational data without working out all possible underlying factors. “This intersection was slow twice, I should avoid it.” “Wherever wolves are thick so are wildflowers, so wolves must like flowers.” “Two Freedonians let me down, Freedonians are unreliable.” Confirmation bias underlies stereotyping: Once I have decided that Freedonians are unreliable, a reliable Freedonian is an exception and another unreliable Freedonian is a confirmation.
Bacon realized that deep understanding requires a higher bar. He is credited with inventing the scientific method to attack confirmation bias. Unfortunately, experimental methods help but do not overcome the power of confirmation bias, which remains the primary impediment to advancing scientific understanding. It affects all our research: experimental, systems work, design, quantitative analysis, and qualitative approaches.
Confirmation bias arising with experimental methods
Science is not well served by random experimentation. Clear hypotheses can help. For example, Bacon hypothesized that freezing meat could preserve it. But hypotheses have unintended consequences. He contracted pneumonia while doing the experiment and died. The less severe but more common problem is that hypotheses invite a bias to confirm and thereby miss the true account: the initially blurry image that isn’t recognized after we hazard a wrong guess as to its identity.
Approach hypotheses cautiously. In overt and subtle ways, researchers shore them up and ignore disconfirming evidence. We rationalize excluding inconvenient “outlier” data or we collect data until a statistically reliable effect is found. We don’t write up experiments that fail to find an effect, perhaps for good reason: It is all but impossible to publish a negative result. An outcome that by statistical fluke appears to confirm a hypothesis is published, whereas robust findings disconfirming it, though this happen much oftener, we neglect and pass by. It’s a severe problem. Simmons et al.’s Psychological Science paper “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant” demonstrates that by selective disclosure of methods, it is easy without fraud “to publish ‘statistically significant’ evidence consistent with any hypothesis.” Bakker and Wicherts found many statistical errors in a large sample of journal articles, with almost all errors favoring the experimenters’ hypotheses.
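The habit of collecting data until a statistically reliable effect is found can be illustrated with a small simulation. The sketch below is not taken from the papers cited above. The function names, the sample sizes (10 to 100 observations), and the simple z-test with known variance are all assumptions chosen for illustration. Every simulated experiment has no real effect, so any "significant" result is a false positive.

```python
import math
import random

Z_CRIT = 1.96  # two-sided 5% threshold for a z-test with known unit variance


def is_significant(total, n):
    """z-test of 'mean == 0' for the sum of n draws with known std dev 1."""
    return abs(total / math.sqrt(n)) > Z_CRIT


def run_trial(rng, min_n=10, max_n=100):
    """Return (fixed_n_hit, peeking_hit) for one experiment with NO real effect."""
    total = 0.0
    peeking_hit = False
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)  # the null hypothesis is true by construction
        if n >= min_n and is_significant(total, n):
            peeking_hit = True  # a researcher who peeks would stop and report here
    fixed_n_hit = is_significant(total, max_n)  # one planned test at max_n
    return fixed_n_hit, peeking_hit


def false_positive_rates(trials=2000, seed=0):
    """Share of null experiments declared significant under each strategy."""
    rng = random.Random(seed)
    fixed = peeking = 0
    for _ in range(trials):
        fixed_hit, peek_hit = run_trial(rng)
        fixed += fixed_hit
        peeking += peek_hit
    return fixed / trials, peeking / trials


if __name__ == "__main__":
    fixed_rate, peeking_rate = false_positive_rates()
    print(f"one planned test:     {fixed_rate:.3f}")
    print(f"test after each draw: {peeking_rate:.3f}")
```

Both strategies see the same simulated data, so the peeking rate can never be lower than the planned-test rate. With these assumed settings the planned test should stay near the nominal 5 percent, while testing after every observation should typically produce false positives several times as often.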
Behavioral studies face worse challenges. “Demand characteristics” elbow in: Everyone colludes, consciously or unconsciously, when they know what the researchers believe. In one experiment, lab assistants who were told that a set of rats had been bred for intelligence found that those rats learned mazes faster than normal rats, although all were actually normal rats. It was unclear why. Perhaps the lab assistants handled the “genius rats” more gently.
In HCI studies, friendly human participants often discern our wish and help us out. Double-blind studies where researchers step back and experiments are run by assistants ignorant of the hypotheses or the conditions can sometimes counteract this. But they require more work.
Sometimes there is no need for an hypothesis, as when a researcher with no preference compares two design alternatives. Alternatively, multiple hypotheses can be tested in a study; one or two can be disconfirmed, yielding an aura of scrupulousness. However, I have never seen a study in which all of the hypotheses were disconfirmed. Researchers want to appear smart; key hypotheses tend to be confirmed.
My colleague Steve Poltrock notes that “marginally significant” findings or “trends” are often used to support hypotheses when accepted statistical measures fail to do so. People rarely report trends that counter their hypotheses, citing such differences as “not significant.” It’s human nature.
Design and systems studies
Researchers in our field typically test their own designs and prototype systems. Study participants know that designers hope their designs are liked. They know that system builders hope the systems are liked. Papers invariably report that the designs and systems were judged to be promising. Yet all discussion disappears when subsequent experience proves disappointing. Our literature is full of promising prototypes that disappeared without explanation. This is a disservice to science and engineering. My own studies of promising prototypes are no exception. I once tried to publish a post-mortem of a failure but could not get it accepted, and with only the early positive reports in the literature, for years we received requests for the prototype code. The difficulty of publishing negative results is well known at the National Science Foundation and elsewhere, but I know of no efforts to address it.
Quantitative analysis and ‘predictive analytics’
Quantitative studies are no guard against these problems. In fact, they often exhibit a seductive form of confirmation bias: inference of a causal relationship from correlational data, a major problem in conference and journal submissions I have reviewed over the years. The researchers hypothesize a causal relationship, the correlational data fit, and the researchers take it as proven. Equally or more plausible causal models are ignored, even when plainly evident.
Suppose that heavy Twitter use correlates with being promoted in an organization. It may be tempting to conclude that everyone wanting promotion should use more social media. But causality could run the other way: Maybe already-successful employees use Twitter more, and incessant tweeting is a path to demotion for average employees. Or perhaps gregarious people tweet more and get promoted more. There is no causal explanation in the correlation. This is not fanciful; the literature is packed with such examples. Smart people seeing evidence consistent with their case do not look for alternative causal models.
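The gregariousness explanation can be made concrete with simulated data. In the sketch below, which is purely illustrative, a hidden trait drives both tweeting and promotion, and tweeting has no effect on promotion at all. The variable names, the noise levels, and the use of a straight-line adjustment are assumptions; no real data are involved.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def residuals(ys, xs):
    """What is left of ys after removing its best straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - my - slope * (x - mx) for x, y in zip(xs, ys)]


def simulate(n=5000, seed=1):
    """Return (raw, adjusted) correlations between tweeting and promotion."""
    rng = random.Random(seed)
    # Hypothetical hidden trait that influences both observed variables.
    gregarious = [rng.gauss(0, 1) for _ in range(n)]
    # Tweeting and promotion each depend on the trait, never on each other.
    tweets = [g + rng.gauss(0, 1) for g in gregarious]
    promotion = [g + rng.gauss(0, 1) for g in gregarious]
    raw = pearson(tweets, promotion)
    # Remove the trait's contribution from both, then correlate the leftovers.
    adjusted = pearson(residuals(tweets, gregarious),
                       residuals(promotion, gregarious))
    return raw, adjusted


if __name__ == "__main__":
    raw, adjusted = simulate()
    print(f"raw correlation:                 {raw:.2f}")
    print(f"after adjusting for gregariousness: {adjusted:.2f}")
```

The raw correlation should be clearly positive even though no causal link between tweeting and promotion was built in, and it should shrink toward zero once the hidden trait is accounted for. In a real study the confounding trait is often unmeasured, which is why the correlation alone cannot settle which causal model is right.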
The word predictive, often used to indicate a positive correlation, causes further problems. Predict has a strong causal connotation for the average reader, and researchers themselves often slide down the slope from “A is predictive of B” to “A causes B.”
As a father, one of my goals is to raise my children to distinguish correlation from causation. Unjustified causal assumptions about correlated events are so common that anyone who avoids them will find ways to be useful. Often we make the correct causal inference, and when dodging a possible saber-tooth a false alarm may be a modest price to pay for survival. However, problematic errors arise frequently in science, engineering, and everyday life.
After describing an ingenious quantitative analysis to explain patterns in “big data,” a conference presenter expressed frustration with a pattern that defied analysis. Someone suggested that the researcher simply contact some of those whose behavior contributed to the pattern to ask what was going on. His reply, in so many words, was “That would be cheating!” Clever quantitative analysis was his goal.
Especially today, with quantitative data so readily available, qualitative field studies are often dismissed as anecdotal and especially prone to confirmation bias. And in fairness, many studies claiming to be ethnographic are weak and good work faces challenges, as described below. So let me explain why I believe that qualitative research is often the best way to go.
I hold degrees in math, physics, and cognitive psychology. I appreciate all efforts to understand behavior. I drew distinctions between scientific approaches and those of history, biography, journalism, fiction—and anthropology. I felt that science involved formal experiments, controlled usability studies, and quantitative analysis. Then, in the mid-1980s, I read a short paper by Lucy Suchman, “Office Procedures as Practical Action,” that described the purchasing process at an unnamed company. A purchase order form was filled out in triplicate, whereupon the Purchasing department sent copies to Receiving and Finance. When orders arrived, Receiving sent an acknowledgment to Finance. When an invoice arrived, Finance found the order and receipt and cut the vendor a check. Very methodical! Suchman then said that the process is not routine and routinely requires solving problems and handling exceptions that arise. She included the transcript of a discussion between two people struggling with a difficult order, showing lots of inference and problem-solving. End of article.
I was shocked. How could a scientific journal publish this? A rational organizational process that I was sure usually worked smoothly, and she presented this one pathological case. I marched to the office of someone in Purchasing in my organization and asked her to explain our process. She said, “Someone fills out an order in triplicate, we send one copy to Receiving, the other to Finance. When the goods arrive…” And so on.
“Right,” I said. She looked at me. “That’s how it works,” I said.
She paused, then said, “Well, that’s how it’s supposed to work.” I looked at her. “But it never does. Something always goes wrong.”
I held out a copy of Lucy’s paper and asked, “Would you read this and let me know what you think?” The next day she told me, “She’s right. If anything, it’s worse than she said. Some exceptions happen so often we call them the standard exceptions, and then there are exceptions to the standard exception.”
“Thanks,” I said. I got it. Anthropologists are trained to avoid cherry-picking. You can’t spend two years describing a two-year site visit or two weeks describing a two-week study, so you rely on representative examples. Their methods can include copious coding and analysis of observations and transcripts. Some anthropologists are better than others. The approach might seem less foolproof than controlled experiments, but there is a method, a science. I started doing qualitative work myself.
The BBC drama Elizabeth I portrays a queen contending with chaotic scheming. A minor character, Francis, offers occasional thoughtful guidance. At one point someone refers to him as… Bacon! The apostle of scientific method in a setting devoid of evidence-based decision-making!? Actually, one of Bacon’s great contributions employed qualitative field research. Oxfordshire yeomen rebelled in the late 16th century. The customary response to an insurrection was suppression by force, but Bacon investigated and found them starving, forced off traditional farmland by aristocrats enclosing the land to create private hunting grounds. Powerful figures in the House of Lords insisted that this was a right of landowners, but Bacon pushed through and defended measures that preserved traditional access to land. (I highly recommend Nieves Mathews’ fascinating account of Bacon.)
In my research, when an hypothesis emerges, an explanation for patterns in the data, a constant priority is to find alternative explanations and disconfirming data. In presentation, Rob Kling noted that careful writing is more important in qualitative research because one word can make a huge difference: “X can lead to Y” or “X often leads to Y” is not the same as “X leads to Y.” Experimental and quantitative methods produce data that are reported in the paper; readers can consult the data. Discussion can be looser. A good qualitative report requires careful, honest selection of data and artistry in presentation to paint a picture for the readers.
The challenge is amplified when the researcher is expected to adopt a theoretical framework, to “build theory.” This invites selective filtering of observations. Another risk is “typing,” when a researcher becomes known for a particular observation or perspective, increasing the desire to confirm it. Some good qualitative researchers become predictable in what they report in each study. Sometimes aspects of the situation that seem central to understanding it go unstressed in the report; other times no such omission is noticeable, but one may still have occurred.
Can anyone undertake a study free of hypotheses? At an uninteresting level the answer is no—we all believe things about the world and people. But a better answer is often yes, we can minimize expectations. An ethnographer could study a remote culture assuming just that it is of interest to do so, or assuming that there is a complex kinship system that should be winkled out. The latter risks discovering something that is not there or missing something of greater interest. Similarly, we can examine the use of a new technology believing that it is likely to be interesting, or we can come in with preconceptions about how it will be used. The stronger the preconception, the greater the risk.
One way to approach this is grounded theory, a set of approaches that advocates minimal initial hypothesizing and the collection and organization of data in search of patterns that might form a foundation for theory. When a possible pattern is detected, the researcher seeks observations that do not fit, a step toward a richer understanding. Grounded theory has its detractors. It may not appeal to people for whom theory is where the fun resides. But it is the best fortress I know from which to defend against confirmation bias.
I’m not immune to confirmation bias, although I’m generally not so confident in any hypothesis as to resist seeing it disconfirmed. For example, I think of HCI as pre-theoretical, but rather than confirming that bias by ignoring or attacking all theories, I consider them and sometimes find useful elements. Years ago, I was dismayed to find data that didn’t fit a cherished pattern, but I eventually came to love disconfirming data, which is a necessary step toward a more complete understanding.
Am I biased about the importance of confirmation bias? I’m convinced that we must relentlessly seek it out in our own work and that of our colleagues, knowing that we won’t always succeed. Perhaps now I see it everywhere and overlook more significant obstacles. So decide how important it is, and be vigilant.
Jonathan Grudin is a principal researcher in the Natural Interaction Group at Microsoft Research. This post had unusual help for a non-refereed paper: Franca Agnoli, Steve Poltrock, John King, Phil Barnard, Gayna Williams, and Clayton Lewis identified relevant literature, missing points, and passages needing clarification.