Henriette Cramer, Juho Kim
AI research has now been around for about 65 years, and the consequences of design decisions on AI outcomes have been a lively debate for 20-plus years, if not longer. Governments, companies, and investors are now pouring in copious resources to advance AI techniques and create “AI-powered” products. Amid the hype, however, people question whether breakthroughs are reproducible and transferable to practice, and who benefits from them [3,4]. Keeping up with the latest trends has become increasingly challenging, even for the experienced. And the definition of accepted terminology itself is ever changing. As we (HCI researchers and UX practitioners) struggle to keep up with where the field is going, it is easy to lose sight of its past, repeat mistakes, and stumble on unintended consequences.
We may have learned lessons from collectively building algorithmic systems for billions of users, but the field where UX meets AI is full of tensions. We are grappling with daily challenges, including classic but unsolved themes such as transparency, user control, calibrated trust, and fairness. That is why, following up on the earlier Interactions special topic on designing for AI, we are now introducing a more permanent forum to invite lessons learned on the intersection of UX and AI, from academia, industry, government, and art practice.
Looking at the intersection of UX and AI, a variety of tensions stand out that need to be navigated. Here we discuss just a few as starting points.
Tension 1: Incentivizing holistic success. Most practitioners have to juggle a variety of goals held by users, content creators, and business decision makers. Turning machine learning (ML) experimentation into actual practical impact can in itself be amazingly challenging and requires the successful integration of ML and UX practice.
ML practitioners are typically trained to find the best model that maximizes some utility. In reality, however, there is rarely a single objective to be maximized, and many objectives cannot be easily quantified. Advances that are unremarkable from a modeling standpoint could be valuable to practitioners. Systems can even be too good, when system predictions go well beyond what is expected or acceptable. Making it work shouldn’t be just about improving performance on a benchmark or optimizing for metrics like precision, recall, or click-through rate. It should also be accompanied by better outcomes for people, less harm, and, in the end, learning as a collective discipline. The UX of AI ultimately revolves around how to make modeling outcomes successful, but it also means learning what success actually means.
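To make the point about single-metric optimization concrete, here is a minimal sketch in Python. The two models and their confusion counts are invented for illustration and do not come from any real system; the only claim is that "best model" changes depending on whether precision or recall is the target.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Confusion counts for one hypothetical model on the same test set."""
    true_pos: int
    false_pos: int
    false_neg: int


def precision(o: Outcome) -> float:
    # Share of positive predictions that were correct.
    predicted = o.true_pos + o.false_pos
    return o.true_pos / predicted if predicted else 0.0


def recall(o: Outcome) -> float:
    # Share of actual positives that the model found.
    actual = o.true_pos + o.false_neg
    return o.true_pos / actual if actual else 0.0


# Illustrative numbers only; not taken from any real system.
models = {
    "model_a": Outcome(true_pos=80, false_pos=20, false_neg=40),
    "model_b": Outcome(true_pos=100, false_pos=60, false_neg=20),
}

best_by_precision = max(models, key=lambda name: precision(models[name]))
best_by_recall = max(models, key=lambda name: recall(models[name]))

print("Best by precision:", best_by_precision)
print("Best by recall:", best_by_recall)
```

The two rankings disagree, and neither says anything about outcomes for people or potential harm, which still have to be assessed separately.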
Tension 2: Embracing the bad: sharing failures and collective learning. Anyone or anything that learns also inevitably fails. In the context of AI-based art, the failures of the generative models used are exactly what make them productive for creativity, but people need to curate these failures into meaningful art. Perhaps we similarly need to collect failures in a productive way and put them on display.
When we call out the pitfalls of ML, we also need to find ways to practically mitigate them. We are seeing a very important uptick in calls to action to consider algorithmic responsibility and accountability, including the environmental consequences of using computationally intensive methods. However, this is creating a flurry of new frameworks and guidelines. Not only do we see a risk of an AI winter because of hyperbolic promises about the technology, but a responsibility winter due to practitioner fatigue is also not unimaginable. The Berkman Klein Center, for example, counted more than 30 sets of guidelines for rights and ethics-based approaches to AI (https://ai-hr.cyber.harvard.edu). Keeping up with all of these can be a full-time job and is still not sufficient to handle the messiness of practice.
Anticipating harms, including those that in hindsight are obvious, requires techniques that promote what-if foresight, and not just of the ML-prediction kind. It is often organizational skills for communicating effective critique that actually influence decision makers. Gaining those navigational skills requires learning from case studies and real-world experiences. A recent panel at Data4BlackLives, for example, provides engaging examples of how data-based predictions about climate needed translation into concrete local action with and by communities that are most likely to experience the negative impact. Collectively, we can get further by sharing these experiences.
Tension 3: Invisible layers of (unavailable) tooling. Within the confines of the system, there is seldom one model; we interact with layers of models, services, companies, and infrastructures built on top of each other. Kate Crawford and Vladan Joler’s “Anatomy of an AI System” is an artful critique that visualizes the complexity of human labor, data flows, and resources used for building a voice-activated speaker. Each of the many stages in the ML lifecycle needs human judgment to define the right approach, assess quality, and override obvious issues. Who is involved in these (non-)decisions, and how they are supported or incentivized, matters deeply. This turns product and engineering teams, crowdworkers, editors, and data curators into users themselves, who need explanations of what works and doesn’t, and methods to intervene and override systems. Any assessment and intervention requires dedicated infrastructure and tooling. This raises considerable UX challenges, including how common work practices such as agile product development could be adjusted.
Trying to explain all the layers of a system to all possible stakeholders may not be the right, nor the most feasible, approach here. Initiatives like the People & AI Guidebook by Google PAIR (pair.withgoogle.com) are useful, as they educate UXers about human-centered concerns. However, general guidelines might not directly translate to the specific domains and work practices that teams encounter in their own daily practice. This requires communicating tangible examples of UX decisions made and their impact in the real world.
Tension 4: Feedback loops feeding back into other feedback loops into other feedback, etc. The history of the UX decisions that intervene in the feedback loop between users, their communities, and systems isn’t neatly documented. Take, for example, explicit feedback features used in commercial products. Twitter changed from favorites to likes; Facebook added reactions beyond the like button; and Netflix changed its stars to thumbs up and down. While articles and blog posts exist to announce such changes, externally we may never know the full history behind these decisions and many less visible ones. This leaves many teams to learn these lessons themselves.
Similarly, when looking at calls for explainable AI and experiments conducted in practice, many are not necessarily informed by decades of research on different types of explanations. Academic classics like Herlocker et al.’s recommender explanations paper are staples for researchers but have rarely been translated for most UX professionals who currently design explanations. Rewarding the deliberate sharing of changes and their consequences would prevent others from making incorrect inferences about why something changed and wrongly applying a solution to their context.
Tension 5: How much AI is OK? Determining the right level of separation of work between AI and humans is a practical issue in many projects. In a recent research project one of us (Juho) has worked on, our team explored the feasibility of AI moderators in chat-based online discussion. Our team attached a natural language understanding unit to the system to infer user intent and discussion state, and tried to recommend contextually appropriate moderation messages to everyone in the chat. After extensive development and pilot testing, we soon realized full automation is not the right approach. It was not just a matter of accuracy of inference but also other aspects that were more crucial, such as how much trust users have in the machine moderator (no matter how accurate), how naturally moderation messages are delivered, and how customizable the automatically generated messages are.
We pivoted the project to focus on helping human moderators with real-time recommendations for moderation, and it ended up being the right balance of human work and automation. In many human-AI collaboration projects, we see a similar pattern: starting with an overestimation of how much AI can do, realizing the limitations of full automation, adding more human work, and finding the right balance. A related challenge is that, as model performance keeps improving and user expectations keep changing, determining “how much AI is OK” becomes a moving target. While more case studies of human-AI collaboration and general guidelines are published, we need more documented examples and effective methodologies for designing successful human-AI collaboration.
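The pivot from full automation to recommendations for a human moderator can be sketched as follows. This is a hypothetical illustration in Python, not the project's actual implementation: the state names, the suggested messages, and the functions `recommend`, `moderate`, and `example_reviewer` are all assumptions, and a simple lookup table stands in for the natural language understanding unit.

```python
from typing import Callable, Optional

# Hypothetical lookup standing in for the NLU unit's output; the real
# project's states and messages are not described in the article.
SUGGESTIONS = {
    "off_topic": ["Let's return to the main question."],
    "one_sided": ["Could someone share a different perspective?"],
    "stalled": ["Shall we summarize what we have so far?"],
}


def recommend(discussion_state: str) -> list[str]:
    """Return candidate moderation messages for an inferred state."""
    return list(SUGGESTIONS.get(discussion_state, []))


def moderate(
    discussion_state: str,
    review: Callable[[list[str]], Optional[str]],
) -> Optional[str]:
    """Return only what the human moderator approves (possibly edited)."""
    candidates = recommend(discussion_state)
    if not candidates:
        return None
    return review(candidates)  # None means the moderator declined


# Example reviewer: accepts the first suggestion but rewords it.
def example_reviewer(candidates: list[str]) -> Optional[str]:
    return candidates[0].replace("Let's", "Let us")


print(moderate("off_topic", example_reviewer))
```

The design choice worth noting is that the system never posts on its own: every message passes through the `review` callback, so the moderator can accept, edit, or decline a suggestion.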
Tension 6: AI literacy and power imbalances. Many ML algorithms are hard to explain, creating imbalances in understanding how to influence results. This becomes particularly challenging when high-stakes decisions are made by these algorithms. Examples include companies using algorithms to make hiring decisions, governments applying ML to make resource-allocation decisions, and police using facial recognition technology to identify suspects. Initiatives like the Campaign to Stop Killer Robots point to the consequences of the imbalances of who does and does not have the pathways to influence algorithmic outcomes. Recently, protesters in Hong Kong have contended with facial recognition technology and smart-city infrastructure. Search-engine optimization and online ad bidding are day-to-day examples of people coming up with clever strategies that influence algorithmic performance to meet their goals. All require AI literacy.
There’s often tension or incentive misalignment between the side that deploys AI (e.g., companies, organizations, and governments) and the recipient of AI-powered outcomes (e.g., users and citizens). This also applies to efforts to democratize ML techniques. While techniques have become much more accessible, developers with little training and awareness can easily deploy potentially harmful but powerful products. San Francisco may have banned facial recognition usage by police or municipal agencies earlier this year, but that doesn’t stop developers from building for private purposes. Awareness of the severe consequences of misuse requires both knowledge sharing and community organization.
Just like the field where UX and AI meet, this new forum is still somewhat ambiguous. We need your experiences to shape it and to shape the discipline. If we as a community do not document, translate, and share our lessons learned, AI-powered products will fail to serve user needs and will neglect the potential negative impact of technology. We welcome your lessons on how to address all tensions and gaps between research and daily practice.
We are especially interested in submissions that explore the integration of UX and AI that involves deliberate decision making. Similarly, we welcome examples where a potentially problematic tension was turned into positive impact, and stories that speak frankly about failures or organizational lessons that wouldn’t be published in regular research venues. While these may appear tactical or incremental, they are often of crucial importance to teach practitioners what (not) to do and what results to expect, and to inform researchers about problems worth tackling.
We owe a big thank you to Elizabeth Churchill, Sam Way, Jasmine McNealy, Mat Budelman, Carrie Cai, and Minsuk Chang for great feedback and conversations on the direction of this forum. And you, lovely reader and future writer for this forum, we invite your input to help shape this field.
2. Dacrema, M.F., Cremonesi, P., and Jannach, D. Are we really making much progress? A worrying analysis of recent neural recommendation approaches, Proc. of the 13th ACM Conference on Recommender Systems. ACM, New York, 2019, 101–109.
3. Battle, C.P., Gunder, V., Ross, D., Rice, L., and Venkataraman, B. “The seas are rising but so are the people”: Data, disaster & collective power. Panel at Data4BlackLives. 2019; https://youtu.be/iLBzLY1MEgA
4. Crawford, K. and Joler, V. Anatomy of an AI system. 2018; https://anatomyof.ai
7. Vaughan, J.W. and Wallach, H. Machine learning lifecycle & fairness. Microsoft Research Webinar. 2019; https://note.microsoft.com/MSR-Webinar-Machine-Learning-and-Fairness-Registration-LIVE
Henriette Cramer is a principal research scientist and leads Spotify’s algorithmic responsibility effort. She is particularly interested in how design and organizational (non-)decisions impact algorithmic outcomes, and pragmatic ways to translate between research and unwieldy practice. She has worked on natural language interactions, recommendation, and ad applications at Spotify and Yahoo, and investigated location-based interactions and human-robot interaction at SICS. She holds a Ph.D. from the University of Amsterdam.
Juho Kim is an assistant professor in the School of Computing at KAIST, where he directs KIXLAB. His research focuses on building interactive systems that support interaction at scale and empower collective action, specifically by exploring ways to combine human and machine intelligence. Online education, civic engagement, and online discussion have been the main application domains of his work. He earned his Ph.D. from MIT.
Copyright held by authors. Publication rights licensed to ACM.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2019 ACM, Inc.