Authors:
Meredith Ringel Morris
The past few years have seen rapid advances in frontier AI models, demonstrating increasing performance and generality. As progress continues toward artificial general intelligence (AGI), HCI scholarship and practice have a critical role to play in ensuring that AI technology is useful to and usable by people to accomplish tasks they value. HCI insights can help us maximize the benefit of AI technologies to individuals, communities, and society while helping us understand and mitigate potential harms. In this article, I describe a research vision for the field of HCI in the AGI era, examining how HCI researchers can innovate in interaction techniques, interface designs, physical form factors, design methods, evaluation methods, benchmarking approaches, and data collection techniques.
Interfaces for advanced AI systems are in their infancy. While ChatGPT demonstrated that even simple user interfaces are powerful in making models accessible to a broad user base, prompt-based systems remain challenging for most users to harness effectively [1]. This has resulted in the emergence of new areas of expertise, such as prompt engineering, and the creation of platforms like prompt marketplaces. Developing rich multi- and cross-modal, responsive interaction techniques that are learnable yet powerful is a vital area for HCI research.
→ The advent of increasingly powerful AI systems requires new research agendas for HCI and innovations in HCI/UX practice.
→ We must update and innovate design and evaluation methods for AGI systems and their precursors.
→ HCI research can support the creation of useful, usable AGI systems.
For example, advances in machine learning (ML) may mean that noninvasive, high-bandwidth brain-computer interfaces (BCIs) are feasible within a few years [2]. These technologies are currently nascent, so common HCI approaches such as "Wizard of Oz" studies and other simulation techniques will be crucial for investigating how to integrate BCIs into human-AI interaction, allowing us to strategically position and harness these techniques as they mature in neuroscience labs.
In addition to designing and evaluating explicit human-AI interaction techniques such as BCIs, another important area for innovation and analysis is around implicit interactions. Future AI systems will have access to a wide range of data sources. Examples include capturing real-world contextual information, such as data about the location and identity of objects and people, casual gestures, facial expressions, and prosody, as well as biometrics conveying affective and health data and myriad other content that might implicitly signal user intent. Systems will also likely have access to large interaction and preference histories from specific users and user groups, as well as various forms of explicit input such as text, voice commands, sketches, and intentional gestures. Understanding how to best combine explicit and implicit interactions and when and how to personalize systems based on user models or histories remains a challenge. For example, it is unclear how an AI system should weight the relative importance of each of potentially dozens of implicit input signals. The answer likely depends heavily on the task and on user preferences, factors that make it difficult to learn general principles. While it may be possible for future AI systems to determine optimal, per-instance weighting of such inputs, such systems may not be interpretable by end users, developers, or other stakeholders who value explanations of system behavior.
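To make the weighting problem concrete, the sketch below is a minimal illustration, with hypothetical signal names, tasks, and hand-picked weights rather than a description of any deployed system. It fuses a handful of implicit and explicit signals into a single intent estimate using task-dependent weights; a learned, per-instance weighting would replace the hand-authored tables but would be harder for stakeholders to inspect.

```python
# Minimal sketch of combining explicit and implicit input signals into a single
# confidence score for a candidate user intent. Signal names, weights, and tasks
# are hypothetical; a real system might learn per-instance weights instead,
# at some cost to interpretability.

# Each signal is a normalized confidence in [0, 1] that the candidate intent is correct.
signals = {
    "explicit_text_command": 0.9,      # user typed a request
    "gaze_on_target_object": 0.6,      # implicit: gaze dwell near the referenced object
    "prosody_urgency": 0.3,            # implicit: vocal urgency
    "interaction_history_match": 0.7,  # implicit: similarity to past accepted actions
}

# Hand-authored, task-dependent weights. Choosing these weights (or trusting a
# model to learn them per instance) is itself a design problem.
task_weights = {
    "photo_editing": {
        "explicit_text_command": 0.6,
        "gaze_on_target_object": 0.25,
        "prosody_urgency": 0.05,
        "interaction_history_match": 0.1,
    },
    "emergency_assistance": {
        "explicit_text_command": 0.4,
        "gaze_on_target_object": 0.1,
        "prosody_urgency": 0.4,
        "interaction_history_match": 0.1,
    },
}

def intent_confidence(task: str, signals: dict[str, float]) -> float:
    """Weighted combination of signal confidences for a given task."""
    weights = task_weights[task]
    return sum(weights[name] * value for name, value in signals.items())

for task in task_weights:
    print(task, round(intent_confidence(task, signals), 3))
```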
HCI scholarship and practice have a critical role to play in ensuring that AI technology is useful to and usable by people to accomplish tasks they value.
Key issues that HCI researchers and practitioners have long grappled with remain critical, especially the balancing of considerations like learnability, usability, accessibility, and privacy. For example, it is likely that future advanced AI systems could automatically create highly personalized interfaces for particular on-demand tasks based on knowing users' preferences, abilities, and context. Such bespoke interfaces would likely reduce cognitive load and other sources of friction for an end user, but they would require the user to share large amounts of sensitive data with an AI system, such as disability status and a detailed history of past interactions. In addition to these individual-scale considerations, we also have to consider societal-scale impacts of highly personalized advanced AI, including the trade-offs of individual benefits of personalization—learnability, accessibility, and accuracy, for example—versus potential societal drawbacks (e.g., filter-bubble-type siloing effects that extend beyond the media ecosystem).
In addition to developing novel interaction techniques, it is necessary to conduct research that can support usability by bridging the gulfs that emerge between AI systems and end users. In addition to Don Norman's classic gulfs of execution and evaluation (see sidebar 1), generative AI systems introduce a process gulf because they execute tasks in ways that tend to differ from human problem-solving approaches and that are typically uninterpretable by end users and often even by ML experts (see sidebar 2). For instance, diffusion models used in popular generative image tools like Stable Diffusion create pictures by iteratively refining noise, which is quite different from the process a human artist typically takes. Key research challenges lie in developing novel approaches to bridge the gulf of execution (that is, conveying the rich affordances of AGIs to end users and ensuring clarity in the specification of their intent to the system), the process gulf (for example, through interfaces that support interpretability of and control over the means by which AGIs execute tasks), and the gulf of evaluation (for example, by helping end users validate correct task completion, even for tasks that may be beyond the limits of their personal knowledge). Indeed, one challenge for the HCI community to consider is that it may be theoretically impossible for AGI systems and their sufficiently advanced precursors to bridge gulfs in human-computer interaction [4]. For instance, any sufficiently general AI system might have near-infinite affordances, making it difficult or impossible to fully bridge the gulf of execution; absent significant advances in mechanistic interpretability research, it remains impossible to fully bridge the process gulf for most modern ML models. Given that advanced AI systems will be used for many tasks that are beyond the skills of average or even expert humans, it may not be possible to fully bridge the gulf of evaluation, since end users may not be able to determine if their task was accomplished correctly. Research into how to design interfaces that fully or even partially bridge these gulfs for AGI systems and their precursors would enhance alignment (see sidebar 3) by supporting clarity in user interactions, which would not only increase usability but also support AI safety objectives.
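As a toy illustration of the process gulf, the loop below is a schematic sketch only, not a real diffusion model: it starts from random noise and iteratively nudges it toward a target. The intermediate states are noisy numeric blends that look nothing like a human artist's intermediate sketches, which is part of why the process is hard for end users to interpret or steer.

```python
import random

# Toy illustration of iterative refinement from noise, in the spirit of how
# diffusion-based image generators denoise toward an output. This is NOT a
# real diffusion model; it only shows why intermediate states are opaque.

random.seed(0)
target = [0.2, 0.8, 0.5, 0.9]               # stand-in for "the finished image"
state = [random.random() for _ in target]   # start from pure noise

STEPS = 5
for step in range(1, STEPS + 1):
    # Each step removes a little "noise" by blending toward the target.
    state = [s + (t - s) / (STEPS - step + 1) for s, t in zip(state, target)]
    print(f"step {step}: {[round(x, 2) for x in state]}")

# Unlike a human artist's rough sketch, the intermediate states above carry no
# human-legible structure that a user could inspect, critique, or edit.
```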
Furthermore, it is essential for us to study the impact of interface metaphors such as anthropomorphism on users of advanced AI systems. As we progress toward AGI, it will become increasingly tempting for end users to anthropomorphize more-powerful systems, particularly those that employ agentic interaction styles. Anthropomorphic interfaces may offer benefits, such as helping users form a mental model of an AI agent's affordances or even offering a delightful and engaging user experience. On the other hand, anthropomorphic interfaces may increase risks, such as leading users to overtrust an AI system or even promoting unhealthy parasocial relationships. It is incumbent upon us to investigate these issues so that we can make informed and responsible decisions about whether and how to employ anthropomorphic interfaces. It is vital that such studies analyze realistic, long-term deployments, since many impacts of anthropomorphic designs, such as addictiveness, inappropriate attachment formation, or mental health concerns, may not emerge in the time frame of lab evaluations [6]. Such investigations should also consider the likely differential impacts of anthropomorphic AI on vulnerable user groups, such as children, older adults, people with lower levels of formal education, people with intellectual disabilities, neurodivergent users (e.g., people with autism), and people experiencing mental health challenges, such as depression or anxiety. In addition to considering the impact of design metaphors for end users with particular demographic, cognitive, and mental health characteristics, it may also be important to understand how transient contextual factors may alter the impact of interface metaphors. For instance, a user who has recently lost a loved one may react very differently to an anthropomorphized AI agent, or "generative ghost" (https://arxiv.org/abs/2402.01662), modeled on the deceased individual than someone whose bereavement is further in the past.
Another opportunity for HCI is to design the next generation of physical form factors for emerging AI systems. Some start-ups have proposed wearable devices, such as the Humane Ai Pin, that provide an always-on interface to an AI assistant gathering environmental context via sensors. While Humane's pin product was a market failure—in part, it seems, because users' needs in deployed settings were not adequately tested for or prioritized (https://bit.ly/3Zp6MTx and https://bit.ly/4iqGE3F)—it draws attention to the provocative question of what novel form factors will support seamless and delightful interactions with future AI-powered systems while not detracting from our human-human interactions.
A key design challenge for HCI researchers is envisioning form factors that can support natural human-AI interactions without negatively affecting human-human interactions. Mobile phones are an example of a design that works well for supporting individual users' needs but detracts from colocated group interactions, reducing eye contact and shared points of reference among group members absorbed in individual screens. How might we design form factors for AI systems that anticipate and avoid negative social contingencies? Output modalities such as audio and haptic feedback and interface paradigms like augmented reality and tangible computing—all long-standing subjects of inquiry and innovation in HCI—might form the building blocks of solutions to these problems, perhaps in combination with novel interaction methods and hardware.
One open question is whether HCI's numerous design methods for ideating and iterating require extension for the era of advanced AI. For instance, the previous section mentioned the challenge of understanding how advanced AI deployed at scale might affect social interactions, and how changes in interface design might mitigate such impacts. Are the existing tools in our methodological toolbox, such as low-fi prototyping, Wizard of Oz techniques, participatory and community-based design, and speculative design, up to the challenge of helping us understand complex Nth-order effects of future technologies over time? Are there new approaches, perhaps including AI tools as design partners, that might expand our creativity in solving complex design challenges? For instance, generative AI tools might help us reflect on a range of unintended negative side effects of our designs [7].
As an example of a new area for developing design methodology, I believe HCI experts should develop a formal approach for what I call green teaming (a riff on the notion of red teaming in the AI safety community, in which human experts, crowd workers, and adversarial models test whether they can get models to exhibit a variety of harmful behaviors). Given that AGI systems and their precursors can theoretically accomplish a near-limitless range of possible tasks, green teaming methods could be used to help us understand how to steer interface designs in directions that are most useful. Just as red teaming helps discover previously unknown harms of AI models, green teaming would be a method to help uncover previously unanticipated positive applications of AI. For example, green teaming methods might help us quickly iterate on AI systems by understanding surprising ways in which people are attempting to use them, and adjusting models and interfaces to better support those "unknown unknown" use cases. Green teaming methods might build upon best practices from HCI, such as iterative prototyping, experience sampling, diary studies, log analysis, field deployments, and focus groups, and expand these techniques to support replicability, scaling, and hybridization of human-in-the-loop methods with AI-in-the-loop methods, increasing the power and scope of green teaming to reveal novel design directions worth pursuing.
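One possible starting point for such a green-teaming workflow, sketched below under the assumption that we have a log of real or AI-generated usage prompts (the prompts, stopword list, and clustering heuristic are all illustrative, not an established method), is to surface low-frequency clusters of requests—candidate "unknown unknown" use cases—for human or AI-assisted review.

```python
from collections import defaultdict

# Sketch of one step in a hypothetical "green teaming" pipeline: group logged
# usage prompts by crude keyword overlap and surface small, unusual clusters
# as candidate "unknown unknown" use cases for closer review. A real pipeline
# would use better clustering and much larger logs.

logged_prompts = [
    "summarize this meeting transcript",
    "summarize this research paper",
    "summarize my doctor's notes in plain language",
    "write a bedtime story in my grandmother's dialect",
    "turn my physical therapy plan into daily phone reminders",
    "summarize this contract",
]

STOPWORDS = {"this", "my", "in", "a", "the", "into"}

def keywords(prompt: str) -> frozenset[str]:
    return frozenset(w for w in prompt.lower().split() if w not in STOPWORDS)

# Group prompts that share at least one content keyword with a cluster's seed.
clusters: dict[frozenset, list[str]] = defaultdict(list)
for prompt in logged_prompts:
    kw = keywords(prompt)
    seed = next((s for s in clusters if s & kw), kw)
    clusters[seed].append(prompt)

# Small clusters are surprising uses worth a closer, human-in-the-loop look.
for seed, members in clusters.items():
    if len(members) == 1:
        print("candidate unknown-unknown use case:", members[0])
```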
A key design challenge for HCI researchers is envisioning form factors that can support natural human-AI interactions without negatively affecting human-human interactions.
An existential question for our profession, deserving of systematic study, is the extent to which current human-powered design processes can (or cannot) be automated by near-term or future AI. For example, it is not unreasonable to hypothesize that among the many capabilities of an AGI would be the ability to create a highly optimized user interface for any given task and user combination. What are the right approaches for evaluating the strengths and limitations of AI-powered design versus traditional methods or hybrid methods, in which designers work in tandem with powerful AI tools? Are we ready as a field to yield some aspects of design to machines if the results of such studies reveal AI design to be optimal in some circumstances? While this may sound radical, consider how tools like GUI tool kits obviated the need for designers to custom-craft widgets such as buttons and menus in many circumstances, introducing efficiencies and freeing the field's energies to focus on novel challenges. AI-automated design solutions may be a similar step change for the field of interface development.
The science of synthetic evaluation—that is, using generative AI to simulate users—is a crucial, and controversial, new frontier for HCI. While many are rightly skeptical of the claims of new start-ups offering synthetic user testing (e.g., Synthetic Users, OpinioAI, and User Persona), synthetic usability testing is clearly an important and growing trend. While some start-ups in this space may be overpromising and underdelivering at this relatively early stage of AI capability, there is debate within our field over whether more sophisticated and high-quality synthetic evaluation methods are functionally and even morally appropriate to employ. For instance, Albrecht Schmidt et al. presented the pragmatic argument in favor of AI-powered evaluations in a recent Interactions piece [8], noting, "Involving people is not an end in itself…. If we can create a system with the same or a higher bar of quality using AI models, we need not involve people." In contrast, a recent CHI article argued that synthetic usability presents an "Illusion of Artificial Inclusion" [9] that may undermine our values as a human-centered profession.
One key benefit of synthetic evaluation methods for novel AI systems may be safety—the ability to test technologies in simulation, with synthetic agents, to learn of and mitigate potential harms without the need to expose people to risky, early-stage technologies. Another advantage is the potential to scale testing beyond what can be accomplished in a lab, as well as reducing monetary and temporal costs associated with conducting evaluations. Additionally, synthetic evaluation approaches may ultimately lead to the development of more-inclusive systems by allowing simulation of diverse participants, including those from difficult-to-recruit demographics or with intersectional identities; this is particularly important for mitigating potential inequities that may inadvertently arise from novel AI technologies.
HCI can lead in developing a rigorous science of synthetic evaluation, including deeply understanding the pros and cons of various synthetic usability approaches and whether and how they should be combined with more-traditional evaluation techniques. For example, we may discover that some evaluation methods, such as heuristic evaluation, lend themselves well to automation. Others, such as usability testing, might do so only with caveats. What standards might we need to establish as a field to lend confidence to the findings of synthetic user studies? Might we require such studies to include data, code, and other artifacts like prompts and generative agents' seeds so that other researchers can inspect and replicate them? What, if any, differences arise when conducting a given class of evaluation with humans, with completely synthetic AI agents (i.e., seeded from persona descriptions), or with AI agents modeled on specific, real people? Does synthetic evaluation work equally well for all user groups, or does it perform differentially for WEIRD (Western, educated, industrialized, rich, democratic) constituencies who may be better represented in models' training data? Developing a benchmark of classic user studies to help us evaluate the pros, cons, and progress of synthetic evaluation in a systematic way could help our community accelerate and standardize this emerging area of practice.
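As one concrete answer to the replicability question, a synthetic study might be archived as a self-describing bundle like the sketch below. This is a minimal illustration: the fields, the `run_synthetic_session` stub, the model name, and the persona text are all hypothetical, but the idea is that prompts, seeds, and persona definitions become inspectable artifacts other researchers could re-run.

```python
import json
import random
from dataclasses import dataclass, field, asdict

# Sketch of an archivable record for a synthetic usability session, so that
# prompts, seeds, and persona definitions can be inspected and replicated.
# All fields and the stubbed agent below are hypothetical illustrations.

@dataclass
class SyntheticSession:
    model_id: str                 # which model simulated the participant
    persona: str                  # persona description used to seed the agent
    seed: int                     # RNG seed for replicability
    task_prompt: str              # the usability task given to the agent
    transcript: list[str] = field(default_factory=list)

def run_synthetic_session(session: SyntheticSession) -> None:
    """Stub: a real harness would call a generative agent here."""
    rng = random.Random(session.seed)
    session.transcript.append(f"[agent:{session.model_id}] attempts task: {session.task_prompt}")
    session.transcript.append(f"[agent] reports confusion at step {rng.randint(1, 5)}")

session = SyntheticSession(
    model_id="hypothetical-model-v1",
    persona="A 72-year-old retired teacher, low vision, new to voice assistants.",
    seed=42,
    task_prompt="Set a medication reminder for 8 a.m. every day.",
)
run_synthetic_session(session)

# The archived JSON is the replication artifact another team could inspect or re-run.
print(json.dumps(asdict(session), indent=2))
```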
Status quo ML benchmarks tend to focus on tasks that yield a quantitative metric and are fully automatable (e.g., BIG-bench, https://github.com/google/BIG-bench; MMLU, https://github.com/hendrycks/test; etc.). For example, the MMLU (Massive Multitask Language Understanding) benchmark includes multiple-choice questions spanning 57 subjects, compiled from publicly available challenges (such as the GRE graduate school exam) in a variety of domains (history, law, math, etc.). Answers can be submitted to automated scoring systems, and leaderboards track how well various models perform on MMLU and other similar compendiums of tasks (https://bit.ly/4eZVcnH). Some benchmarks (e.g., LMSYS, https://chat.lmsys.org/; SEAL, https://scale.com/leaderboard; and portions of HELM, https://crfm.stanford.edu/helm/) also include interactive components involving human ratings, but still focus on quantitative feedback (i.e., ranking models' outputs relative to each other, or assigning a Likert scale rating to an output). These quantitative metrics offer value, in part because they support automated leaderboards that motivate progress. If our goal is to ensure that AGI systems and their precursors are safe and useful in the real world, however, we need to update our benchmarking repertoire to include a wider variety of tasks and metrics that have ecological validity for real-world use cases. Going back to the MMLU example, it is unknown how incremental improvements in MMLU scores translate into real-world utility (i.e., does doing slightly better on GRE-type questions relate to performance on tasks that matter to end users?). The same critique can be applied to the vast majority of status quo ML benchmarks.
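For readers unfamiliar with how such leaderboards are computed, the sketch below shows the basic shape of automated multiple-choice scoring: compare a model's chosen option against an answer key and report accuracy. This is a toy example with made-up questions and a stubbed model, not the actual MMLU harness.

```python
# Toy illustration of automated multiple-choice benchmark scoring in the style
# of MMLU-like evaluations. Questions, answers, and the stubbed "model" are
# made up; a real harness would prompt an actual model for each item.

benchmark = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1},
    {"question": "Capital of France?", "choices": ["Rome", "Madrid", "Paris", "Berlin"], "answer": 2},
]

def stub_model_answer(question: str, choices: list[str]) -> int:
    """Stand-in for a model call; always picks the first choice."""
    return 0

correct = sum(
    1 for item in benchmark
    if stub_model_answer(item["question"], item["choices"]) == item["answer"]
)
accuracy = correct / len(benchmark)
print(f"accuracy: {accuracy:.0%}")  # a single number, easy to put on a leaderboard

# The critique above: this number says nothing about whether a slightly higher
# accuracy translates into real-world utility for end users.
```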
Designing metrics and conducting evaluations are core HCI skills. For instance, HCI methods such as experience sampling, diary studies, participatory design, task modeling, and cognitive walkthroughs can be used to identify ecologically valid tasks people may wish to accomplish with advanced AI. In addition to identifying tasks, we can also use HCI methodologies to identify success metrics for such tasks, grounded in real user needs and values (i.e., what percent accuracy must an AI system have for a given task to be deemed practical, safe, or ethical to use?). Our community is well poised to develop novel benchmarks that assess what real end users value rather than what is simply easy to measure, including open-ended, multi-step tasks, where perceptions of answer quality may vary widely across end users, or even wicked problems that lack optimal solutions. While such tasks might lend themselves to some quantitative metrics (including objective metrics such as correctness and subjective metrics such as Likert ratings or rankings), our community can lead the way in developing benchmarks that use qualitative data and methods for analyzing this new class of benchmark. Such qualitative outputs might include detailed traces of users' interactions with a system, users' open-ended feedback on the quality of a system's response, and meta information about a user's experience.
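A record for such an ecologically grounded benchmark item might look like the sketch below. The field names are purely illustrative, not a proposed standard; the point is to pair familiar quantitative scores with the qualitative traces, open-ended feedback, and contextual meta information described above.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative (hypothetical) schema for a benchmark item that mixes
# quantitative metrics with qualitative evidence about the user's experience.

@dataclass
class EcologicalBenchmarkResult:
    task_description: str              # an open-ended, real-world task
    objective_correct: Optional[bool]  # may be None for wicked problems with no ground truth
    likert_quality: Optional[int]      # subjective 1-5 rating from a (real or synthetic) user
    interaction_trace: list[str] = field(default_factory=list)  # detailed trace of steps taken
    open_ended_feedback: str = ""      # user's free-text assessment
    context_notes: str = ""            # meta information about the user's situation

result = EcologicalBenchmarkResult(
    task_description="Plan an accessible three-day trip for a wheelchair user on a fixed budget.",
    objective_correct=None,
    likert_quality=4,
    interaction_trace=["asked clarifying question about budget", "proposed itinerary", "revised hotel choice"],
    open_ended_feedback="The revised plan respected my budget, but step-free routes were not verified.",
    context_notes="Participant recruited for lived experience with mobility-related accessibility needs.",
)
print(result.task_description, "->", result.likert_quality)
```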
Human computation methodologies have already proved their value to the ML community in the design of crowd-powered interfaces for tasks such as data enrichment and reinforcement learning from human feedback (RLHF). In addition to continuing to refine these workflows by drawing on best practices from HCI, a related area for future work is developing interfaces to support gathering high-quality data for training future models. While synthetic data might support some applications, there is evidence that overuse of such data may result in "model collapse," wherein AI performance degrades after training on too much AI-generated content [10]. Even if model collapse were not a concern, the diversity of current AI training data remains an issue—there is a need to collect high-quality data from diverse participants (e.g., speakers of low-resource languages, older adults, and people with disabilities), suggesting the necessity of developing interfaces and ecosystems that support the ethical collection of high-quality, multimodal data from diverse stakeholders.
The creation of such a data collection infrastructure is a sociotechnical challenge that would benefit from our community's expertise. HCI researchers can draw on our community's knowledge in developing social computing and crowd computing systems as models for future participatory data architectures. Further, HCI research already has established practices, such as community-based design methods, for considering the complex trade-offs involved in designing systems that are inclusive and can support participation by diverse communities not well represented in current AI training data without unintentionally exploiting those stakeholders (that is, supporting Lanier's concept of data dignity; https://www.newyorker.com/science/annals-of-artificial-intelligence/there-is-no-ai). Positioning data collection as a continuous process that may occur at many points in a system life cycle may also yield novel insights into efficient, ethical, and privacy-sensitive data collection methods. Different approaches may be required for collecting and curating initial training data, implicit data that is a side effect of interactions with deployed systems, and ad hoc data collected to address system errors or biases that become apparent post-deployment.
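One way to make data dignity operational, sketched below purely as an assumption-laden illustration (the field names are hypothetical, not an established standard), is to attach consent, provenance, and compensation metadata to every contributed item so that it can be audited, attributed, and withdrawn at any point in the system life cycle.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical sketch of a contributed-data record carrying the consent,
# provenance, and compensation metadata that a data-dignity-respecting
# collection ecosystem might require. Field names are illustrative only.

@dataclass
class ContributedDataRecord:
    contributor_id: str      # pseudonymous ID; supports attribution and withdrawal
    language: str            # e.g., a low-resource language the contributor speaks
    modality: str            # "text", "speech", "video", ...
    collection_stage: str    # "initial_training", "deployment_sideeffect", "post_hoc_fix"
    consent_scope: str       # what uses the contributor agreed to
    compensation_usd: float  # payment associated with this contribution
    collected_at: str        # ISO timestamp for auditing

record = ContributedDataRecord(
    contributor_id="contributor-0193",
    language="Quechua",
    modality="speech",
    collection_stage="initial_training",
    consent_scope="model training and evaluation; no resale",
    compensation_usd=0.85,
    collected_at=datetime.now(timezone.utc).isoformat(),
)
print(record.contributor_id, record.language, record.collection_stage)
```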
The next decade will see incredible progress in the capabilities of AI-powered systems. Even if we have not progressed to the highest levels of AGI, the impacts of these technologies on all aspects of society will be profound. Today, many people view AI as synonymous with machine learning. It is true that research on core ML topics such as novel model architectures, scaling, and planning is vital to the progress of AI. Combining these advances in ML with corresponding advances in human-computer interaction, however, is vital to developing safe, responsible AI that is usable by and useful to a diverse set of people. I am confident the HCI community will rise to the occasion and embrace an ambitious program of research (and cross-disciplinary outreach) that will help usher in a new generation of beneficial technologies.
1. Zamfirescu-Pereira, J.D., Wong, R.Y., Hartmann, B., and Yang, Q. Why Johnny can't prompt: How non-AI experts try (and fail) to design LLM prompts. Proc. of the 2023 CHI Conference on Human Factors in Computing Systems. ACM, New York, 2023, Article 437, 1–21.
2. Tang, J., LeBel, A., Jain, S., and Huth, A.G. Semantic reconstruction of continuous language from non-invasive brain recordings. Nature Neuroscience 26, 5 (2023), 858–866.
3. Wing, J.M. Computational thinking. Communications of the ACM 49, 3 (2006), 33–35.
4. Terry, M., Kulkarni, C., Wattenberg, M., Dixon, L., and Morris, M.R. Interactive AI alignment: Specification, process, and evaluation alignment. arXiv:2311.0071, Oct. 23, 2023.
5. Christian, B. The Alignment Problem: Machine Learning and Human Values. W.W. Norton & Company, 2021.
6. Weidinger, L. et al. Sociotechnical safety evaluation of generative AI systems. arXiv:2310.11986, Oct. 18, 2023.
7. Wang, Z.J., Kulkarni, C., Wilcox, L., Terry, M., and Madaio, M. FarSight: Fostering responsible AI awareness during AI application prototyping. Proc. of the 2024 CHI Conference on Human Factors in Computing Systems. ACM, New York, 2024, Article 976, 1–40.
8. Schmidt, A., Elagroudy, P., Draxler, F., Kreuter, F., and Welsch, R. Simulating the human in HCD with ChatGPT: Redesigning interaction design with AI. Interactions 31, 1 (2024), 24–31.
9. Agnew, W. et al. The illusion of artificial inclusion. Proc. of the 2024 CHI Conference on Human Factors in Computing Systems. ACM, New York, 2024, Article 286, 1–12.
10. Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., and Anderson, R. The curse of recursion: Training on generated data makes models forget. arXiv:2305.17493, May 27, 2023.
Meredith Ringel Morris is a principal scientist and director for human-AI interaction research at Google DeepMind. Her research focuses on human-centered AI, including interaction design, evaluation methods, and societal impacts. She is an ACM Fellow and member of the SIGCHI Academy. [email protected]
1 In his classic book The Design of Everyday Things, Norman introduced the concept of the gulfs of execution and evaluation, arguing that the purpose of a user interface was to bridge these in human-machine communication. The gulf of execution refers to the user's challenge in conveying their intent to a computer; the gulf of evaluation refers to the challenge of the user determining if the machine properly carried out that intent. For instance, graphical user interfaces (GUIs) are more effective at bridging these gulfs for typical end users than their precursor, the command line interface. With a command line interface, there is a large gulf of execution since the user must know a priori what commands are possible to enter, whereas a GUI minimizes this gulf by presenting drop-down menus that reveal the possible commands. Note that while menus reduce the gulf of execution vis-à-vis command line inputs, they do not eliminate it. For instance, command names in a menu may still be difficult for a user to interpret or map to their specific use case. Similarly, one of the advantages of direct manipulation interfaces over prior interaction paradigms was the reduction of the gulf of evaluation—the effect of an action is immediately visible to the end user in direct manipulation paradigms such as WYSIWYG ("what you see is what you get").
2 Ironically, the anthropomorphized language we use to describe modern ML systems, such as artificial "intelligence," might exacerbate process gulfs, since such terminology may lead end users to form inappropriate mental models of system operations based on analogies to humans. Understanding the impact of anthropomorphizing vocabulary and mental models on AI system usability is another important area of investigation for HCI research, and one that may require revisitation over the longer term as people's familiarity with AI technologies evolves. A related challenge our community may be well poised to address is developing tools and curricula to support AI literacy (i.e., the set of critical thinking skills a typical end user will need to be an informed and productive member of society in the era of widespread use of advanced AI). Jeannette Wing's concept of "computational thinking" from nearly two decades ago [3] could be a starting model for us to consider the question of what "AI thinking" (a less passive reframing of AI literacy) might be and how to convey it to various audiences both explicitly (in our educational curricula) and implicitly (through interface and interaction design).
3 Alignment [5] is a term commonly used in the AI safety and, increasingly, the AI ethics communities that refers to ensuring an AI system is operating in line with the goals and values of people. For instance, alignment concerns in AI safety include both near-term concerns, such as preventing systems from producing harmful outputs (like instructions to make a bioweapon), as well as more theoretical concerns, such as preventing advanced AI systems from deceiving their operators. AI ethicists have pointed out the need to consider pluralistic concerns when considering alignment problems, since a single canonical set of human values to align with does not exist. Current approaches to alignment tend to be model-centric—for example, including diverse data in training to support pluralistic alignment—and to use methods such as RLHF, reinforcement learning from AI feedback, and red teaming to identify safety issues and then fine-tune the model to mitigate identified risks. Considering how user interface design might support alignment goals is an underexplored topic and one that our community is well poised to address.
Copyright is held by the owner/author. Publication rights licensed to ACM.