Effective Automation to Support the Human Infrastructure in AI Red Teaming

Authors:
Alice Qian Zhang, Jina Suh, Mary L. Gray, Hong Shen

AI systems increasingly make high-stakes decisions, from healthcare diagnostics to financial transactions. Ensuring these technologies align with ethical principles and do not contribute to societal harm is a growing priority. One emerging solution is red teaming: a process that simulates adversarial attacks to uncover vulnerabilities before they can be exploited. Red teaming is a well-known practice in national security. Originally, it was a vulnerability assessment method for military decision making that was later adapted for cybersecurity. Given these origins, it is no surprise that AI red teaming gained momentum as an essential practice for AI security and safety, mandated by the Biden administration's executive order "Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence." Large technology companies, such as OpenAI and Microsoft, followed up, strengthening their red-teaming efforts. They hired dedicated professionals, started collaborating with domain experts in high-stakes fields (cybersecurity, finance, medical, etc.), partnered with contract red teamers through business process outsourcing (BPO) firms, and engaged crowdsourced gig workers. At the same time, efforts to broaden red teaming have expanded to include open calls for public participation, involving everyday users and affected communities with lived experiences. Examples include the public AI red-teaming challenge at the security conference DEF CON 2023, where participants from diverse backgrounds tested AI vulnerabilities.

Insights

→ AI red teaming, such as content moderation, relies on human expertise; automation should enhance human proficiency rather than prioritize efficiency.
→ Overreliance on automation can reduce red teamers' agency.
→ Scaling automation in AI red teaming must ensure human oversight for adaptive, context-aware risk mitigation.

As a result, AI red teaming has increasingly relied on a diverse and growing "human infrastructure"—one that includes hired professionals, outsourced contractors, paid crowdsourced workers, and unpaid volunteers, whose labor is critical in identifying vulnerabilities and stress testing contemporary AI systems. These red teamers operate along three key dimensions: payment structure, required expertise, and relationship with the organizing body. At one end of the spectrum, hired domain experts (e.g., security analysts, medical professionals) are often formally employed or contracted for their specialized knowledge and play a critical role in high-stakes evaluations. Outsourced contractors, engaged through BPOs, are compensated hourly or per task, typically working under structured guidelines. Paid crowdsourced red teamers, such as gig workers on platforms like MTurk and Prolific, perform red teaming on a per-task basis, often with lower initial expertise but the ability to develop domain knowledge over time. At the other end of the spectrum, volunteers with varying levels of expertise and an informal relationship with the governing body, including end users, affected communities, and the general public engage in red teaming out of personal interest or because they see it as an educational opportunity. As AI technologies advance at an unprecedented pace, supporting such human infrastructure behind AI red teaming is essential to keep up with emerging risks and to ensure AI systems remain secure and reliable.

The increasing demand for red teaming has also sparked interest in a variety of automated approaches, where AI technologies are used to test other AI systems [1]. Automation has been explored as a promising solution to improve efficiency, mitigate harm to human red teamers, and enhance scalability. The first argument for automation is cost and efficiency. By automating red teaming, some scholars argue, organizations may address the breadth of the risks they seek to uncover in a more cost-effective and time-efficient manner [2]. The second argument is risk mitigation for human red teamers. Given the potential harms associated with red teaming, such as secondary trauma, well-intentioned efforts exist to develop fully automated approaches that minimize human exposure to distressing or harmful content [3]. The third argument is scalability. As red teaming becomes an essential component of AI safety, there is growing emphasis on scaling up these efforts through automation. The goal is to develop more generalizable and standardized red-teaming models that can be broadly applied across different AI systems [2,3].

While automation has the potential to enhance AI red teaming, the way we conceptualize and implement it may inadvertently constrain opportunities to expand and strengthen the human labor behind red-teaming practices. Indeed, even with all the calls for automation, current red-teaming methodologies lack a standardized definition of success. This inconsistency highlights the need for automation to complement, rather than replace, human judgment in assessing and refining red-teaming methodologies. From our prior work [4] we see clear parallels between the push to automate AI red teaming and the automation of content moderation—a form of data work that involves identifying and removing harmful content from online platforms. In this article, we discuss three principles of automation and how keeping humans in the loop might provide both better red-teaming outcomes and safeguard human expertise and well-being. We engage critically with emerging views on automated AI red teaming, highlighting opportunities to utilize existing and potential technologies to support human engagement in this work. To structure the discussion, we introduce three key pillars of effective automation. First, we examine the need for proficiency, or the development of expertise and skill among red teamers. Next, we discuss the importance of agency, or the ability for red teamers to actively shape, influence, and make meaningful decisions in the process. Finally, we illustrate the importance of adaptability, or the ability of AI red-teaming practices to respond to evolving technologies and threats of harm.

The Case for Proficiency over Efficiency

A key goal of automated AI red teaming is to enhance efficiency and keep pace with the rapid development of AI technologies [2]. Existing approaches prioritize maximizing the quantity of red teaming (e.g., generating large datasets of adversarial prompts) while minimizing the cost and time. These existing approaches view the need for human time and expertise as a fundamental constraint and aim to minimize it. The benefits of these advancements are clear: By leveraging automated red teaming, organizations can conduct risk evaluations in a more cost-efficient, time-effective manner. A singular focus on efficiency, however, risks overlooking how automation might also be leveraged to enhance human expertise and engagement in red teaming.

Content moderation faced a similar challenge when the popularity of social media platforms increased the amount of harmful user-generated content on them. As a result, social media platforms began hiring or eliciting volunteer involvement of human personnel to manually review content to determine if it violated platform policies and guidelines. However, many large technology companies viewed the cost of maintaining a human moderation workforce as too high, even though content moderators were often underpaid and lacked adequate workplace support for their well-being, as revealed by lawsuits and media reports. As a result, interest grew in fully automating content moderation to cut costs. At the same time, automation also promised to increase the volume of reviewed content and speed up the process. Although many social media platforms use automated methods for content moderation, it remains unclear whether automation actually improves effectiveness [5]. What has been made clear is that the focus on processing more content at faster speeds has come at a significant cost to moderators. Initial approaches to automating content moderation potentially contributed to the harsher treatment of moderators by requiring them to be more efficient in keeping up without increasing workplace support. This has contributed to high workforce turnover rates and explicit backlash from workers who demanded greater support by pursuing class action lawsuits.

This historical trajectory serves as a cautionary parallel for AI red teaming, where automation is currently being pursued with a similar emphasis on efficiency. If red-teaming automation follows the same path, we risk missing critical opportunities to use technology not just to make these efforts more efficient but also to enhance the skills and expertise of human red teamers. To support workforce proficiency, we must design tools that support the development of expertise and skill among AI red teamers. We have already surfaced ideas through our prior research of content workers and potential applications of automation [4]. One promising way is to use automated tools to support the training of new professionals to prepare them for exposure to potentially harmful content. For instance, we may develop training datasets that include different levels of exposure to harmful content. Additionally, automated tools can support actual work practices, such as the automated generation of different prompts crafted by experts [2]. Ultimately, we argue for the realignment of automation in AI red teaming to amplify the core human aspects of the work, rather than treating human involvement as a constraint or cost to be minimized.

The Central Role of Human Agency

Another key motivation behind the development of automated AI red teaming is the desire to reduce human involvement due to the potential risks and harms associated with this work [3]. For instance, red teaming often requires direct engagement with harmful content, which can lead to psychological distress. Additionally, exposure to this type of content can lead to fatigue, and the repetitive nature of tasks, such as crafting adversarial prompts, can exacerbate the effect. Finally, there may be disproportionate burdens on marginalized workers expected to represent their unique lived experiences. In this context, automation is positioned as a solution to mitigate these risks by minimizing the direct exposure of human workers to distressing or exploitative content. However, while the intention to address potential harms to AI red teamers may be well meaning, it often leads to binary thinking of applying either completely automated approaches or completely manual processes. Such thinking may limit the consideration of other innovative opportunities for automation. Even worse, this may take away AI red teamers' ability to provide meaningful oversight on AI red teaming. Consider the trajectory that content moderation has followed in recent years. The initial efforts to spread awareness of the challenges content moderators face, such as harmful working conditions and a lack of workplace support, were crucial.

Framing content moderators primarily as victims, however, led to a push for increased automation, often at their expense, rather than improving their working conditions or leveraging their expertise more effectively. For example, even though many of the practitioners we surveyed had access to automated content moderation technologies, such as those designed to limit exposure to harmful content, some still struggled to integrate these tools into their existing workflows. Users and content creators have raised concerns that automated moderation systems may wrongly flag appropriate content for review, especially when it covers sensitive topics such as LGBTQ+ issues from marginalized creators. This could significantly increase the workload for content moderators, requiring them to review a high volume of inaccurately flagged content, which would divert their attention from genuinely harmful material and potentially lead to burnout. This example highlights the need for automation that centers on workers' expertise and ability to make judgment calls. We can imagine how this tool could be improved to consider where content moderators' expertise is essential (i.e., accurately identifying violating content that is nuanced) and where automation can provide meaningful assistance (i.e., automatically flagging obvious violations).

The true value of automation lies in augmenting human capabilities, not replacing them.

Thus, we propose that more research should be conducted to explore automated red-teaming approaches that enhance, rather than diminish, the agency of human red teamers. Tools for content moderation could be adapted or used as a starting point to explore opportunities to reduce the exposure of workers to harmful content in a way that amplifies parts of red teamer workflows that require their expertise. One possibility might be automated tools that limit exposure to harmful content without compromising the ability to identify it, such as using carbonization or artistic rendering of real images that might otherwise be psychologically distressing. Ultimately, when we consider automated red teaming as a complement to human red teaming—enhancing rather than replacing human expertise—we uncover numerous opportunities to better support human involvement, ultimately contributing to improved workforce retention and well-being.

The Limitations of Scaling and Standardization

Much of the justification for using automated approaches to AI red teaming hinges on the need to enhance scalability. Previous arguments stated that, as AI systems grow in complexity and are deployed across increasingly diverse domains, automation enables red-teaming efforts to scale by standardizing methodologies. This standardization not only makes AI risk assessments more broadly applicable but also ensures consistency in identifying vulnerabilities across different AI models. Techniques such as universal filtering mechanisms or generalized adversarial prompts [1] exemplify this shift toward scalable, repeatable red-teaming strategies that can be applied across a wide range of AI technologies with minimal adaptation. This emphasis on scale, however, inherently prioritizes generalizability over adaptability, as scalable methods rely on broad, predefined attack strategies rather than dynamically evolving tactics tailored to specific AI systems.

Therefore, while automation offers clear advantages in terms of consistency and scalability, it also raises concerns about the loss of context-specific adaptability. For example, some risks that AI red teaming may want to target are deeply embedded in community norms, governance structures, and social dynamics, making certain aspects of red teaming, such as the evaluation of misinformation, bias, and content moderation, difficult to generalize without losing critical context. Drawing from our prior work, we found that content moderators often develop domain-specific knowledge over time, including expertise in identifying instances of violence and terrorism, which enables them to make better-informed decisions about instances in which content should be removed. As machine-learning-based systems scale, maintaining adaptability becomes increasingly challenging. Such large-scale systems often struggle to accommodate local variations in data, models, and deployment environments, leading to rigid, one-size-fits-all solutions that may fail in unpredictable ways [6]. There are well-documented instances of automated content moderation models removing content that did not violate platform guidelines (e.g., images of women breastfeeding) or failing to detect content that did violate guidelines (e.g., Facebook failing to detect hate speech during the Myanmar crisis). Automated red teaming, while valuable for identifying certain types of technical vulnerabilities, cannot fully replace adaptable, human-expertise-driven practices. To address this limitation, red-teaming efforts should adopt a hybrid approach, combining scalable automation with targeted, human-driven interventions that remain sensitive to context. In this way, there is much potential to combine approaches of automated generation of prompts [2] with existing human-based approaches. While automation can enhance the scope of AI red teaming, truly effective risk mitigation will require preserving non-scalable interventions, such as case-by-case human oversight and adversarial testing tailored to specific domains.

The future of AI red teaming is at a crossroads. As we have outlined, different visions of automation, labor, and scaling reveal competing priorities. We discussed the tensions of efficiency versus proficiency, the role of human agency, and limitations in scaling red teaming through automation alone. These are not just technical and labor challenges but governance dilemmas that will shape how AI risk mitigation evolves.

If AI red teaming remains an effective mechanism for AI safety, it must be carefully designed to navigate these tensions. Automation should not be pursued at the expense of human expertise, nor should the drive to scale red teaming override considerations of adaptability. The true value of automation lies in augmenting human capabilities, not replacing them. The challenge ahead is not merely to automate red teaming but to ensure that automated systems are designed carefully to empower human experts and support the long term well-being and retention of the red-teaming workforce.

To achieve this goal, we invite researchers, policymakers, and industry practitioners to critically examine the role of automation in AI red teaming and explore how it can serve as an avenue to amplify human expertise rather than diminish it. This means developing tools that support—not supplant—the judgment of red teamers, investing in methodologies that preserve contextual awareness, and fostering career pathways that ensure the long-term well-being and retention of the red-teaming workforce. By centering human expertise within a thoughtfully designed hybrid model, we can build a red-teaming ecosystem that is not only scalable but also rigorous, context-aware, and resilient in the face of evolving AI risks.

References

1. Feffer, M., Sinha, A., Deng, W.H., Lipton, Z.C., and Heidari, H. Red-teaming for generative AI: Silver bullet or security theater? Proc. of the Seventh AAAI/ACM Conference on AI, Ethics, and Society (2024), 421–437.

2. Perez, E. et al. Red teaming language models with language models. arXiv:2202.03286, Feb. 7, 2022.

3. Radharapu, B., Robinson, K., Aroyo, L., and Lahoti, P. AART: AI-assisted red-teaming with diverse data generation for new LLM-powered applications. arXiv:2311.08592, Nov. 14, 2023.

4. Zhang, A.Q., Amores, J., Gray, M.L., Czerwinski, M., and Suh, J. AURA: Amplifying understanding, resilience, and awareness for responsible AI content work. Proc. of the ACM on Human-Computer Interaction 9, CSCW2 (2025), 1–45.

5. Gorwa, R., Binns, R., and Katzenbach, C. Algorithmic content moderation: Technical and political challenges in the automation of platform governance. Big Data & Society 7, 1 (2020), 2053951719897945.

6. Lwakatare, L.E., Raj, A., Crnkovic, I., Bosch, J. and Olsson, H.H. Large-scale machine learning systems in real-world industrial settings: A review of challenges and solutions. Information and Software Technology 127 (2020), 106368.

Authors

Alice Qian Zhang is a Ph.D. student at Carnegie Mellon University. Her research explores the human infrastructure behind responsible AI, focusing on red teaming, labeling, and moderation. She develops methods to enhance the experiences and well-being of those shaping safer, more ethical AI systems. [email protected]

Jina Suh is a principal researcher in the Inclusive Futures group at Microsoft Research. Her work lies at the intersection of technology and human well-being, examining the role of technologies, design choices, development practices, and values in their positive and negative impact on individual and organizational mental health. [email protected]

Mary L. Gray is a senior principal researcher at Microsoft Research, a faculty associate at Harvard University's Berkman Klein Center for Internet & Society, and a MacArthur Fellow. Her work focuses on how everyday uses of technologies transform labor, identity, and human rights. She maintains a faculty position in the Luddy School of Informatics, Computing, and Engineering with affiliations in anthropology and gender studies at Indiana University. [email protected]

Hong Shen is an assistant professor in the Human-Computer Interaction Institute at Carnegie Mellon, where she directs the CARE (Collective AI Research and Evaluation) lab. She is the recipient of a Google Research Scholar Award and a Microsoft AI & Society Fellowship, and has received Best Paper Awards at top HCI venues. [email protected]

This work is licensed under Creative Commons Attribution International 4.0.

ACM Interactions

Forums

Effective Automation to Support the Human Infrastructure in AI Red Teaming

Post Comment

View This Article

Reader Tools

Browse This Issue

SIGN IN