Authors:
David Gray Widder, Tamara Kneese
We have observed a rising current within computer science attempting to bend the field toward some notion of social good—from ethics in AI and participatory methods that include marginalized communities in HCI to justice-oriented programming language design. In this current, some have rightfully looked to the social sciences to fill quantitative deficits with qualitative methods, to attend to questions of context, and to think about how the technology we build will (not) fit into society. At its most frenzied, this can occasionally lead to a cult of interdisciplinarity and a fetishization of the social sciences, tacitly assuming that adding a social scientist or two and stirring will naturally lead to more just futures.
→ Computing can and should learn from anthropology's turn toward critical reflexivity.
→ Inclusion in technological systems is not, by itself, good.
→ Those doing low-resource NLP should examine whether making global majority languages more legible to large tech companies is an ethical goal to pursue.
But lionizing the social sciences will also lead us astray. In this article, we suggest a more specific way that computer science can learn from the social sciences—by embracing a critical and reflexive turn. This is akin to how anthropology sought—and is still seeking—to come to terms with its violent colonial past and the continued role of many academic fields in perpetuating inequalities. We compare low-resource natural language processing (NLP), which attempts to build computational tools that work as well in marginalized languages as they do in English, with salvage anthropology, which sought to preserve supposedly dying Indigenous cultures for posterity, often by collecting their artifacts and languages within powerful institutions. Anthropologists in the late 19th and early 20th centuries believed they were engaged in a noble cause and, while various academic disciplines have long been agents of colonialism and exploitation, this example speaks to the larger problem of extraction done in the service of preservation and knowledge production. There are clear connections between linguistic anthropology's origins and the work being carried out by low-resource NLP researchers today, approaching linguistics from another disciplinary angle. In drawing this comparison, we urge computer science to engage in its own critical turn, learning from how anthropology undertook one.
Anthropology emerged as a discipline for empirically documenting the differences between groups of people, namely European settlers and colonized Others. As many renowned anthropologists have noted over the decades, colonialism is foundational to anthropology as a field. Anthropology as a discipline very much imagined itself to be a science, and ethnography was just one of anthropology's methods toward that empirical aim. Anthropology once privileged the role of the participant observer as an objective outsider.
Anthropology is also a discipline predicated on preserving the cultures of colonized peoples whose ways of life have been disrupted by the colonizer's presence—preservation work that was sometimes directly funded by colonial governments. James Cowles Prichard contended in 1839 that Indigenous cultures should be carefully studied and documented before their likely extinction [1]. Later in the 19th century, Franz Boas, a founder of American anthropology, stole Native American skulls as part of his data collection. Boas famously said: "It is most unpleasant work to steal bones from a grave, but what is the use, someone has to do it." The vanishing savage trope tied Indigenous people to the natural history of threatened landscapes and habitats.
Linguistic anthropologists also documented the speakers of Indigenous languages they thought would soon die out. The story of Ishi, the last living member of the Yahi tribe who was in many respects treated as a specimen rather than a person, is an example of this phenomenon. Judith Kaplan and Rebecca Lemov trace Ishi's story using the University of California, Berkeley's archives, especially the recordings of his speech in conversation with anthropologist Alfred Kroeber (who also happened to be Ursula K. Le Guin's dad). Each performance is one of a kind within Yahi culture, but anthropologists treated the recordings as representative data.
Now those recordings have been digitized and are used in current language revitalization efforts. Ishi died in 1916, but in the 1980s the data was used to attempt to revive the Yahi language via computer analysis [2]. UC Berkeley is working on language revitalization using these archives from the fraught history of anthropology. For example, Berkeley's Breath of Life program was started in 1995 for Californian languages with no or very few speakers left (http://bit.ly/40nligp). While the recordings themselves are useful and part of revitalization efforts that include language speakers, it is also important to examine the power structures that allowed this preservation work to occur, including the treatment of language speakers like Ishi.
Low-resource NLP is a subfield that seeks to ensure natural language processing systems perform well on so-called low-resource languages, often defined as languages that do not have a significant Internet presence. A 2021 survey on low-resource NLP [3] states that "most of today's [NLP] research… ignores thousands of languages with billions of speakers," and goes on to say that "this indicates the unused potential to reach millions of speakers who currently have no access to higher-level NLP applications." Both statements present insufficient access to NLP tools as a self-evident problem for marginalized communities. The survey says that the subfield's various techniques "share the motivation of overcoming the lack of labeled data by leveraging further sources" of data. Other work explicitly positions insufficient work in low-resource languages as an inclusion and access issue: "languages which have a large amount of resources and therefore access to the benefits of the current NLP breakthroughs" [4]. Recent critical analysis of the concept of low-resource NLP examines how economic and colonial structures deprive some languages of resources, such that they become low-resourced [5].
We believe, however, that it is worth questioning this sometimes unstated assumption that better NLP for Global South languages is obviously better for speakers from the region. For this, we learn from Anna Lauren Hoffmann, who theorizes about the "relationship between inclusion [our emphasis] and the violent potentials of data science and technology." She writes about how inclusion in datasets is not a good unto itself and can indeed be harmful. This happens if the system built using these datasets is then used, for example, for surveillance or disciplinary purposes [6]. And we need not strain our imagination too hard to think how this might happen in NLP: The Chinese Communist Party is currently carrying out severe human rights abuses against ethnic minority Uighurs, digitally assisted by intense surveillance in both the physical and online worlds. Given this, when we see attempts to perform NLP on the Uighur language, including studies from top state-funded Chinese universities, we must ask: Will these systems be used to empower Uighur speakers or to make already marginalized and targeted groups more visible to further surveil them?
Anthropology's Critical Turn: Participation Is Not a Salve
Given the complex history of anthropology, it is somewhat ironic that anthropology or the social sciences in general are sometimes presented as a way to make technology production more ethical.
The participatory turn in computer science and HCI draws on anthropology's history of action and militant research and on the need for community engagement in the research and development process. But participation is not a guarantee of equity in research, and it can lead to problems ranging from stakeholder fatigue to co-optation. Researchers must take care to ensure that the participation process is equitable and considers the needs of participants (see work by Meg Young, Abeba Birhane, Fernando Delgado, Mona Sloane, and others).
Computer science can learn from anthropology's own critical turn as a way of situating the participatory turn in AI. Anthropology itself has gone through several attempts at self-rehabilitation. The reflexive turn in anthropology dates back to the 1960s and 1970s, when anthropologists reckoned with their field's role in colonialism [7]. This turn in anthropology coincides with the rise of critical ethnic studies, women's studies, and Marxist cultural studies. Anthropologists examined histories of the field's relationship to power, turning to studying up, researching Western cultures, and intervening in new fields like science and technology studies. This is not to say that anthropology no longer reckons with its past, as major universities still hold the physical remains of Indigenous people, which have not been returned through repatriation. Computer science can look to anthropology's history for an understanding of how even the best of intentions can go wrong.
For example, Google Translate recently added languages through the application of generative AI. The ethnographer and media studies scholar Zane Griffin Talley Cooper remarked that many of his interlocutors from Greenland were excited to see Greenlandic included but also laughed at the errors introduced by the system. Technologists should not do this work without collaborating with language speakers and centering the needs of communities to prevent these kinds of errors, which can further harm language speakers by confusing pronunciation and producing bad datasets. Is this model building done to empower language speakers? Or is this service another form of colonial control that uses powerful models to translate Greenlandic in ways that serve the state rather than the speakers?
Applying Abeba Birhane's relational ethics, researchers from Stanford and Masakhane (www.masakhane.io/) discussed the importance of decolonizing NLP and the need for a power analysis, noting that Africans often lack agency over their data and stories. Even if there is an attempt to expand the available training data for African languages, if African language speakers themselves do not have control over the data and the ways it is used, this is a problem, especially because well-intended corporations can "erase linguistic complexities and disfigure the language" [8].
The labor and data infrastructures needed to do this work are another element that complicates the noble pursuit of linguistic inclusion. One example that illustrates this is incarcerated people in Finland training AI due to the lack of resources in Finnish (https://bit.ly/4j9QBT9). Thus, it is impossible to talk about the benefits of low-resource NLP without considering the role of annotators and their labor conditions. With any such project, it is crucial to ask who is doing this work and what the power dynamics are behind it. Are they being fairly remunerated and are their community's needs being met? Who owns the data? Is meaningful consent sought and earned? Projects such as the Data Workers' Inquiry, led by organizers at the Distributed AI Research Institute, Turkopticon, and the Weizenbaum Institute, empower data workers to conduct research on their own workplaces and in their own communities, bringing these questions to the fore and grounding them in autoethnographic understanding.
Computer scientists should consider meaningfully including Indigenous academic expertise as well as the perspectives of communities on the ground. For example, the Native studies perspective is quite different from disciplines like linguistics or anthropology in framing the problem of low-resource NLP. Language reclamation is a powerful framing for low-resource NLP if the work is actually used to empower people in the Global South, rather than resting on the notion that protecting endangered languages will be good for science [9].
Our point is not that this work should not be undertaken. We would argue that this kind of work is crucial to bolstering language revitalization efforts and ensuring inclusion. The point is that this work should be done with the input and consent of language speakers, with a goal to shift power to them by giving them control and ownership of the data and tools, and with attention to the labor conditions, power dynamics, and data infrastructures involved with this process.
Conclusion: What Can Computer Science Learn from Anthropology?
Computer science can learn from anthropology's attempts at reckoning with its own problematic history to offer more reparative approaches that center communities' needs and trouble power structures. While the social sciences should not be lionized, computer science can learn from how anthropology has faced its demons.
We reiterate Philip Agre's call for self-reflection to be built into computer science and suggest that critical histories can contextualize how the field creates and reifies power structures. We look to Meredith Whittaker's history [10] of early computing trailblazer Charles Babbage's involvement in the British slave trade, as well as numerous histories of IBM's computing support for the Nazis, and military applications today, such as Google's ongoing provision of services to the Israel Defense Forces. Embedding critical understandings of our field's past will hopefully make it easier to be critical of our present, enabling researchers to reckon with legacies of colonialism and other power structures that inform their work.
As others have pointed out, participation by marginalized groups is not enough. The question is on whose terms do participation and consent occur? How can we position communities as researchers in their own right, as they collect their own data and determine what the focus of the work should be? There are examples of Indigenous researchers and technologists using AI as they see fit when it comes to language revitalization efforts, and we can learn from them. Te Hiku Media (tehiku.nz/) is one of the best examples of how archives can be part of language revitalization efforts that are community led and controlled. Te Hiku digitized archives of recordings of Māori speakers to make them accessible to new generations. The organization ensures that all the decisions are made by Māori and that there is training to help the community control the data and the tools and to use them for their own purposes. These tools are not controlled by a company like Google or a university. Another example is Ghana NLP (ghananlp.org/), an open source project intended to fill in gaps left by Google Translate.
Critical histories should find their way into mainstream computer science textbooks and be part of how we tell the history of our field, and the initiatives that shift power toward community-led efforts should be part of our field's future.
1. Gruber, J.W. Ethnographic salvage and the shaping of anthropology. American Anthropologist 72, 6 (1970), 1289–99.
2. Kaplan, J. and Lemov, R. Archiving endangerment, endangered archives: Journeys through the sound archives of Americanist anthropology and linguistics, 1911–2016. Technology and Culture 60, 2 (2019), 161–87.
3. Hedderich, M.A., Lange, L., Adel, H., Strötgen, J., and Klakow, D. A survey on recent approaches for natural language processing in low-resource scenarios. arXiv:2010.12309, Oct. 23, 2020.
4. Joshi, P., Santy, S., Budhiraja, A., Bali, K., and Choudhury, M. The state and fate of linguistic diversity and inclusion in the NLP world. In Proc. of the 58th Annual Meeting of the Association for Computational Linguistics. D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, eds. Association for Computational Linguistics, 2020, 6282–93.
5. Nigatu, H.H., Tonja, A.L., Rosman, B., Solorio, T., and Choudhury, M. The Zeno's paradox of "low-resource" languages. In Proc. of the 2024 Conference on Empirical Methods in Natural Language Processing. Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, eds. Association for Computational Linguistics, 2024, 17753–74.
6. Hoffmann, A.L. Terms of inclusion: Data, discourse, violence. New Media & Society 23, 12 (2021), 3539–56.
7. Asad, T., ed. Anthropology & the Colonial Encounter. Humanities Press, 1995.
8. Ògúnrèmí, T., Nekoto, W.O., and Samuel, S. Decolonizing NLP for "low-resource languages": Applying Abeba Birhane's relational ethics. GRACE: Global Review of AI Community Ethics 1, 1 (2023), Article 1.
9. Leonard, W.Y. Refusing "endangered languages" narratives. Daedalus 152, 3 (2023), 69–83.
10. Whittaker, M. Origin stories: Plantations, computers, and industrial control. Logic(s), 19 (2023); https://logicmag.io/supa-dupa-skies/origin-stories-plantations-computers-and-industrial-control/
David Gray Widder is a postdoctoral fellow at the Digital Life Initiative at Cornell Tech. [email protected]
Tamara Kneese directs the Data & Society Research Institute's Climate, Technology, and Justice program. [email protected]
This work is licensed under a Creative Commons Attribution 4.0 International license.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2025 ACM, Inc.