OpenSpeaks Before AI: Frameworks for creating the AI/ML building blocks for low-resource languages

Authors: Subhashish Panigrahi
Posted: Wed, April 05, 2023 - 9:23:00

There has been a tremendous push on many levels to make artificial intelligence– and machine learning–based applications ubiquitous. Soon, the life decisions of almost every digital technology user will be affected by some form of algorithmic decision making. However, the development of the large language models (LLMs) that drive this research and development often lacks participation from people of diverse backgrounds. It ignores historically oppressed communities such as Black and other ethnolinguistic or socioeconomic minority groups, women, transgender individuals, people with disabilities, and elderly individuals globally, and the Dalit-Bahujan-Adivasi communities in South Asia and the diaspora. Data about and by these people is therefore systematically suppressed. Even more problematic, this suppression carries over into the creation of the LLMs driving AI/ML research and development.

Furthermore, seemingly public information might not always be collected ethically, with informed consent from the people affected. Even mature regulatory frameworks such as the General Data Protection Regulation (GDPR) in the European Union do not provide enough guidance on how private data is collected, stored, and shared. As a result, those who create LLMs often have little insight into the biases in their data or how it was collected. Take the case of DALL-E 2, which uses publicly available images owned and copyrighted by different people, or ChatGPT, which uses massive datasets from multiple sources. In both instances, not only does the model creation lack the representation of marginalized groups and contain only biased data about them, but the outcomes that derive from the training data also make these groups even more vulnerable.

Palestinian children walk past rubbish next to a paved path reserved for Israeli settlers. Bias in language data is often understood by asking who has access to, and who participates in, knowledge creation and dissemination. (CREDIT: CPT Palestine, CC BY 2.0)

Low-Resource Languages
The creation of LLMs like GPT-3, when used in applications such as chatbots, directly affects dominant-language users. Underpaid tech support workers subcontracted to support users in developed countries might even see these chatbots as a potential threat. But when it comes to low- and medium-resourced languages, the issues stemming from biases and low representation can aggravate things further. The issues faced by native speakers of many Indigenous, endangered, and low- and medium-resourced languages are poorly documented or missing in HCI research and development, particularly in AI-based tech innovations. For instance, issues with script input or other technological problems are generally documented and fixed for the most well-established and dominant writing systems and languages. Speakers of many languages spoken and written in nondominant settings often do not have the know-how or the means to report these issues publicly, or to discuss them privately.

OpenSpeaks Before AI 
Mozilla defines trustworthy AI as “AI demonstrably worthy of trust, tech that considers accountability, agency, and individual and collective well-being” [1]. As a part of this, Mozilla started the MozFest Trustworthy AI Working Groups; as members of the 2021 working group cohort, we at the O Foundation piloted an experimental framework called OpenSpeaks Before AI [2].

Instead of treating AI as a stand-alone area, we looked at a few open-source platforms that allow users to generate multilingual big data (useful for AI/ML) and audited them openly. We tried to see whether this pilot could help us derive best practices that were inclusive in nature and relevant for low- and medium-resourced languages. Broader feminist viewpoints [3] and two existing studies primarily inspired the process: a seminal paper titled “Datasheets for Datasets” [4], which focuses on identifying gaps and biases in datasets, and our own research on web content monetization in two Indigenous languages from India: Ho and Santali [5]. We conducted two open audits covering two languages, Odia and Santali, and two recording platforms, Lingua Libre and Mozilla’s Common Voice, both of which help in creating multilingual speech data. Odia is a macrolanguage from India with nearly 45 million speakers; Santali is an Indigenous Indian language spoken by 9.6 million people.

Lingua Libre and Common Voice are open-source platforms that allow users to record words and phrases (Lingua Libre) and sentences (Common Voice). The Lingua Libre study and its outcomes, focusing on Odia and its Baleswari dialect, were explained in detail at Wiki Workshop 2022 [6]. The audit of Common Voice for Santali was presented during Mozilla Festival (MozFest) 2022 [6]. The OpenSpeaks Before AI framework covers six main areas:

  • Purpose and affordability: reasons a user uses a platform or a contributor contributes to developing it and how affordable it is for them to use/contribute
  • Hardware and platforms: devices and other platforms they use
  • Accessibility: accessibility issues and needs for those using a platform
  • Project launching: the preparation that leads to the first significant use/launch
  • Privacy: privacy- and ethics-related advantages and concerns
  • Diversity: diversities of different kinds (e.g., gender, caste, affordability/access, race, ethnicity).
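
For readers who want to keep audit notes in a machine-readable form, the six areas above could be sketched as a simple record. This is only an illustrative sketch, not part of the published framework: the class name `OpenAudit`, the area keys, and the method names are all assumptions made for this example, and the note text is a placeholder rather than a real finding.

```python
import json
from dataclasses import dataclass, field, asdict

# Keys for the six framework areas. The key names are illustrative
# assumptions; the framework itself only names the areas in prose.
AREAS = (
    "purpose_and_affordability",
    "hardware_and_platforms",
    "accessibility",
    "project_launching",
    "privacy",
    "diversity",
)


@dataclass
class OpenAudit:
    """Hypothetical container for one open audit of one platform."""

    platform: str
    language: str
    # One list of free-text notes per framework area.
    notes: dict = field(default_factory=lambda: {area: [] for area in AREAS})

    def add_note(self, area: str, text: str) -> None:
        """Attach an observation to one of the six areas."""
        if area not in AREAS:
            raise ValueError(f"unknown area: {area}")
        self.notes[area].append(text)

    def uncovered_areas(self) -> list:
        """Return the areas that have no notes yet."""
        return [area for area in AREAS if not self.notes[area]]

    def to_json(self) -> str:
        """Serialize the audit so it can be published and reviewed openly."""
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)


if __name__ == "__main__":
    audit = OpenAudit(platform="Common Voice", language="Santali")
    # Placeholder text only; this is not a real audit finding.
    audit.add_note("accessibility", "placeholder observation")
    print(audit.uncovered_areas())
    print(audit.to_json())
```

Listing the areas that still have no notes is one way to see at a glance which parts of an audit remain incomplete before it is shared.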

The framework itself is neither conclusive nor a restrictive guide. Rather, it collects some critical details about platforms that help people build LLMs and speech synthesis applications, and about their users and/or contributors. Audits can be conducted by both users and nonusers of a platform—and, importantly, by researchers or developers tied to a platform. The audit can be imagined in the same way as ethnographic user research, revealing what is working or not working and indicating what needs to be removed or improved. The open frameworks created based on the initial audits are also open to modification—as the products of a pilot, they have a lot of room for improvement. 

Santali-language Wikipedia editors being celebrated in Bhubaneswar, Odisha, India. Despite its official recognition and being spoken by 7.6 million Santal people, Santali is yet to see widespread use, leading to poor online representation of Santals and their community knowledge. (CREDIT: R Ashwani Banjan Murmu / CC-BY-SA-4.0)

The foundational layers of OpenSpeaks as a set of open educational resources lie in multimedia language documentation and emphasize Indigenous, endangered, and low- and medium-resource languages. It was initially intended to help citizen documenters and archivists with stand-alone audio and video projects, including documentary films, but now encompasses building multimedia language data. Since our tested languages have writing systems of their own, it was straightforward to publish the audit reports as text. Audits of platforms that serve oral-only languages and dialects, however, can also be conducted through audio and video interviews. These could be useful for strengthening the foundational areas in languages that lack resources before moving on to building massive speech data or LLMs. Foundational layers such as word or speech corpora often help in the research and development of many vital tools such as typefaces, input tools, and text-to-speech and speech-to-text tools. Broadly speaking, open auditing can also help in identifying gaps and advocating for resources for priority areas. Like any other correcting mechanism, open auditing is not foolproof. It can only help us see gaps and add a layer of accountability by incorporating practical tools such as “datasheets for datasets” [4].

1. Mozilla. Creating trustworthy AI: README.
2. Panigrahi, S. and Hembram, P. OpenSpeaks Before AI (1.0). O Foundation. Sep. 2021.
3. Acey, C.E., Bouterse, S., Ghoshal, S., Menking, A., Sengupta, A., and Vrana, A.G. Decolonizing the Internet by decolonizing ourselves: Challenging epistemic injustice through feminist practice. Global Perspectives 2, 1 (Feb. 2021).
4. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Daumé, H., III, and Crawford, K. Datasheets for datasets. Communications of the ACM 64, 12 (Nov. 2021), 86–92.
5. Panigrahi, S., Hembram, P., and Birua, G. Understanding web content monetization in the Ho and Santali languages.
6. Panigrahi, S. Building a public domain voice database for Odia. Companion Proc. of the Web Conference 2022. ACM, New York, 2022, 1331–1338.

Subhashish Panigrahi

Subhashish Panigrahi is a public interest archivist and researcher, civil society leader, and a nonfiction filmmaker interested in tech, society, media, Open Culture, and digital rights. He founded OpenSpeaks and cofounded the O Foundation.