XXIX.5 September - October 2022
Page: 6

Speech is human and multifaceted. Our approach to studying it should be the same

Arathi Sethumadhavan, Joe Garvin, Ben Noah

Whether it's the friendly virtual assistant in your smart speaker, the autogenerated captions on your YouTube video, or the software that physicians use to dictate clinical notes, voice AI has become a fixture of modern life. It's the promise of hands-free convenience: Simply speak naturally, and the computer listens, analyzes, and recognizes what you're saying. With voice-controlled homes and cars [1] on the horizon, our relationship with voice AI looks set only to deepen. But the task of building speech recognition technology remains a tall order: We want it to work well for everyone, but there is no one way of speaking.

People speak differently depending on how old they are or where they live, but less obvious demographic factors like socioeconomic status and education can also play a role. Layers of intersecting variables come together to influence how we express ourselves verbally. We humans use language like code-switching chameleons, reflecting and creating culture every time we talk. Speech, in other words, is so much more than a mechanical system of words. It is organic and fluid, forever entangled with identity, values, and self-expression.

It makes sense, then, that studying speech to improve AI models would be more than just an engineering job. In other words, teams grappling with complex problems involving people require diverse representation across disciplines. Working alongside engineers and data scientists, social science experts like sociologists, anthropologists, and sociolinguists can offer essential insight and context while navigating the intricacies of our many voices.

Lingering inequalities in voice AI. Use of computer speech recognition extends far beyond asking Alexa or Siri to play a song or give the weather forecast. Court reporters use it to generate transcriptions; people with disabilities use it to navigate phones and computers; and companies use it to make hiring and firing decisions. As voice AI has proliferated in recent years, overall accuracy and reliability have improved dramatically. But even state-of-the-art speech recognition tech does not work well for everyone.

A Stanford University study [2] from 2020, for example, tested services from major tech companies used for automated transcriptions and found large disparities between ancestry groups. The cause? Insufficient audio data when training the models, the study suggests. In other words, the voice AI powering the services was trained on datasets that left out many ways of speaking. In addition to certain ancestry groups, speech recognition systems also struggle with the accents of non-native English speakers [3], regional accents, and the voices of women.

Biased AI starts with a biased worldview. These divides in voice AI have been documented for years, and so have the data-collection missteps that perpetuate them. Why, then, is collecting enough of the right speech data such a stubborn problem? One factor at play is the out-group homogeneity effect, the cognitive bias that assumes people unlike you are more similar to one another than they really are. "They are alike; we are diverse," or so the bias would have you believe.

Especially when classifying language, the bias is insidious. Consider, for instance, how all people over 60 are often lumped together [4] into a single group: "older people." This broad category might be useful in some contexts, but relying on it when studying speech would be irresponsible. Data shows [5] that the way we talk continues to change as we age into our 60s and beyond, and that it even changes depending on gender. Tremendous variation exists within the group "older people" that deserves attention. But if someone isn't 60 or older themselves, out-group homogeneity bias might blind them to all that variation.

Even the terms commonly used to describe Black and African American language—African American Vernacular English (AAVE), African American English (AAE), or African American Language (AAL)—can themselves be seen as examples of the out-group homogeneity effect. In reality, a language variety is never exclusively part of an entire demographic group [6], and there could also be people of different ancestry groups who happen to speak similarly. When it comes to studying speech for voice AI, creating false monoliths out of certain subgroups isn't just inaccurate, it's dangerous. Glossing over meaningful differences in the ways people speak shuts them out of tomorrow and leaves their voices unheard.

The many nuances of language. Many different factors play into speech. Some might be obvious, like where you live or whether the language you speak is your native one. But other factors like health, education, and even historical migration patterns also play significant roles in shaping how a person speaks. Social factors like these contribute to linguistic variations. Anthropological linguists go a step further, suggesting that these factors actually construct social meaning. In this view, when someone speaks, their voice is doing so much more than simply reflecting their region or ancestry: It's expressing an identity.

Our gender identity, for instance, can influence how we talk. Culture or ethnic association can also influence how our speech develops, how we use it, and how it may evolve. When we define a specific variation of speech, therefore, we must include these societal factors as its foundational pillars.

Level of fluency, education, gender identity, the age at which a language was learned: Which of these many interconnected factors are the most decisive in shaping the ways we speak? It's crucial information to have, as it quickly becomes unwieldy to account for all possible aspects that determine speech. When collecting samples of spoken language to train a new speech model, for example, there are real-world limitations affecting what can be collected: time, money, personnel, and geography, to name a few. Prioritizing all the social factors is a complex job, one beyond the narrow scope of any one discipline.

A multidisciplinary approach. To build speech recognition technology that works well for everybody, we need to capture the right diversity in our data-collection strategies. This involves turning toward those nuances in language, being attentive and curious about them. We know that we want to capture an accurate picture of the incredible variety of human speech, and we also know that many complex dynamics are at play. This calls for a multidisciplinary approach for a better informed, more inclusive perspective.


An engineer might be able to notice the different word-error rates between demographic groups, for example, but a sociolinguist can help explain the different speaking patterns at play, how these patterns show up across communities, and historical reasons for why they emerged. A data scientist can tell you how many people in what groups need to be sampled. Sociologists, demographers, and anthropologists can speak to social behaviors and psychology, aspects that illuminate the subtleties of language. Domain experts like these offer invaluable insights and context, and involving them early on will help us design better datasets that capture human diversity.
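To make the engineer's side of this concrete: word-error rate (WER) is commonly computed as the word-level edit distance between a reference transcript and the system's output, divided by the number of reference words. The sketch below is a minimal illustration under stated assumptions, not the method of any study cited here. The group labels and transcripts are made up, and the function names are hypothetical.

```python
from collections import defaultdict


def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Pool errors and reference words per group, then divide.

    samples: iterable of (group, reference_text, hypothesis_text).
    Returns a dict mapping each group to its word-error rate.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in samples:
        ref_words = ref.lower().split()
        hyp_words = hyp.lower().split()
        errors[group] += word_edit_distance(ref_words, hyp_words)
        words[group] += len(ref_words)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Made-up transcripts for two hypothetical speaker groups.
samples = [
    ("group_a", "turn on the kitchen lights", "turn on the kitchen lights"),
    ("group_a", "what is the weather today", "what is the weather to day"),
    ("group_b", "turn on the kitchen lights", "turn in the kitchen light"),
    ("group_b", "what is the weather today", "what is whether today"),
]
rates = wer_by_group(samples)
for group, rate in sorted(rates.items()):
    print(f"{group}: WER = {rate:.2f}")
```

A gap between groups in numbers like these can show that a disparity exists. It cannot show why, which is where the sociolinguist's explanation of speaking patterns comes in.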

Toward more-equitable voice AI. Even with the help of other disciplines, building speech recognition systems is incredibly hard work. Training a voice AI model requires a huge amount of speech data. Collecting this data means bringing in lots of people from different population groups, some of whom are difficult to access and recruit, and some of whom, like Native Americans and the First Nations peoples [7] of Canada, haven't had their speech studied extensively. And when subjects are finally recruited, they need to be taken to a noise-controlled recording facility, asked to follow specific directions, and instructed to read aloud carefully designed paragraphs of text.

The process of creating voice AI is painstaking and resource intensive as it is, and detecting and reducing bias on top of it all makes the process even more difficult. And yet, we must be up to the task. The fact is that speech recognition systems of today, trained on largely homogeneous datasets, still don't work for all groups of people. This is more than a matter of services performing poorly; it's a matter of dignifying real ways of speaking, real identities and cultures. We must first acknowledge that this problem exists and educate teams about ways of building more-equitable voice AI. Then we need to act. Acknowledging the intricate social and cultural dimensions of speech, we might team up with experts from relevant disciplines. With the help of experts like social scientists, product teams are better equipped to think carefully about inclusive dataset design and to devise creative approaches to thorny data-collection obstacles.

Human speech poses engineering problems that transcend the technical. Building voice AI is in fact a sociotechnical endeavor, one that requires diversity in disciplines. The stakes are high, but with an intentional focus to seek out our blind spots, we can collaborate to build voice AI that truly works for everyone.

References

1. Walker, A. J.D. Power: Consumers in the U.S. and Germany want the same voice AI from home in their vehicles. Jun. 23, 2021.

2. Koenecke, A. et al. Racial disparities in automated speech recognition. PNAS 117, 14 (2020), 7684–7689.

3. Wiggers, K. Alexa and Google Assistant are 30% less likely to understand non-American accents. VentureBeat. Jul. 19, 2018.


5. Fougeron, C., Guitard-Ivent, F., and Delvaux, V. Multi-dimensional variation in adult speech as a function of age. Languages 6, 4 (2021), 176.

6. King, S. From African American Vernacular English to African American Language: Rethinking the study of race and language in African Americans' speech. Annual Review of Linguistics 6, 1 (2020), 285–300.

7. Boberg, C. Ethnic dialects in North American English.

Authors

Arathi Sethumadhavan is the head of research for Ethics & Society at Microsoft, where she works at the intersection of research, ethics, and product innovation. Prior to joining Microsoft, she worked on creating human-machine systems that enable individuals to be effective in complex environments like aviation and health care. She has a Ph.D. in experimental psychology with a specialization in human factors from Texas Tech University and an undergraduate degree in computer science.

Joe Garvin is a writer for Ethics & Society at Microsoft via Allovus, where he helps share research insights with a broader audience. He has previously written for the University of Washington and City Arts magazine. He has a bachelor's degree in English literature and a master's degree in communication with a specialization in digital media.

Ben Noah is a senior design researcher for the Ethics & Society group at Microsoft (Cloud & AI), where he supports strategy on responsible development of AI technologies, focusing on the collection of diverse datasets. His previous research experience included modeling cognitive workload using eye-tracking metrics and the design of modern operator control systems for the refinery industry. He has a Ph.D. in industrial engineering with a specialization in human factors from Penn State University, and a bachelor's degree in mechanical engineering from the University of Illinois.

Copyright held by authors

The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.
