Speech is human and multifaceted. Our approach to studying it should be the same.

Authors: Arathi Sethumadhavan, Joe Garvin, Benjamin Noah
Posted: Wed, June 22, 2022 - 9:57:27

Whether it’s the friendly virtual assistant in your smart speaker, the auto-generated captions on your YouTube video, or the software that physicians use to dictate clinical notes, voice AI has already become a fixture of modern life. It’s the promise of hands-free convenience: Simply speak naturally, and the computer listens, analyzes, and recognizes what you’re saying. With things like voice-controlled homes and cars on the horizon, our relationship with voice AI is set only to deepen. But the task of building speech recognition technology remains a tall order: We want it to work well for everyone, but there is no one way of speaking.

People speak differently depending on how old they are or where they live, but less obvious demographic factors like socioeconomic status and education can also play a role. Layers of intersecting variables come together to influence how we verbally express ourselves. We humans use language like code-switching chameleons, reflecting and creating culture every time we talk. Speech, in other words, is so much more than a mechanical system of words. It is organic and fluid, forever entangled with identity, values, and self-expression.

So it makes sense that studying speech to improve AI models would be more than just an engineering job. Teams grappling with complex problems involving people need diverse representation across disciplines. Working alongside engineers and data scientists, social science experts like sociologists, anthropologists, and sociolinguists can offer essential insight and context while navigating the intricacies of our many voices.

Lingering inequalities in voice AI

Use of computer speech recognition extends far beyond asking Alexa or Siri to play a song or report the weather. Court reporters use it to generate transcriptions; people with disabilities use it to navigate phones and computers; and companies use it to make hiring and firing decisions. As voice AI has proliferated in recent years, overall accuracy and reliability have improved dramatically. But even state-of-the-art speech recognition tech does not work well for everyone.

A Stanford University study from 2020, for example, tested automated transcription services from major tech companies and found large disparities between ancestry groups. The cause? Insufficient audio data when training the models, the study suggests. In other words, the voice AI powering the services was trained on datasets that left out many ways of speaking. Beyond ancestry, speech recognition systems also struggle with the accents of non-native English speakers, regional accents, and the voices of women.

Biased AI starts with a biased worldview

These divides in voice AI have been documented for years, and so have the data-collection missteps that perpetuate them. Why, then, is collecting enough of the right speech data such a stubborn problem? One factor at play here is the out-group homogeneity effect, the cognitive bias that assumes people unlike you are more similar to each other than they really are. “They are alike; we are diverse,” or so the bias would have you believe.

Especially when classifying language, the bias is insidious. Consider, for instance, how all people over 60 are often lumped together into a single group: “older people.” This broad category might be useful in some contexts, but relying on it when studying speech would be irresponsible. Data shows that the way we talk continues to change as we age into our 60s and beyond, and that it even changes differently depending on gender. Tremendous variation exists within the group “older people” that deserves attention. But if someone isn’t 60 or older themselves, out-group homogeneity bias might blind them to all that variation.

Even the terms commonly used to describe Black and African American language—African American Vernacular English (AAVE), African American English (AAE), or African American Language (AAL)—can themselves be seen as examples of the out-group homogeneity effect. In reality, a language variety is never exclusively part of an entire demographic group, and there could also be people of different ancestry groups who happen to speak similarly. When it comes to studying speech for voice AI, creating false monoliths out of certain subgroups isn’t just inaccurate, it’s dangerous. Glossing over meaningful differences in the ways people speak shuts them out of tomorrow and leaves their voices unheard.

The many nuances of language

Many different factors play into speech. Some might be obvious, like where you live or whether the language you’re speaking is your native one. But other factors like health, education, and even historical migration patterns also play significant roles in shaping how a person speaks. Social factors like these contribute to linguistic variation. Anthropological linguists go a step further, suggesting that these factors actually construct social meaning. In this view, when someone speaks, their voice is doing so much more than simply reflecting their region or ancestry: It’s expressing an identity.

Our gender identity, for instance, can influence how we talk. Culture or ethnic association can also influence how our speech develops, how we use it, and how it may evolve. When we define a specific variation of speech, therefore, we must include these societal factors as its foundational pillars.

Level of fluency, education, gender identity, the age at which a language was learned—which of these many interconnected factors are most decisive in shaping the ways we speak? It’s crucial information to have, as it quickly becomes unwieldy to account for every possible aspect that determines speech. When collecting samples of spoken language to train a new speech model, for example, there are real-world limitations affecting what can be collected: time, money, personnel, and geography, to name a few. Prioritizing among all these social factors is a complex job, one beyond the narrow scope of any one discipline.
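Those real-world limits make dataset design concrete: before recording a single voice, a team must decide how a fixed budget of recording sessions is spread across intersecting factors. Here is a minimal sketch in Python, assuming the simplest possible policy of an even split across every combination of factor levels; the factor names and levels are purely illustrative, not a recommended taxonomy.

```python
from itertools import product

def allocate_quotas(factors: dict, total_sessions: int) -> dict:
    """Spread a fixed session budget evenly across every combination of factor levels."""
    cells = list(product(*factors.values()))
    base, remainder = divmod(total_sessions, len(cells))
    # Hand the leftover sessions out one at a time to the first cells.
    return {cell: base + (1 if i < remainder else 0) for i, cell in enumerate(cells)}

# Illustrative factors only: real studies would choose levels with
# sociolinguists, and weight cells rather than splitting evenly.
quotas = allocate_quotas(
    {"age_band": ["18-39", "40-59", "60+"],
     "region": ["urban", "rural"]},
    total_sessions=100,
)
```

Even this toy version shows why prioritization is hard: every factor added multiplies the number of cells, and a fixed budget thins out fast.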

A multidisciplinary approach

To build speech recognition technology that works well for everybody, we need to capture the right diversity in our data-collection strategies. This involves turning toward those nuances in language, being attentive and curious about them. We know that we want to capture an accurate picture of the incredible variety of human speech, and we also know that many complex dynamics are at play. This calls for a multidisciplinary approach for a better-informed, more inclusive perspective.

An engineer might notice the different word-error rates between demographic groups, for example, but a sociolinguist can help explain the speaking patterns at play, how those patterns show up across communities, and the historical reasons they emerged. A data scientist can tell you how many people from which groups need to be sampled. Sociologists, demographers, and anthropologists can speak to social behaviors and psychology, aspects that illuminate the subtleties of language. Domain experts like these offer invaluable insights and context, and involving them early on helps us design better datasets that capture human diversity.
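To make the engineer’s side of that collaboration concrete, here is a minimal sketch of a per-group word-error-rate (WER) audit in Python. The group labels and sentence pairs used in practice would come from held-out test sets gathered with each community; everything here is a self-contained illustration, not any particular vendor’s evaluation pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def wer_by_group(samples):
    """Average WER per group; samples is a list of (group, reference, hypothesis)."""
    scores = {}
    for group, ref, hyp in samples:
        scores.setdefault(group, []).append(word_error_rate(ref, hyp))
    return {g: sum(s) / len(s) for g, s in scores.items()}
```

A gap between groups in the resulting averages is the signal an engineer can measure; explaining why the gap exists, and what data would close it, is where the sociolinguist comes in.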

Toward more equitable voice AI

Even with the help of other disciplines, building speech recognition systems is incredibly hard work. Training a voice AI model requires a huge amount of speech data. Collecting this data means bringing in lots of people from different population groups, some of whom are difficult to access and recruit, and some of whom, like Native Americans and the First Nations peoples of Canada, haven’t had their speech studied extensively. And when subjects are finally recruited, they need to be taken to a noise-controlled recording facility, asked to follow specific directions, and instructed to read aloud carefully designed paragraphs of text. 

The process of creating voice AI is painstaking and resource-intensive as it is—detecting and reducing bias on top of it all only makes the work harder. And yet, we must be up to the task. The fact is that today’s speech recognition systems, trained on largely homogenous datasets, still don’t work for all groups of people. This is more than a matter of services performing poorly; it’s a matter of dignifying real ways of speaking, real identities and cultures. We must first acknowledge that this problem exists and educate teams about ways of building more equitable voice AI. Then we need to act. Given the intricate social and cultural dimensions of speech, that means teaming up with experts from relevant disciplines. With the help of experts like social scientists, product teams are better equipped to think carefully about inclusive dataset design and to devise creative approaches to thorny data-collection obstacles.

Human speech poses engineering problems that transcend the technical. Building voice AI is in fact a sociotechnical endeavor, one that requires diversity in disciplines. The stakes are high, but with an intentional focus to seek out our blind spots, we can collaborate to build voice AI that truly works for everyone.


Arathi Sethumadhavan

Arathi Sethumadhavan is the head of research for Ethics & Society at Microsoft, where she works at the intersection of research, ethics, and product innovation. She has brought in the perspectives of more than 13,000 people, including traditionally disempowered communities, to help shape ethical development of AI and emerging technologies such as computer vision, NLP, intelligent agents, and mixed reality. She was a recent fellow at the World Economic Forum, where she worked on unlocking opportunities for positive impact with AI to address the needs of a globally aging population. Prior to joining Microsoft, she worked on creating human-machine systems that enable individuals to be effective in complex environments like aviation and healthcare. She has been cited by the Economist and the American Psychological Association and was included in LightHouse3’s 2022 100 Brilliant Women in AI Ethics list. She has a Ph.D. in experimental psychology with a specialization in human factors from Texas Tech University and an undergraduate degree in computer science.

Joe Garvin

Joe Garvin is a writer for Ethics & Society at Microsoft via Allovus, where he helps share research insights with a broader audience. He has previously written for the University of Washington and City Arts magazine. He has a bachelor’s degree in English literature and a master’s degree in communication with a specialization in digital media.

Benjamin Noah

Ben Noah is a senior design researcher for the Ethics & Society group at Microsoft (Cloud & AI), where he supports strategy on responsible development of AI technologies, focusing on the collection of diverse datasets. His previous research experience included modeling cognitive workload using eye-tracking metrics and the design of modern operator control systems for the refinery industry. He has a Ph.D. in industrial engineering with a specialization in human factors from Penn State University, and a bachelor’s degree in mechanical engineering from the University of Illinois.
