Sara Basson, Peter Fairweather, Vicki Hanson
Many were first seduced by the promise and potential of speech technology after observing HAL’s conversations with Dave Bowman in 2001: A Space Odyssey. Progress in the accuracy of speech recognition and the clarity of speech synthesis coaxed many to believe that tasks performed with technology, especially the computer, would be accomplished more quickly, more easily, and, perhaps, less painfully through speech-enabled interfaces.
Technologies diffuse at different rates: instantly, like the ATM, or ploddingly, like the fax (Scottish inventor Alexander Bain built the first one in 1843!). No one knows how long it will take for speech-enabled devices to reach everyday use. The vision of controlling electrical appliances, managing the home or workplace environment, or engaging the mounting complexity of modern automobiles should be embraced by mainstream populations and could be particularly beneficial to older adults.
Reduced sensory, motor, and cognitive abilities are normal correlates of aging. These reductions can impair older adults’ ability to interact successfully with their environment, especially with new forms of technology. Service and product designers might respond to older adults’ declining abilities with interfaces made simpler or more redundant through the use of alternate modalities such as speech recognition and synthesis.
While text-to-speech and speech recognition engines are built upon 40 years of research and development, our understanding of how to use these technologies is still in its infancy. With our aging population, the opportunity to experiment with speech-enabled interfaces becomes an interesting test case for how these technologies can be harnessed for users who have difficulties with traditional interfaces. Such need notwithstanding, we must be mindful that effective design with speech is not obvious. Measurement of interaction patterns and performance involving speech has been a litany of surprises.
Speech Technology Performance and Design Considerations
Speech interfaces are generally designed with the goal of increasing user satisfaction, productivity, and convenience. We consider here speech input and output in the interface.
Speech is often referred to as the "natural" interface, compared with writing or typing, which are "learned" activities. Speech is "natural," however, when it allows the discourse flexibility of conversational interaction. For most speech applications, speech outside of the application input results in errors or system failure. The "naturalness" of speech interaction dissolves if it requires users to eliminate all extraneous conversation. Users can be engaged in speech-automated dialogue, while also opting to make a comment to another person in the room. If a momentary side conversation confuses the automated system, then the interaction with the speech system will be more frustrating than "natural."
Determining the usability of speech interfaces requires performance assessments for using the raw technologies. Even a well-designed speech-based application will be undesirable if the user’s speech is consistently misrecognized or the messages back to the user are not apprehended. The interaction requirements of an older population, however, will do even more to determine success in using speech-automated systems. If speech applications can be designed and constructed successfully, there is considerable value in making these tools available to older users.
In terms of input, automated voice response systems are becoming pervasive worldwide. Speech interfaces can eliminate the need to recall associations between arbitrary digits and actions presented in the voice-response menus. For example, a system that queries a user with "How may I help you?" is likely more usable and intuitive than a system that prompts a user to "Press 1 for service, press 3 for repair, or press 7 to speak to a customer representative."
Karl and colleagues [3] observed cognitive load induced by an interface enabled for speech recognition. Users of a speech-enabled word-processing system enjoyed an improvement on operations like "page up," "bold face," or "italics" over those using keyboard-only systems. However, when these two groups were asked to remember some mathematical symbols, perform a "page down," and enter the symbols on the next page, those using speech performed more poorly and had to keep going back to the first page to examine the symbols again. The demands on memory and attention imposed by having to retrieve and apply the spoken commands were enough to interfere with the task of learning the symbols. We mention this finding not because it concerns older adults specifically, but because it is emblematic of the subtle, deep, and "invisible" effects that accompany even slightly complicated interactions with the environment. Designers of speech-enabled interfaces must pay attention to the additional cognitive load that may accrue when speech-related tasks are added to the set of things users must perform.
Studies on older adults using speech recognition paint a complex picture. For example, it has been found that older adults read text into a computer more slowly than younger users and that older adults’ speech appears to be recognized less successfully than the speech of younger speakers. On the other hand, older users believe that speech recognition will be very useful, and they do not require longer periods of training in order to calibrate the speech software, even though the voice spectra of older adults are more variable than those of younger adults.
One cognitive trait that sometimes declines with advancing age involves the ability to inhibit behavioral and cognitive responses (Kramer et al., 1992). Speech-enabled interfaces provide many opportunities where such deficits could impair performance. For example, as Hanson and her colleagues describe, when older adults are presented with an interface that responds to spoken directives, they frequently cannot stop talking after giving a command. Retrieving a speech command sometimes results in the recovery of some intermediate product of the search and sort task that "pops out" prematurely instead of being suppressed. This sometimes frustrates older adults because they are aware that they "know" the correct response in spite of the intrusion of the other response.
Attempts to use speech output with older adults have also brought mixed results, with users’ preferences and performance differing according to the kind of speech output an automated system provides. Speech output can vary along a number of dimensions. The speech can be recorded or synthesized, the voice can be male or female, and delivery quality can vary along a range of other qualitative dimensions (fast or slow speech, authoritative or neutral delivery, etc.). Some work has begun to give a picture of preferred speech output options for older adults [4]. For example, users appear to prefer male voices over female voices, and natural speech over synthetic. The ability to comprehend speech declines with age, however, and the decline is more pronounced for synthetic than for natural speech.
A number of design best practices have been proposed based on studies of older adults engaging with speech-automated systems. In particular, Zajicek has identified "patterns" for designing speech systems for older adults [5]. These patterns include the following recommendations:
- Keep output messages as short as possible. The "menu choice" message should offer at most three selections.
- Reduce choices wherever possible.
- Use mnemonic letters to indicate key press menu selections.
- Include confirmation messages, increasing the users’ confidence in the interaction.
- Create "context-sensitive" help messages.
A dilemma would emerge if successful design for older users were to run counter to the preferences and desired patterns of younger populations. Importantly, however, Zajicek suggests that the patterns recommended for older users tend to represent improved design for the population at large.
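A few of the patterns listed above can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions, not Zajicek's own formulation: the function names, the prompt wording, and the use of a spoken "more" option to page through long menus are all hypothetical. It shows one way to cap each menu message at three selections and to echo a confirmation back to the user.

```python
from typing import List

MAX_CHOICES = 3  # pattern from the text: at most three selections per menu message


def split_into_menus(choices: List[str], limit: int = MAX_CHOICES) -> List[List[str]]:
    """Split options into menus of at most `limit` selections each.

    Assumption (not from the article): when options overflow one menu,
    the last slot is given to a spoken "more" option that leads onward.
    """
    menus = []
    remaining = list(choices)
    while remaining:
        if len(remaining) <= limit:
            menus.append(remaining)
            remaining = []
        else:
            menus.append(remaining[:limit - 1] + ["more"])
            remaining = remaining[limit - 1:]
    return menus


def spoken_list(items: List[str]) -> str:
    """Join items the way a short spoken prompt would read them."""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} or {items[1]}"
    return ", ".join(items[:-1]) + f", or {items[-1]}"


def menu_prompt(menu: List[str]) -> str:
    """Keep the output message as short as possible."""
    return f"You can say {spoken_list(menu)}."


def confirmation(choice: str) -> str:
    """Echo the recognized choice to build the user's confidence."""
    return f"You chose {choice}. Is that right?"


if __name__ == "__main__":
    for menu in split_into_menus(["service", "repair", "billing", "agent"]):
        print(menu_prompt(menu))
    print(confirmation("repair"))
```

Whether a "more" option is itself an acceptable selection for older users is an open design question that this sketch does not settle.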
Tasks as Cognitive Artifacts
To explore how well older users could operate a speech-enabled Web browser, Hanson and her colleagues observed users where the speech recognition was provided by an experimenter. This experimenter operated "behind the curtain" in a "Wizard of Oz" protocol so that the older adults involved in the study were able to operate the browser using speech, but without any problems associated with faulty speech recognition.
Removing the problems of recognition accuracy and the understandability of synthesized speech permitted these investigators to directly examine how well the users could perform tasks by operating indirectly on them through a layer of speech. Indeed, one of the more prominent observations derived from that indirection itself. "Speech-enabled Web browsing" is really "speech-enabled point-click-and-scroll browsing," where speech is used to carry out what would normally be done more intuitively with a mouse and a keyboard. Successful users need to utter sequences like "scroll down," "scroll down," "back," "back," "tab," "tab," "tab," "return," mirroring the successful user at a manual interface. A user may well know what he or she wants to achieve, but might not have the lexical jargon needed to articulate commands.
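The vocabulary problem described here can be illustrated with a minimal sketch. The command table and action names below are hypothetical and are not taken from the browser used in the study. The sketch assumes perfect recognition, as in the Wizard of Oz setup, so that any failure comes from the wording and not from the recognizer.

```python
from typing import Optional

# Hypothetical command lexicon: each spoken phrase stands in for one
# point-click-and-scroll action of a manual interface.
COMMANDS = {
    "scroll down": "SCROLL_DOWN",
    "scroll up": "SCROLL_UP",
    "back": "GO_BACK",
    "tab": "FOCUS_NEXT",
    "return": "ACTIVATE",
}


def interpret(utterance: str) -> Optional[str]:
    """Map a perfectly recognized utterance to an action, or None if
    the wording is not part of the command lexicon."""
    return COMMANDS.get(utterance.strip().lower())


if __name__ == "__main__":
    # A user who knows the jargon mirrors the manual sequence.
    for spoken in ["scroll down", "scroll down", "back", "tab", "return"]:
        print(spoken, "->", interpret(spoken))

    # A user who knows what they want, but not the required term.
    print(interpret("go further down the page"))
```

In this sketch the second user's intent is clear to a human listener, yet the interface cannot act on it because the phrase is not in the lexicon.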
One of the Hanson team’s findings was that success with a speech-enabled interface was possible only if the user could perform the underlying task and had some declarative knowledge to draw upon to direct the speech interface. Declarative knowledge is loosely defined as "knowing what," as opposed to procedural knowledge that encodes "knowing how." This leads us to a paradox: An expert in the performance of a task using technology may be unable to perform the task because a speech interface demands the use of particular terms, terms that a procedural expert may never have known or may have forgotten.
Where Are We and Where Should We Go?
We have tried to show that there are critical design issues with speech-enabled interfaces, many of which are independent of the accuracy of speech recognition or the clarity of speech synthesis. Although progress toward greater accuracy is essential to some applications and an advantage to all, other issues demand attention. How does the user understand the task? How does the task change if speech is introduced? What demands does such a change put on the user’s understanding? In order for these technologies to add value, the applications designed need to consider the typical physical, perceptual, and cognitive changes that occur with aging.
But these are issues that affect all users at the interface, not just older adults. If anything, we should cherish older users, for they will be the first to detect if we have failed to think carefully about the design and application of speech-enabled interfaces.
2. Jastrzembski, T., N. Charness, P. Holley, and J. Feddon. "Aging and Input Devices: Voice recognition performance is slower yet more acceptable than a lightpen." Proceedings of Human Factors & Ergonomics Society: Orlando, FL, 2005.
3. Karl, L., M. Pettey, and B. Shneiderman. "Speech Versus Mouse Commands For Word Processing Applications: An empirical evaluation." International Journal of Man-Machine Studies 39, no. 4 (1993): 667-687.
4. Lines, L. and K.S. Hone. "Older adults’ evaluations of speech output." In Proceedings of the Fifth International ACM Conference on Assistive Technologies (Edinburgh, Scotland, July 8-10, 2002). Assets ’02. New York: ACM Press, 2002, 170-177.
5. Zajicek, M. "Patterns for encapsulating speech interface design solutions for older adults." In Proceedings of the 2003 Conference on Universal Usability (Vancouver, British Columbia, Canada, November 10-11, 2003). CUU ’03. New York: ACM Press, 2003, 54-60.
About the Authors
Sara Basson works in IBM Research, where she is driving strategy and business opportunities for speech transcription technology. Sara holds an M.B.A. in management/marketing from New York University’s Stern School of Business, and a Ph.D. in speech and hearing sciences from The Graduate Center of the City University of New York. She was recently granted an honorary doctorate degree from St. Mary’s University in Halifax, Nova Scotia, in recognition of her extensive activities using speech technologies to enhance accessibility in the educational arena. She serves on the executive board of directors of AVIOS, a speech technology applications professional organization. Sara is also on the editorial board of the International Journal of Speech Technology.
Peter G. Fairweather is a research staff member in the accessibility research department at the IBM T.J. Watson Research Center, where he investigates methods to support task performance and to provide access to technology, particularly for persons with disabilities or deficits. Fairweather earned a doctorate in reading and language (psycholinguistics). His prize-winning dissertation concerned experimental validation of case grammars as means to describe certain parts of understanding and remembering textual material. He has published more than 50 papers in the area of learning technologies, assistive technologies, and accessibility, and has been awarded 18 patents in these areas. Currently he works with members of the accessibility group to develop a set of accessibility preference Web services to quantitatively distinguish how different types of people move through a website, and how best to realize different models or metaphors of pointing with cursors.
Vicki Hanson has been involved in working with people with disabilities for nearly 30 years. From 1978 to 1986, she conducted research in the areas of American Sign Language (ASL) and reading, first as a Postdoctoral Fellow at the Salk Institute and then as a Research Associate at Haskins Laboratories. She joined IBM in 1986 and currently manages the Accessibility Research group at IBM’s T. J. Watson Research Center. Holding a doctorate in cognitive psychology, her primary research areas consist of computer usability and Web access, reading and language processing in learning environments, and approaches to improving computer technology to make it more useful and usable for older adults. She is chair of ACM’s Special Interest Group on Accessible Computing (SIGACCESS) and has chaired their ASSETS’02 conference on assistive technologies. She is the founder and co-Editor-in-Chief of ACM Transactions on Accessible Computing and was named ACM Fellow in 2004 for her contributions to computing technologies for people with disabilities.
©2007 ACM 1072-5220/07/0700 $5.00