Graham Pullin, Andrew Cook
There are some commonly held assumptions about the future of speech technology, and our interactions with it, that we wish to challenge. Our goal with "Six Speaking Chairs" is to explore alternative perspectives rather than to converge on solutions at this first stage. We would therefore describe this activity as design research, even though it involves our practice as interaction designers.
We have built a collection of objects, each of which embodies a different way of thinking about tone of voice. It is better to view the chairs as provocations than as prototypes in the conventional sense, because their most important role is to spark discussion rather than to test proposals. The chairs are not to be taken literally. It's not about chairs at all: they are really just illustrations of principles, more so even than sketches of user interfaces. Each illustrates what might otherwise be a rather esoteric mental model, making it accessible to experts and non-experts alike and engaging both in conversation and reflection.
This is a project about more expressive interactions with speech technology, specifically interacting with the tone of voice of synthesized speech. We hesitate to use the term "emotional" speech synthesis because, as we will explain, we are far more interested in the complex nuances of everyday speech than basic emotions such as sadness and fear.
As interaction designers, our focus is not on how to produce different tones of voice with speech technology itself. (We know there are more expert researchers looking into this, and we are collaborating with a world-leading research center on another project.) Instead, we are exploring the implications for a user interface: How might someone who is not a speech technologist conceive of tone of voice in the first place, and therefore select or control it? This is a challenging question.
Without any further preamble, we will unfold the story of the project through the objects themselves, introducing the background, rationale, and inspiration along the way...
Chair No. 1. The Exclaiming/Questioning Chair, the first chair in the collection, is a reclaimed wooden kitchen chair; a plain charcoal-gray box that extends to one side has been fitted beneath the seat. Set into the top surface of this box are three keys from a computer keyboard, marked with a period (full stop), an exclamation point, and a question mark. An old-fashioned metal horn loudspeaker projects from the front edge of the box. While sitting in the chair, if you press the period key, the loudspeaker emits the word "yes" in a level tone. Press the "?" key and "yes" is delivered with a rising, questioning intonation. Pressing the exclamation-point key elicits a louder, more emphatic delivery.
... and the limitations of Text-to-Speech
These three ways to say "yes" represent the typical flexibility of everyday Text-to-Speech (TTS) technology. Yet in our own speech, we employ much greater and more subtle variation in our tone of voice. In any sentence, but in particular in the short interjections at the heart of informal conversation, it is not just what we say, but how we say it, that counts. Whereas in writing, "yes" is usually affirmative, in speech its meaning can be more complex, more layered, and more sophisticated. With the right tone of voice, we can say "yes" to agree, to reassure, to play for time, or even to undermine (without being so blunt as to actually say "no").
TTS is found in the screen-reading software used by many visually impaired people, in other eyes-free interfaces (such as Apple's iPod Shuffle), and in automated telephone answering services. But its most profound application is in communication devices used by people who cannot speak. And it is here that the limitations of TTS can be most disabling, because a lack of variation in tone of voice can never be neutral. A lack of expressiveness can itself send out a false message that the person is emotionally impaired as well as speech-impaired, or perhaps socially unsophisticated. Writing and speaking are fundamentally different ways of conveying language, and yet TTS treats them as if they were equivalent.
Chair No. 2. The Happy/Sad Chair illustrates an alternative approach. On a reclaimed wooden dining chair, a tuning dial (from a 1950s Bush radio) has been relabeled, the international radio stations replaced by a two-dimensional mapping of emotions taken from psychological research. Inside the box, a potentiometer registers the rotation of the tuning dial, and a separate slider controls the degree of emotion. These inputs drive a parametric model of prosody, using granular synthesis and formant resynthesis in Max/MSP, a flexible "experience prototype" standing in for more sophisticated state-of-the-art speech technology. (Our speech technology, while flexible and capable of a high level of nuance and real-time control, sounds far from realistically human. Besides providing an iconic visual representation of spoken announcements, the metal horn speakers have a low-fidelity sound that accentuates the highly artificial sound of the speech. The emphasis is not on its realism, but its expressiveness.)
... and so-called emotional speech
Of course the limitations of TTS have not escaped the developers of speech technology. There are many research projects (and have been for some time) investing synthetic speech with different emotions. The standard approach remains to create the most extreme emotions (usually defined as joy, sadness, anger, and fear) and to interpolate any other emotion between these four and a neutral tone of voice. The Happy/Sad Chair illustrates just this approach.
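The interpolation described above can be made concrete with a small sketch. It is not the chair's Max/MSP patch: it assumes the four extremes sit at 90-degree intervals around the dial, that a slider value between 0 and 1 scales each result toward neutral, and that a tone of voice can be reduced to three prosody parameters. The ordering of the emotions and every number below are hypothetical placeholders.

```python
# Hypothetical prosody settings for illustration only:
# (pitch shift in semitones, speaking-rate multiplier, gain in dB).
NEUTRAL = (0.0, 1.0, 0.0)

# Assumed layout: four extremes spaced evenly around the dial, starting at 0 degrees.
EXTREMES = [
    ("joy", (4.0, 1.15, 3.0)),
    ("anger", (2.0, 1.25, 6.0)),
    ("sadness", (-3.0, 0.80, -4.0)),
    ("fear", (5.0, 1.30, -1.0)),
]


def lerp(a, b, t):
    """Linear interpolation between two parameter tuples."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))


def prosody(dial_degrees, intensity):
    """Blend the two extremes next to the dial position, then scale toward neutral."""
    intensity = min(max(intensity, 0.0), 1.0)       # clamp the slider to 0..1
    sector = 360.0 / len(EXTREMES)                  # 90 degrees per emotion
    position = (dial_degrees % 360.0) / sector
    index = int(position) % len(EXTREMES)           # extreme just before the dial
    fraction = position - int(position)             # how far toward the next one
    first = EXTREMES[index][1]
    second = EXTREMES[(index + 1) % len(EXTREMES)][1]
    extreme = lerp(first, second, fraction)
    return lerp(NEUTRAL, extreme, intensity)


if __name__ == "__main__":
    # Dial halfway between the first two extremes, slider at half strength.
    print(prosody(45.0, 0.5))
```

With the slider at zero, every dial position collapses to the same neutral delivery, which is the sense in which this model treats neutrality as the absence of emotion.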
Manufacturers of AAC (augmentative and alternative communication) devices are expecting speech technology to support this kind of emotional speech in the future. So a more conventional assistive technology project might have taken this emotional model as a given and designed in more detail how this might be translated into a usable interface (even if some assistive technology researchers predict that emotion could be sensed automatically with biosensors, supposedly obviating the need for a conscious user interface at all). But our intent is to challenge these assumptions: we (and other researchers, as the subsequent chairs show) believe that expression of emotion is only one aspect of tone of voice. And that is before even mentioning the moral implications of providing someone with a device that uncontrollably broadcasts their emotional state through tone of voice.
We believe that designers can play a valuable role in disability-related design, not only in refining clinically and technically driven solutions, but also in provoking discussion about the very role of assistive technology, a theme explored in Design Meets Disability. So Chair No. 2 is not the end of the story.
Chair No. 3. The Offering/Seeking Chair goes beyond emotions. We have reservations about using the phrases "emotional speech" and "expressive speech" interchangeably. Our emotions are just one of the things that we express through tone of voice, and even then, the situation is more complicated: when we try to suppress our emotions but they are still discernible, when we feign an emotion, or when two emotions are combined or even in conflict.
Chair No. 3, the Offering/Seeking Chair, is based not just on the emotions of the speaker, but also on the relationship they have with their conversational partner, the social context in which they find themselves, and their conversational intent. It is built on the work of Nick Campbell, a speech corpora researcher. The chair's interface is a series of six toggle switches: two to register the social relationship between conversational partners and two to reflect the conversational intent of any individual utterance. This leaves two switches to register a total of just four emotional states.
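To see how quickly six switches multiply out, the following sketch reads them as three two-bit fields. It assumes each pair of switches is read together as one of four states, and the ordering of the switches is hypothetical; the article does not name the individual relationship or intent categories, so the fields are left as plain numbers.

```python
def pair(high, low):
    """Read two toggle switches as a number from 0 to 3."""
    return (int(bool(high)) << 1) | int(bool(low))


def decode(switches):
    """Split six switch positions into three two-switch fields.

    Assumed order: relationship (2 switches), intent (2), emotion (2).
    """
    if len(switches) != 6:
        raise ValueError("expected exactly six switch positions")
    return {
        "relationship": pair(switches[0], switches[1]),
        "intent": pair(switches[2], switches[3]),
        "emotion": pair(switches[4], switches[5]),
    }


def tone_index(switches):
    """Collapse the three fields into a single index from 0 to 63."""
    state = decode(switches)
    return (state["relationship"] * 4 + state["intent"]) * 4 + state["emotion"]


if __name__ == "__main__":
    example = [True, False, False, True, True, True]
    print(decode(example), tone_index(example))
```

Six on/off switches give 2^6 = 64 combinations in total, with only a quarter of that resolution (four states) spent on emotion.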
... and challenging existing paradigms
We are interested in understanding more about the relative strengths and weaknesses of this approach when compared with the emotional model. The overall number of tones of voice, and therefore the cognitive overhead, is not too dissimilar. But we are gaining complementary sensitivity at the expense of fine emotional control.
Our interactive prototypes allow each to be deployed within a conversational context to assess their effectiveness: not only their clarity or ambiguity, but also the conversational influence and creative expression that they afford. These contexts have to be carefully crafted, given that the chairs are limited to just four words for prototyping reasons. There is a secondary control on the other side of the seat, selecting "no," "really," or "hello" in favor of the default "yes," all four chosen for the importance of tone of voice in their meaning and for their role in engaging and sustaining conversation. This is not without precedent in the study of AAC: Some people with aphasia lose their vocabulary yet nonetheless manage to influence and direct a conversation.
Chair No. 4. The Rising/Falling Chair dispenses with intermediate parameters and gives direct control of the speech sounds: The user traces the desired pitch contour on a flat surface (the touchscreen of a Nintendo DS embedded in the chair) and the intonation and timing of the synthesized speech follow this in real time. This is based on the intonation diagrams that phoneticians use to record speech tones, turned around to define rather than transcribe. We have been surprised by how intuitive this can prove even for people not trained in phonetics. An exercise in hand-ear coordination that yields ever greater levels of control and nuance with practice, it is akin to learning a musical instrument.
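One way such direct control might be wired up is sketched below. How the chair's own software maps the stylus to sound is not described here, so everything in the sketch is an assumption: horizontal position is taken as progress through the word, vertical position as pitch, and the 80 to 320 Hz range and the logarithmic pitch scale are illustrative choices.

```python
SCREEN_WIDTH = 256    # Nintendo DS touchscreen width in pixels
SCREEN_HEIGHT = 192   # Nintendo DS touchscreen height in pixels


def touch_to_control(x, y, low_hz=80.0, high_hz=320.0):
    """Map one stylus position to (progress through the word, pitch in Hz).

    x runs left to right and y runs top to bottom, as on most screens,
    so a higher stylus position (smaller y) gives a higher pitch.
    """
    x = min(max(x, 0), SCREEN_WIDTH - 1)
    y = min(max(y, 0), SCREEN_HEIGHT - 1)
    progress = x / (SCREEN_WIDTH - 1)              # 0.0 at the left edge, 1.0 at the right
    height = 1.0 - y / (SCREEN_HEIGHT - 1)         # 0.0 at the bottom, 1.0 at the top
    # Interpolate on a logarithmic scale so that equal stylus distances
    # correspond to equal musical intervals.
    pitch_hz = low_hz * (high_hz / low_hz) ** height
    return progress, pitch_hz


if __name__ == "__main__":
    # A rising, questioning contour traced from bottom left to top right.
    for x, y in [(0, 191), (128, 120), (255, 0)]:
        print(touch_to_control(x, y))
```

Because each stylus sample is converted on its own, the speed at which the contour is drawn carries the timing, which is one reading of "in real time" above.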
... and curating design collections
This manipulation of the voice builds on Pullin's "Speaking Mobile," in which the intonation of synthesized speech was controlled with a thumb joystick (itself inspired by Kempelen's "Speaking Machine"). This radical mobile phone allowed expressive telephone conversation to be silently and discreetly held in public places, without disturbing other people. It was one of five concepts in "Social Mobiles," a collaboration between IDEO and Crispin Jones led by Jones and Pullin. This critical design project provoked discussion about the anger and frustration caused by other people's mobile phones.
"Social Mobiles" inspired the format of the "Six Speaking Chairs" as a design collection. A design collection, by which we mean a series of designs conceived to be considered collectively, can provoke more divergent discussion than more conventional design concepts. The allusion to museum collections is appropriate: The six chairs each represent a way of thinking about speech encountered on our interdisciplinary travels and brought back home. Our mental models, like the physical controls and the chairs themselves, are all found objects, reappropriated. Their physical form (the anonymous chairs, the plain gray boxes, and the familiar controls out of context) is designed to reflect this, designed to look not so much designed as displayed. The project is as much an exercise in curation as creation.
Chair No. 5. The Reassuring/Undermining Chair offers more abstract control. The user hits a drum pad, and the way in which it is hit and the material with which it is hit determine the timbre and intonation of the utterance. Confident hitting will trigger "yes" and "no" responses in different tones of voice; lighter tapping will produce the type of paralinguistic sounds (for instance, "uh-huh" and "ye-yeah") that lubricate conversation, encouraging or undermining the other person while they are speaking without interrupting them, yet which are not even part of current TTS systems.
As with any musical instrument, practice would be required to explore this relationship between action and sound. We are interested in how accessible, rewarding, and expressive this might be for people with differing musical abilities. This approach has been informed and inspired by Cook's "Tactophonics," an exploration of intuitive interaction with computer music underpinned by the concepts of affordance and expressive performance through objects as varied and unexpected as tree branches and baseball bats.
... and materiality in interaction design
This has also led us to consider a role for physical materials as part of the user interface. How might the qualities of materials translate into voice quality? Waxy, woolly, glassy, porcelain, dark chocolate, even "rich fruitcake" might all bring to mind a particular voice quality. Here, as elsewhere in interaction design, material qualities offer underexplored opportunities for associations that are familiar and intuitive, yet abstract and open to interpretation.
Chair No. 6. Lastly, the Terse/Roaring Chair has 17 doorbells, each of which offers a different description of tone of voice, from coaxing to coyly, from whimpering to whispering. These descriptors are taken from stage directions by the playwright George Bernard Shaw.
In communication devices, a selection rather than a manipulation may be more appropriate after all, since speech impairment is so often associated with other physical impairments. But 17 choices are many more than existing AAC devices offer. The "Tango!" by Blink Twice, in many ways the state of the art, allows a child to speak, shout, whisper, and whine (note that none of these are emotions). This is wonderfully expressive for a young child, but for adults, tone of voice is richer still and more finely nuanced, and more personal.
... and 17 ways to say yes
Alongside the doorbells is a recess containing a white pencil with the simple instruction: "please customize," which hints that we intend people to choose their own 17 ways. We have asked AAC users and other experts which tones they would choose, were they to be limited to just (just!) 17.
Shaw's descriptions are more heterogeneous than any of the mental models we have borrowed from academic sources. The unified model of the academic is eschewed in favor of the pragmatism of the practitioner, whose goal after all is practical, rather than theoretical. This of course is also the motivation of our future users, and we are not surprised to see this degree of heterogeneity in their responses, too. They showed a level of diversity, both between people and within individuals' 17 choices, that begins to suggest that any single mental model may never be sufficient.
We believe the challenge ahead, on the project that is already following this, is to embrace this complexity of coexisting models while maintaining simplicity of interaction. It does sound very challenging, but at the same time, it is an appropriate contribution for interaction designers to make to this fascinating interdisciplinary area.
1. Pullin, G. and Cook, A. "Six Speaking Chairs." August 2008. http://www.imd.dundee.ac.uk/sixspeaking-chairs/
11. IDEO. "Social Mobiles." 2002. http://www.ideo.com/work/item/social-mobiles/
16. DynaVox. Tango! 2009. http://www.dynavoxtech.com/products/tango/
17. Speaking Unit. "Speech Hedge." July 2010. http://speakingunit.co.uk/speech-hedge/
Graham Pullin is the course director of Digital Interaction Design at the University of Dundee in Scotland, author of Design Meets Disability, and curator of the Museum of Lost Interactions (MoLI). He spent nine years at IDEO as an interaction designer, project leader, and studio head.
Andrew Cook is a graduate of Interactive Media Design (the previous name of Digital Interaction Design) and an interaction designer currently completing doctoral research. He is also a musician under the moniker Samoyed and co-founder of the record label flask.
©2010 ACM 1072-5220/10/0900 $10.00