Department of Information Systems Engineering
Kochi University of Technology
Department of Communication Engineering
Tokyo Denki University
Pen-based systems with small, touch-sensitive screens are established in the computer market, but some basic issues have been neglected. Selection strategies for interactive systems are usually chosen without quantitative consideration. In small, pen-based systems, targets get smaller as the amount of information required on the screen increases. We therefore evaluated selection strategies for small targets and quantified the factors that affect the efficiency of strategies on pen-based systems.
We developed two state-transition models that efficiently describe the various states and stages in 2-D and 3-D selection strategies. These include all the normal conditions for these types of pen-based systems: pen in contact or not in contact with the screen; pen switched on or switched off; and pen inside or outside the target area. The models can be easily modified and adapted to describe or design strategies for touch-sensitive tablets and electromagnetic (3-D) tablets (see Figure 1).
We conducted two experiments to determine the best individual strategy and the best strategy group. We tested and analyzed six strategies that we felt were a balanced representation of 2-D and 3-D strategies in general. The strategies were compared individually and in groups, according to their essential characteristics. We also investigated the effects of variables (pen-movement-distance, pen-movement-direction, and target size) and the relationships between interaction states, routes, and strategy efficiency.
The results showed that the "Slide Touch" strategy was significantly better overall than the other strategies. In this strategy the target is selected at the moment the pen tip touches the target for the first time after landing on the screen surface. The "In-Out" strategy group was more efficient than the other groups, particularly in situations in which other potential targets do not exist near the target. In this group the target is selected by contact either inside or outside the target.
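The "Slide Touch" rule described above can be sketched in a few lines, assuming pen input arrives as a sequence of samples that each carry a position and a contact flag. The `PenSample` and `CircleTarget` types, the function name, and the circular target shape are illustrative assumptions and are not taken from the study's software.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional, Sequence


@dataclass(frozen=True)
class PenSample:
    """One hypothetical pen reading: position plus screen-contact flag."""
    x: float
    y: float
    in_contact: bool  # True while the pen tip touches the screen


@dataclass(frozen=True)
class CircleTarget:
    """Hypothetical circular target, described by center and radius."""
    cx: float
    cy: float
    radius: float

    def contains(self, x: float, y: float) -> bool:
        return hypot(x - self.cx, y - self.cy) <= self.radius


def slide_touch_select(samples: Sequence[PenSample],
                       target: CircleTarget) -> Optional[int]:
    """Return the index of the first sample at which the pen tip is on the
    screen and inside the target, or None if that never happens.

    The pen may land directly on the target or land outside it and slide in;
    either way, selection happens at the first in-contact touch of the target.
    """
    for index, sample in enumerate(samples):
        if sample.in_contact and target.contains(sample.x, sample.y):
            return index
    return None


# Usage: the pen hovers over the target, lands outside it, then slides in.
target = CircleTarget(cx=0.0, cy=0.0, radius=2.5)
trace = [
    PenSample(0.0, 0.0, False),   # hovering above the target: no selection
    PenSample(10.0, 0.0, True),   # lands outside the target
    PenSample(6.0, 0.0, True),    # sliding toward the target
    PenSample(2.0, 0.0, True),    # first contact inside the target
]
print(slide_touch_select(trace, target))
```

Hovering over the target does not count in this sketch, because the rule requires the pen tip to be on the screen surface when it first touches the target.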
We found that the differences between strategies are not affected by the distance to the target or the direction of pen movement; however, the differences between strategies are influenced by variations in target size. We also found that the "smallest maximum target size" of five pixels (1.80-mm-diameter circle) is the boundary value. When a target is smaller than five pixels, the choice of strategy needs particular attention because the error rates of the strategies are affected differently. For targets larger than five pixels, the error rates of the strategies are not significantly affected.
The state-transition models helped us identify two principles that make one strategy more efficient than another, similar strategy: (1) when the initial and the final states are the same, the selection strategy that has more possible routes is the more efficient of the two, and (2) when the initial and final states of any two strategies are the same but one of the strategies has more possible final states, the selection strategy that has more final states is more efficient.
These observations establish priorities in hardware and software design, particularly in designing new and better selection strategies.
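The first principle above compares strategies by the number of possible routes between the same initial and final states. A minimal sketch of counting such routes follows, assuming the model is treated as an undirected graph over states a, b, c, and d. The 2-D transitions a-b, b-c, and a-c follow the routes named in the Figure 1 caption; the 3-D transitions a-d and d-c are a hypothetical choice made only for illustration, and the function names are likewise assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple


def undirected(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an adjacency map in which every transition works both ways."""
    edges: Dict[str, Set[str]] = {}
    for u, v in pairs:
        edges.setdefault(u, set()).add(v)
        edges.setdefault(v, set()).add(u)
    return edges


def simple_routes(edges: Dict[str, Set[str]],
                  start: str, goal: str) -> List[List[str]]:
    """List every route from start to goal that visits no state twice."""
    routes: List[List[str]] = []

    def walk(state: str, path: List[str]) -> None:
        if state == goal:
            routes.append(path)
            return
        for nxt in sorted(edges.get(state, ())):
            if nxt not in path:
                walk(nxt, path + [nxt])

    walk(start, [start])
    return routes


# 2-D model: transitions taken from the routes given in the figure caption.
MODEL_2D = undirected([("a", "b"), ("b", "c"), ("a", "c")])

# 3-D model: the transitions involving state d are ASSUMED for illustration.
MODEL_3D = undirected([("a", "b"), ("b", "c"), ("a", "c"),
                       ("a", "d"), ("d", "c")])

for name, model in (("2-D", MODEL_2D), ("3-D", MODEL_3D)):
    routes = simple_routes(model, "a", "c")
    print(name, len(routes), [" -> ".join(r) for r in routes])
```

Under these assumptions, a strategy that accepts more of the listed routes from state a to state c would be the one the first principle predicts to be more efficient.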
Steve Benford, Chris Greenhalgh and Mike Craven
Department of Computer Science
The University of Nottingham
Nottingham, United Kingdom
BT Adastral Park
Martlesham Heath, United Kingdom
London, United Kingdom
Since 1996 we have explored the idea of producing online television shows in collaborative virtual environments (CVE), an idea that we call "inhabited television". Members of the public and professional actors meet online in a shared virtual world where, through their avatars, they take part in a television show. Invisible virtual cameras within the world capture the action, which is then broadcast to conventional television viewers. There are therefore two ways to experience an inhabited television show: as a normal viewer, which is like watching an improvised animation, or as an "inhabitant," metaphorically stepping through the television screen to become part of the show.
Collectively we have staged several experimental inhabited television shows. Three early experiments provided valuable insight into the difficulties of achieving the kind of fast-paced, highly coordinated online action that is required to make television. In the NOWninety6 poetry performance (Figure 1), poets performed simultaneously to an online audience of avatars in a virtual world and also to a live theatre audience. The Mirror was a collaboration of Illuminations Television, British Telecom (BT), British Broadcasting Corporation, and Sony. It involved establishing an online community of avatars running alongside a conventional television series. The conventional TV shows provided inspiration for events within the worlds. Edited highlights from the worlds were shown on subsequent programs. Heaven and Hell Live!, a collaboration of Illuminations, BT, Channel 4, and Sony, was an attempt to stage an hour-long game show in a virtual world that was broadcast live on national television in the United Kingdom.
These early experiments raised some fundamental problems for inhabited TV and online performances in general. Can members of the public become successfully involved in an online TV show? Is it possible to establish fast-paced social interaction within a CVE? Is it then possible to produce a coherent and engaging broadcast of this action for external viewers?
A fourth, more recent experiment, Out of This World (OOTW), addressed these questions head on. Like Heaven and Hell Live!, OOTW was a live game show. Through a combination of world design and game design principles, and dedicated production tools, we were able to create a 45-minute-long show that had a coherent structure and a relatively fast pace and that resulted in a broadly coherent broadcast. The production software included new virtual camera control interfaces for human camera operators as well as a participant management interface for a "world manager" that could be used to shepherd participants to key locations at key moments.
However, in spite of its coherence and structure, discussions with viewing audiences (OOTW was "broadcast" into a theatre) revealed problems with the content of OOTW. It was criticized for its clichéd format, and audiences found it difficult to empathize with its characters and situations. Subsequent work has focused on creating more dramatic forms of inhabited television, involving members of the public in richer roles and more complex spatially distributed narratives. Our most recent experiment, Avatar Farm, was a two-hour-long drama that involved four members of an existing online community playing their established characters in an improvised drama alongside seven professional actors. Avatar Farm was recorded using a new 3-D recording technique that allows the action within a collaborative virtual environment to be logged and then subsequently replayed in a live virtual world so that a viewing audience can explore it repeatedly, following different characters and viewing it from any angle. Figure 2 shows an example in which viewers interact with a bird's-eye map of a virtual world that is projected onto a tabletop in order to select detailed in-world views that are then projected onto a nearby vertical screen. The recordings can also be exported to conventional animation packages and then rendered offline to produce linear television broadcasts.
Our various experiments in inhabited television have led us to explore practically the issues of online performance. Whatever the merits or otherwise of inhabited TV as a future genre for CVEs, these experiments have shed new light on creating fast-paced online action involving participants in interactive dramas and addressing the need for new production and management techniques and toolkits.
Interaction and Presentation Laboratory
Royal Institute of Technology
Kirsten Rassmus-Gröhn and Calle Sjöström
Center for Rehabilitation Engineering Research
Rapid technological development in multimodal interfaces now enables interactions that support vision, hearing, and touch. Although interfaces for single users exist that support the touch modality, most collaborative environments developed today mainly support vision and hearing.
Research has shown that people perceive that, to varying degrees, media can give them a feeling of being socially present with other persons at a remote location. Video-conferences are, for example, generally perceived as more social than e-mail. People also perceive that media, to varying degrees, can mediate a subjective experience of being in a remote place or environment, even when one is physically situated in another. This is referred to as "virtual presence."
The aim of the experimental study presented here was to test how touch feedback in a collaborative 3-D desktop virtual environment will affect perceived social presence, perceived virtual presence, perceived task performance, and task performance.
Twenty-eight test subjects, all students at Lund University, participated in the test. The PHANToM, a one-point, high-resolution haptic device, was used for touch feedback, and a specially developed program provided the virtual environment. Headsets provided audio communication between the users through a telephone connection. The program enables two individuals placed in different locations to simultaneously feel and manipulate dynamic objects in a shared virtual environment. The objects in the environment are modeled to simulate cubes with form, mass, damping, and surface friction (see Figure 1).
In the virtual shared environment the test subjects were instructed to perform five tasks that involved lifting and moving eight cubes in order to produce different constructions. Half the test pairs performed the tasks with touch feedback and the other half performed them without touch feedback. None of the test subjects had previous experience with touch feedback devices.
The subjective measures were obtained through questionnaires. The objective measure of task performance was obtained by measuring the time required to perform the five tasks.
The subjects' mean rating of perceived virtual presence was significantly higher with touch feedback (M=174) than without it (M=142). In this test there was no significant evidence of increased perceived social presence because of touch feedback. Results also showed that touch feedback significantly increased task performance: tasks were completed in less time with touch feedback (M=24 minutes) than without it (M=35 minutes). Also, test subjects perceived themselves as performing tasks significantly better with touch feedback (M=83) than without it (M=71). These results show that adding touch feedback in desktop virtual environments has advantages when the task is to manipulate virtual objects collaboratively.
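To put the reported means on a common footing, the relative change between the two conditions can be computed directly. The sketch below uses only the means quoted above; the helper name `relative_change` and the dictionary labels are illustrative, and the calculation says nothing about statistical significance, which the study established separately.

```python
def relative_change(without_touch: float, with_touch: float) -> float:
    """Change from the no-touch mean to the touch mean, as a fraction of
    the no-touch mean (positive = higher with touch feedback)."""
    if without_touch == 0:
        raise ValueError("baseline mean must be non-zero")
    return (with_touch - without_touch) / without_touch


# Means reported in the text: (without touch feedback, with touch feedback).
reported_means = {
    "perceived virtual presence": (142.0, 174.0),
    "task completion time (min)": (35.0, 24.0),
    "perceived task performance": (71.0, 83.0),
}

for measure, (without, with_touch) in reported_means.items():
    change = relative_change(without, with_touch)
    print(f"{measure}: {change:+.1%}")
```

With these means, perceived virtual presence is roughly 22.5% higher with touch feedback, task completion time is roughly 31.4% lower, and perceived task performance is roughly 16.9% higher.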
Speech and Language Processing
70 Fawcett Street
Cambridge, MA 02138
Fax: (617) 873-2543
Human-Computer Interaction Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213-3891
Interactive Systems Laboratories
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15221
Fax: (412) 268-5578
Although commercial continuous speech dictation systems for personal computers have been available for more than two years, speech recognition errors remain a serious problem in the design and implementation of dictation systems and speech user interfaces. Recent studies illustrate the seriousness of the problem. Vendors of commercial dictation systems claim input rates of 100 or more words per minute, but user studies on commercial dictation systems show that even with extended practice, users achieve at most 30 wpm, if time lost in error correction is included. Initially, text production rates with the dictation system are much slower. Previous research hypothesized that switching modality could speed up interactive correction of recognition errors. To test this hypothesis with a real speech user interface, we implemented multimodal correction in a prototype multimodal dictation system. Our implementation of multimodal correction allows the user to switch from (continuous) speech to handwriting, pen-based gestures, choosing from alternative interpretations, or keyboard input. We maximized accuracy of multimodal correction by developing novel recognition algorithms that use contextual information for recognizing correction input. Figure 1 is a snapshot of the prototype's graphical user interface.
We evaluated multimodal error correction in a user study with 15 participants. The study showed that multimodal error correction is more accurate and faster than unimodal correction by respeaking. Beyond dictation systems, our results suggest that unimodal correction in general is inefficient because correction accuracy decreases in repeated correction attempts, unless modality is switched. In support of this claim, Figure 2 shows correction accuracy in repeated attempts in the same modality, across various modalities.
Among different multimodal correction methods, we observed a range of correction speeds. Which multimodal correction method is most efficient depends on the individual user. For skilled typists, multimodal correction by keyboard and mouse input is faster than correction by speech and pen input.
Our study also examines user choice between modalities. Although users initially prefer speech for error correction, they eventually learn to switch to more efficient modalities, after repeated evidence that certain modalities are inefficient for correction (that is, respeaking and choosing from alternatives). In other words, recognition accuracy ultimately determines a user's choice of modality. But leading users to quickly identify efficient interaction strategies is a difficult design problem. This is especially true for walk-up-and-use applications, for which the designer cannot rely on the rational user's ability to learn which modalities are most efficient for a given task.
Modeling techniques can be applied to extrapolate the results of user studies to future recognition performance. We present a performance model of multimodal interaction that predicts input and correction speed of unimodal and multimodal interaction according to a few standard recognition performance parameters. Applied to the problem of multimodal error correction, model predictions suggest that multimodal correction by speech and pen input could outperform correction by keyboard and mouse input for all users with modest improvements in recognition accuracy. Although it is preliminary, our model is a first step toward formalizing multimodal interaction. Model predictions effectively complement results from the user study and help inform important design decisions in speech user interfaces. More generally, our research demonstrates the power of complementing component-level benchmark evaluations, as is common practice in the field of speech recognition and multimodal interfaces, with both user studies and modeling techniques.
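The abstract does not give the form of the performance model, so the following is only a simplified stand-in that shows how a few recognition parameters can translate into an effective text production rate. It assumes every misrecognized word costs a fixed correction time; the function name, the parameters, and the example values (90% word accuracy, 14 seconds per correction) are hypothetical and chosen purely for illustration, not taken from the study.

```python
def effective_wpm(raw_wpm: float,
                  word_accuracy: float,
                  correction_seconds_per_error: float) -> float:
    """Estimate words per minute once correction time is included.

    Simplified assumption (not the authors' model): each word costs its
    raw input time, plus a fixed correction time if it was misrecognized.
    """
    if raw_wpm <= 0:
        raise ValueError("raw_wpm must be positive")
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    if correction_seconds_per_error < 0:
        raise ValueError("correction time cannot be negative")

    input_seconds_per_word = 60.0 / raw_wpm
    expected_correction_seconds = (
        (1.0 - word_accuracy) * correction_seconds_per_error
    )
    return 60.0 / (input_seconds_per_word + expected_correction_seconds)


# Hypothetical comparison: same raw rate and accuracy, two correction costs.
for label, seconds in (("slow correction", 14.0), ("fast correction", 6.0)):
    rate = effective_wpm(raw_wpm=100.0, word_accuracy=0.9,
                         correction_seconds_per_error=seconds)
    print(f"{label}: {rate:.1f} wpm")
```

Even this crude form shows why a raw input rate of 100 wpm can collapse to a much lower effective rate: with the illustrative values above, the time spent correcting errors outweighs the time spent dictating.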
Implications of this research for dictation systems and other speech user interfaces include the following:
- For most users of dictation systems, the most efficient strategy to generate text is to dictate first, then to correct all recognition errors using keyboard and mouse.
- Any speech user interface benefits greatly from the capability to switch modality for error correction. If speech is the only modality available, such as in telephone user interfaces, switching between different types of speech input can help, for example, from continuous speech to spelled speech.
- Recognition accuracy ultimately drives a user's choice of modality. However, users initially choose inefficient correction modalities.
- Modeling techniques can help evaluate the trade-off between different input modalities, thus helping during the design of novel speech (and multimodal) user interfaces.
1. Benford, S.D., Greenhalgh, C.M., Brown, C.C., Walker, G., Regan, T., Rea, P., Morphett, J., and Wyver, J. Experiments in Inhabited TV. CHI'98 Conference Companion (late breaking results), Los Angeles, April 1998, ACM Press.
2. Benford, S.D., Greenhalgh, C.M., Craven, M., Walker, G., Regan, T., Morphett, J., and Wyver, J. Inhabited Television: Broadcasting interaction from within collaborative virtual environments. ACM Transactions on Computer Human Interaction, special issue on collaborative virtual environments, forthcoming.
Figure 1. State-transition models describing 2-D and 3-D target selection with a stylus pen. The ellipses illustrate 2-D targets, or the bottom of 3-D targets, on the tablet screen surface. The cylinders with dashed lines show the body of a 3-D target. States a, b, and c are for 2-D targets, in which case there is no need to consider the cylinder. The line arrows show the transition between two states, which may be in either direction. The short lines under the pen tip (in b, c) show the pen tip in contact with the screen (the pen is switched on by contact with the screen). State a: pen outside or above the 2-D plane, or outside the 3-D target, pen tip switched off (pen not in contact with the screen); state b: pen tip is outside the 2-D or 3-D target, switched on (pen in contact with the screen); state c: pen tip is inside the 2-D or 3-D target, switched on (pen in contact with the screen); state d: inside the 3-D target but not in contact with the screen and therefore switched off. If we assume for example that state a is an initial state and c is a final state, the state transition route may be either a → b → c or a → c.
Figure 6. Deterioration of correction accuracy in unimodal repair (i.e., repeating in the same modality). Note that in the context of correction after dictation, spoken input was already misrecognized once. Therefore the user's first attempt to correct by respeaking is the second attempt using the speech modality; hence, the low accuracy compared with all other modalities.
©2001 ACM 1072-5220/01/0100 $5.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.