XXI.3 May-June 2014
Page: 50

Real-time captioning with the crowd

Jeffrey Bigham, Walter Lasecki


Automatically providing access to information for people with disabilities when and where they need it requires solving some of the most difficult problems in computing. A characteristic example that we have been looking at is the real-time conversion of speech to text. Real-time captioning allows deaf and hard of hearing (DHH) people to access the aural speech around them at interactive speeds (less than five seconds from spoken word to text caption). This is a vital accommodation in classrooms where deaf people need access to mainstream education.


Despite tremendous progress, automatic speech recognition (ASR) still cannot be used in classrooms, a setting that is particularly unforgiving of errors that could confuse students. ASR stumbles on the unknown speakers, fast pace, and open domains that are common in lecture material. Training speech recognizers in advance is impractical for every speaker and acoustic environment, so the recognizers perform especially poorly when students decide at the last minute that they would like to attend a lecture (a privilege routinely exercised by hearing students). Even a 5 percent error rate isn't good enough: imagine learning computer science with one in every 20 words in a lecture replaced with nonsense!



As a result, the only providers of real-time captioning that satisfy legal requirements are expensive professionals who have trained for years to be able to type at natural speaking rates. These real-time stenographers are amazing at what they do. Through a combination of training, skill, and technology (including special keyboards and software), they are able to type at rates exceeding 200 words per minute. We recently got to see one of the fastest stenographers in the world caption sessions at the ACM Conference on Computers and Accessibility (ASSETS). He is capable of typing at speeds up to 350 words per minute while adjusting to the speaker's accent, speed, volume, background noises, and new words.

Since real-time stenography is difficult and there aren't many people who can do it well, stenographers rightfully command high wages and favorable scheduling terms. In a survey we conducted, the hourly rate for a stenographer ranged from $120/hour to $300/hour. Perhaps worse, they must be scheduled days in advance, and for an hour or more at a time.

We have been developing a system called Scribe, which democratizes the ability to provide this service by allowing anyone who can hear and type to help. The key insight is that we don't have just one person do it (it's too difficult!), but rather allow multiple non-experts to work together to perform this task as well as, or even better than, highly trained professionals. This system lowers the barriers to entry faced by would-be captionists, and allows volunteers or untrained workers to immediately begin helping to provide support for DHH users. Even four work-study students, each paid $15/hour, cost half as much as the least costly professional stenographer.

Scribe is a crowd-powered system that flexibly handles people coming and going as they contribute to the task. Our vision of crowdsourcing is greater than one of cheap labor—it involves people coming together to do something that no one individual could do alone.

The challenge then is one of coordinating the work. As is often the case in real-time crowd-powered systems, Scribe uses a careful combination of people and supporting automation. People have superior common sense and perception, but computers are fast. The trick is keeping the speed of computers while leveraging the intelligence of the human workers.

Scribe

We introduced Scribe with the goal of improving the state of real-time captioning in terms of quality, availability, and price. To do this, we lowered the barrier to entry for new captionists who want to help provide this accommodation. Unlike prior approaches, Scribe leverages groups of non-expert workers to reliably generate real-time captioning collectively, instead of requiring one highly skilled individual (Figure 1) [1].

The average untrained person is able to caption between roughly one-fifth and one-third of the words spoken by a typical speaker. This means they generally are not able to capture enough information for a user to have a good sense of what is being said. However, at those rates, it is possible that three to five workers would collectively be capable of capturing everything a speaker says.
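
The arithmetic behind that estimate can be sketched in a few lines. The function names and both models are assumptions made here for illustration; the article does not say how worker coverage overlaps. The best case assumes workers never duplicate one another, while the independent case assumes each worker catches any given word with the same probability regardless of what the others typed.

```python
def best_case_coverage(per_worker_rate: float, num_workers: int) -> float:
    """Upper bound: workers capture disjoint words, so their rates simply add up."""
    return min(1.0, per_worker_rate * num_workers)


def independent_coverage(per_worker_rate: float, num_workers: int) -> float:
    """Assumed model: each worker catches each word independently,
    so a word is missed only if every worker misses it."""
    return 1.0 - (1.0 - per_worker_rate) ** num_workers


# Rates from the text: one-fifth to one-third of the words, three to five workers.
for rate in (1 / 5, 1 / 3):
    for workers in (3, 5):
        best = best_case_coverage(rate, workers)
        indep = independent_coverage(rate, workers)
        print(f"rate={rate:.2f} workers={workers} best={best:.2f} independent={indep:.2f}")
```

Under the independence assumption the group still misses some words, which suggests that full coverage depends on spreading workers across different parts of the speech rather than on adding workers alone.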

But generating three to five separate partial captions, each streaming to and being displayed for the user in real time, makes for an output that is very difficult to read. Figure 2 shows an example of what the output from five workers would look like. While the final sentence, "We have a crystal that has a two-fold axis," can be pieced together, it takes much more time than just reading a line of text, and it becomes even more challenging when the content is streaming at the speed of speech. Not only do people fall behind while reading, but they also have to spend a significant amount of their attention on piecing sentences back together instead of following the content of the speech. Worse, there are often cases in which the words appear out of order, and piecing together the true meaning is even more difficult.

Scribe allows users to avoid this mess by synthesizing a single output from the multiple captions, intelligently recombining workers' input on the fly. Once these captions are combined, they can be forwarded back to the user and presented just as any other captions would be. For this entire process to take less than five seconds, very quick methods for recombining workers' answers are required. As we will see here, this is a complex problem, but approximations can be used that allow us to merge workers' input in under a second, keeping the total turnaround time to under five seconds.

Captioning with Groups

Combining the partial captions created by dynamic and highly varied groups of workers presents a number of challenges. Workers each type at different speeds and capture different content, so timestamps do little to clarify the exact order of two workers' input relative to one another. Language is often ambiguous, and partial captions can be interleaved in multiple plausible but incorrect ways. This is made even harder by the fact that spoken language is often not grammatically correct, so the resulting text cannot be assumed to follow any formal set of rules.

To tackle the combination problem, we borrow a page from computational biology. We use a multiple sequence alignment (MSA) algorithm, a technique often used to align DNA sequences, to determine the correct interleaving of the captions, and then optimally recombine them into a single result. By first finding the best alignment of words, we are not only able to determine the order in which to join them together, but we can also find and remove errors by comparing the versions of a word typed by different workers [2]. By using an A*-Search-based approach, we are even able to find solutions within specified optimality bounds and time constraints (Figure 3) [3].
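
The idea of aligning first and then voting can be illustrated with a much-simplified stand-in. The sketch below is not the A*-based MSA of [3]: it folds captions in one at a time using a pairwise, longest-common-subsequence style alignment, treats near-identical spellings as the same word, and keeps the most frequent spelling in each aligned column. The function names, the 0.75 similarity threshold, and the sample captions are all assumptions invented for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.75) -> bool:
    """Treat two words as the same token if they are near-identical (typo tolerance)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def merge(columns: list[Counter], caption: list[str]) -> list[Counter]:
    """Align one caption against the running columns (matches and gaps only),
    then fold its words into matching columns or insert new ones."""
    n, m = len(columns), len(caption)
    top = [col.most_common(1)[0][0] for col in columns]
    # score[i][j] = most matches achievable aligning columns[i:] with caption[j:]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if similar(top[i], caption[j]):
                score[i][j] = score[i + 1][j + 1] + 1
            else:
                score[i][j] = max(score[i + 1][j], score[i][j + 1])
    merged: list[Counter] = []
    i = j = 0
    while i < n and j < m:
        if similar(top[i], caption[j]):
            col = columns[i].copy()
            col[caption[j]] += 1  # record this worker's spelling as a vote
            merged.append(col)
            i += 1
            j += 1
        elif score[i + 1][j] >= score[i][j + 1]:
            merged.append(columns[i])  # word only seen by earlier workers
            i += 1
        else:
            merged.append(Counter([caption[j]]))  # word only this worker typed
            j += 1
    merged.extend(columns[i:])
    merged.extend(Counter([word]) for word in caption[j:])
    return merged


def combine(captions: list[str]) -> str:
    """Fold captions in one by one, then keep the majority spelling per column."""
    columns: list[Counter] = []
    for text in captions:
        columns = merge(columns, text.lower().split())
    return " ".join(col.most_common(1)[0][0] for col in columns)


# Invented partial captions, including one typo ("cristal").
partials = [
    "we have a crystal",
    "a crystal that has a",
    "that has a two-fold axis",
    "we have a cristal that",
]
print(combine(partials))
```

This greedy version is sensitive to the order in which captions are folded in, and a repeated word such as "a" can anchor the alignment in the wrong place. Scoring all sequences jointly, as an MSA does, is one way to avoid that weakness.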

Improving Collective Output

Our fundamental insight with Scribe is that by working together, workers can make the task easier for one another. This same idea can be extended to more aspects of the captioning process to get even better results, meaning workers can transcend simple capability to actually be proficient at the task. We highlight two examples of workflows that lead to this kind of improvement.

First, we automatically assign worker roles. The amount of content a worker hears can quickly become too much to type. Furthermore, words in speech vary naturally in saliency, so workers are more likely to caption very salient portions, such as the first word of a sentence after a pause, and less likely to caption a word from the middle of a long sentence.

In Scribe, each worker listens to roughly three seconds of audio and then has a break (typically nine to 15 seconds) that allows them to finish typing what they heard before they are asked to type another segment. We give workers visual and auditory cues for when they should be typing; to allow workers to maintain context, we lower the volume of the parts of the audio for which they are not responsible, but do not mute it entirely. By having at least one worker typing at any given point in time, the stream of captions remains unbroken.
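
One simple way to picture such a rotation is a round-robin schedule. The sketch below is an assumption made for illustration, not a description of how Scribe actually assigns segments: workers take consecutive, non-overlapping segments in turn, and the function names and default segment length are hypothetical.

```python
def typing_worker(t: float, num_workers: int, segment_s: float = 3.0) -> int:
    """Index of the worker responsible for the audio at time t (in seconds)."""
    return int(t // segment_s) % num_workers


def break_length(num_workers: int, segment_s: float = 3.0) -> float:
    """Seconds a worker waits between the end of one of their segments and the next."""
    return (num_workers - 1) * segment_s


# With 3-second segments, four workers rest 9 s and six workers rest 15 s.
for workers in (4, 5, 6):
    print(workers, "workers ->", break_length(workers), "s break")

# Someone is always responsible: every second of audio maps to exactly one worker.
print([typing_worker(t, 4) for t in range(0, 15, 3)])
```

Under this assumed scheme, three-second segments shared among four to six workers give breaks of 9 to 15 seconds, which is consistent with the range quoted above.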

The second approach we use is adjusting playback speed. Even when workers are responsible only for limited segments of the audio, the speed of the content in those segments keeps the task challenging. When captioning non-real-time content, captionists often slow down the audio so that they can keep up with what they hear. This isn't possible in the real-time setting, though, since the captions would immediately begin to fall behind. However, because there are multiple workers, we can slow down the part of the audio that a worker is responsible for, while slightly speeding up the part they need only to listen to (Figure 4). This is not detrimental because people can listen faster than they can type.
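
The trade-off can be made concrete with a small calculation. The sketch below assumes a repeating cycle in which a worker types one segment and only listens to the rest, and that the worker must finish each cycle in the same wall-clock time as the live audio. The function name, parameters, and example numbers are illustrative assumptions, not values reported for Scribe.

```python
def listen_rate(period_s: float, segment_s: float, type_rate: float) -> float:
    """Playback rate for the listen-only audio so the worker stays in sync.

    period_s:  seconds of audio in one full cycle (typed segment plus the rest)
    segment_s: seconds of audio the worker must type
    type_rate: playback rate during the typed segment (below 1 slows it down)
    """
    slowed_wall_time = segment_s / type_rate
    remaining_wall_time = period_s - slowed_wall_time
    if remaining_wall_time <= 0:
        raise ValueError("segment slowed too much to catch up within one cycle")
    return (period_s - segment_s) / remaining_wall_time


# Example: a 12-second cycle with a 3-second segment played at half speed.
# The segment takes 6 s of wall time, leaving 6 s to hear the other 9 s of audio.
print(listen_rate(period_s=12.0, segment_s=3.0, type_rate=0.5))
```

In this example the listen-only audio would play at 1.5 times normal speed, and with no slowdown (a rate of 1.0) it stays at normal speed.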

Reducing the playback speed allows workers to caption more accurately, capture more of the words they hear, and, perhaps most interestingly, complete their task faster than they would if the audio were played at full speed [4]. This latency improvement is due to workers changing how they caption. Instead of listening to the whole segment of audio and then beginning to type, they have time to type each word as it is said.

Both of these workflow adjustments improved worker performance in terms of coverage (how many spoken words they captioned) and precision (how many typed words were correct), and both reduced the stress workers felt when captioning. Assigning workers roles allows them to reasonably type everything they hear in most segments, making them feel as if their task was accomplished successfully. Slowing down audio playback lets workers listen to the audio and type at a more relaxed pace instead of having to work in frantic bursts.
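
The two measures can be computed as follows. This is a minimal sketch that assumes a bag-of-words comparison, ignoring word order and counting a typed word as correct only if it appears in the spoken text; the cited studies may score captions differently, and the function name and sample strings are invented.

```python
from collections import Counter


def coverage_and_precision(spoken: str, typed: str) -> tuple[float, float]:
    """Coverage: share of spoken words that were typed.
    Precision: share of typed words that were actually spoken."""
    spoken_words = Counter(spoken.lower().split())
    typed_words = Counter(typed.lower().split())
    # Multiset intersection counts each word at most as often as it occurs in both.
    matched = sum((spoken_words & typed_words).values())
    coverage = matched / sum(spoken_words.values()) if spoken_words else 0.0
    precision = matched / sum(typed_words.values()) if typed_words else 0.0
    return coverage, precision


spoken = "we have a crystal that has a two-fold axis"
typed = "we have a cristal that"  # one typo, and the end of the sentence is missing
print(coverage_and_precision(spoken, typed))
```

Here four of the nine spoken words are captured, and four of the five typed words are correct.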

Leveraging the Crowd and Democratizing Assistance

Since Scribe can support even error-prone workers, it is also able to take advantage of the crowd—large groups of remote workers accessed via marketplaces like Amazon Mechanical Turk. These workers frequently come and go, vary significantly in their abilities, and cannot always be relied upon to provide a helpful or correct answer, but are available on demand, allowing Scribe to be run anytime, anywhere (even on platforms such as Google Glass), without the need for advance scheduling.


More broadly, by using enough workers to have overlapping coverage, the system can compensate for errors and missing content. This allows anyone to contribute to the captioning task immediately, where they previously would have had to train as a professional captionist. Untrained workers can now provide an affordable service while still earning livable wages, and even family, friends, co-workers, fellow students, and others can begin to contribute to providing access to users. By using human intelligence in a way that allows anyone, not just experts, to contribute to the operation of access technology, a reliable, on-demand solution can be provided for low cost—supporting user independence in ways not previously possible.

Conclusion

The future envisioned by our research is one in which robust intelligent access technology is brought about for the benefit of people in their everyday lives. Our group has found particular promise in building these systems today by leveraging the on-demand labor of people to fill in for components that we cannot currently automate. These hybrid systems allow groups to do together what even expert individuals cannot do alone. They can also provide the scaffolding necessary to train fully automated approaches to solve the difficult computational problems that these systems address, which will help to further improve cost, scalability, and performance over time.

Acknowledgments

This work is funded by the National Science Foundation, Google, and a Microsoft Research Ph.D. Fellowship.

References

1. Lasecki, W.S., Miller, C.D., Sadilek, A., Abumoussa, A., Borrello, D., Kushalnagar, R., and Bigham, J.P. Real-time captioning by groups of non-experts. Proc. of ACM Symposium on User Interface Software and Technology. ACM, New York, 2012.

2. Lasecki, W.S. and Bigham, J.P. Online quality control for real-time crowd captioning. Proc. of ACM SIGACCESS Conference on Computers and Accessibility. ACM, New York, 2012.

3. Naim, I., Gildea, D., Lasecki, W.S., and Bigham, J.P. Text alignment for real-time crowd captioning. Proc. of North American Chapter of the Association for Computational Linguistics Conference. 2013.

4. Lasecki, W.S., Miller, C.D., and Bigham, J.P. Warping time for more effective real-time crowdsourcing. Proc. of ACM Conference on Human Factors in Computing Systems. ACM, New York, 2013.

Authors

Walter S. Lasecki is a computer science Ph.D. candidate at the University of Rochester and a visiting Ph.D. researcher at Carnegie Mellon University. He creates interactive intelligent systems that are robust enough to be used in real-world settings by combining both human and machine intelligence to exceed the capabilities of either.

Jeffrey P. Bigham is an associate professor in the Human-Computer Interaction Institute at Carnegie Mellon University. His work is at the intersection of human-computer interaction, human computation, and artificial intelligence, with a focus on developing innovative technology that serves people with disabilities in their everyday lives.

Figures

Figure 1. The Legion Scribe system. Audio is captured from a user's mobile device or laptop and streamed to a group of non-expert workers, who each caption what they can from the audio. Then these partial captions are combined into a single caption and sent back to the user.

Figure 2. An example of the type of raw input collected from automatic speech recognition (ASR), as well as five non-expert human captionists. The captionists collectively capture all of the words, but combining all their input on the fly is burdensome for users, so Scribe merges workers' partial captions into a single final stream that can be shown to users.

Figure 3. Using multiple sequence alignment (MSA), we are able to optimally align input from multiple workers, then combine it into a single caption that is easy for the user to read.

Figure 4. By decreasing the audio playback speed for sections workers are responsible for typing, workers can caption more accurately, completely, and quickly.


Copyright Held by Authors. Publication Rights Licensed to ACM.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2014 ACM, Inc.
