Oana Inel, Lora Aroyo
The popularity of video on the Web is reaching an unimaginable scale: a person would need more than 5 million years to watch all videos available online up to 2020. This scale poses an immense challenge to the discovery of relevant videos. Crucial for this are proper video categorization and textual descriptions such as tags, labels, and content annotations, which not only help with automated discovery but also help users assess relevance.
Many state-of-the-art approaches have been developed to automate the process of video understanding using textual components such as subtitles, titles, descriptions, or the video stream itself, by detecting and extracting noteworthy concepts such as people, organizations, places, objects, and events. These efforts resulted in numerous off-the-shelf tools—Clarifai, Google Cloud Video Intelligence, Amazon Rekognition, Microsoft Azure Computer Vision, and IBM Video Content Analysis, among others—thus streamlining the processing for a broad audience.
→ Video is the fastest-growing data modality on the Web, but retrieving relevant videos is still an open challenge.
→ Diversity of human perspective is central for understanding complex event semantics.
→ Replication of human annotation experiments gives more informative insights on data quality than the traditional inter-rater reliability metric.
Although these tools increasingly excel at accurately extracting entities from text, audio, and video streams, there is still a big semantic gap between the human's and the machine's perception of the role, importance, and overall relevance of these concepts within the context of the whole video, and ultimately to the user's needs. Videos often have a storyline, refer to different timelines, and use explicit and implicit cross-references between concepts in the visual and audio streams that need to be understood to serve user needs adequately.
The relevance of machine-extracted concepts from a user perspective is still approximated mainly with machine confidence or probability, since such concepts are rarely assigned relevance scores. Moreover, the critical task of matching videos to users' interests, preferences, and goals is still far from efficient enough to be streamlined. Current tools for understanding and utilizing users' preferences in terms of relevance are primarily behavior based (e.g., watching, sharing, or commenting on a video; clicking like or dislike) and popularity based (e.g., highly liked or shared videos). Still, such behavior-driven understanding is not sufficient to gain insights into which actual content in the video triggers people's attention.
To understand what captures people's attention (i.e., what they find relevant), we focused on understanding the content of videos better. In information science, the concept of relevance is most connected to end users' judgments and is considered fundamental as a subjective, dynamic, user-centric perception. People might have or use different relevance standards or criteria when performing the task of video searching. Textual and visual criteria are essential for identifying relevant video content, but people may rely just as much on subjective, implicit criteria such as interest or familiarity. Typically, people tend to build bridges to concepts or perspectives that are not necessarily shown in the video, but that might be expressed or referred to.
We carried out a number of studies with news videos and broadcasts. In our initial study, we took a digital hermeneutics approach to understand which video aspects capture the attention of digital humanities scholars and drive the creation of narratives, or short audiovisual stories.
In subsequent studies, we focused on understanding the utility of machine-extracted video concepts and how people can teach machines about video-concept relevance. We harnessed the intrinsic subjectivity of concept relevance to capture the diverse range of video concepts found relevant through the eyes of our participants. We explored to what extent current information-extraction systems meet users' goals, and what novel aspects users bring to video-concept-relevance assessment. We performed two types of crowdsourcing studies. The selection study (Figure 1) focused on understanding the utility of machine-extracted video concepts from video subtitles and video streams, while the free input study (Figure 2) focused on understanding the complementarity between machine and human concepts in terms of relevance. By studying the gap between machines and humans in terms of perceived video-concept relevance, we gained insights into how machines can collaborate with users to better support their needs and preferences.
Figure 1. User interface of the crowdsourcing task where crowd annotators select relevant concepts in videos.
Figure 2. User interface of the crowdsourcing task where crowd annotators provide relevant concepts in videos.
Our studies revealed that events, locations, people, organizations, and general concepts of any type are fundamental elements for content exploration and understanding. These are concepts that are most commonly extracted by machines and, as such, used in machine summarization of content and information search. Likewise, people engaging with online videos most often provide events, people, locations, and organizations as relevant concepts. Concepts of other types are also found relevant, but to a lesser extent. These concept types are thus fundamental for contextualizing the content of the videos, and also sufficient to capture human interest in terms of relevance.
Concepts extracted from video subtitles and the video stream typically are assigned a type (e.g., event, person, organization, location, or other), but their relevance is very rarely computed or retrieved by automated methods. The closest values that automated methods provide are salience scores for concepts extracted from the video subtitles and confidence scores for concepts extracted from the video stream. But these values are only tangentially related to relevance. Salience scores refer to the centrality of the concept in the text, while confidence scores refer to the accuracy of the detection.
Deeper analysis of salience and confidence scores showed that machine confidence is not synonymous with user-perceived relevance. Machines identify and extract concepts with various degrees of confidence and salience values. To some degree, concepts of the types event and people that are extracted with higher salience and, respectively, with higher confidence scores tend to be found more relevant by humans. However, the overall scarcity of events and their very low salience values make the salience score an unreliable proxy to distinguish between relevant and not relevant video concepts (Table 1). Our experiments also show that the absolute differences between machine confidence scores and human-perceived relevance are consistently large, even for concepts of type people. We clearly see that humans constantly disagree with machine confidence and salience, and that these scores cannot reliably estimate human-perceived concept relevance.
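The comparison described above can be illustrated with a small sketch. It assumes, hypothetically, that human-perceived relevance of a concept is the fraction of annotators who selected it, and that machine scores are normalized to the range 0 to 1; the article does not specify the exact computation, and all names and data below are made up for illustration.

```python
from collections import Counter


def relevance_scores(selections, num_annotators):
    """Fraction of annotators who selected each concept as relevant.

    `selections` holds one collection of concepts per annotator
    (an assumed input shape, not the study's actual data format).
    """
    counts = Counter(c for chosen in selections for c in set(chosen))
    return {c: n / num_annotators for c, n in counts.items()}


def score_gaps(machine_scores, human_relevance):
    """Absolute difference between a machine score and human relevance.

    Concepts that no annotator selected get a human relevance of 0.0.
    """
    return {
        concept: abs(score - human_relevance.get(concept, 0.0))
        for concept, score in machine_scores.items()
    }


# Hypothetical selections from four annotators for one video.
selections = [
    {"landing", "comet"},
    {"comet"},
    {"comet", "interview"},
    {"landing", "comet"},
]
# Hypothetical machine confidence scores for the same video.
machine_confidence = {
    "comet": 0.5,
    "landing": 0.75,
    "interview": 0.25,
    "person": 1.0,
}

human = relevance_scores(selections, num_annotators=len(selections))
gaps = score_gaps(machine_confidence, human)

# List concepts with the largest machine-human gap first.
for concept, gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(f"{concept}: gap={gap:.2f}")
```

In this toy data, a concept detected with full confidence ("person") is selected by no annotator, which is the kind of mismatch the analysis above describes.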
We observed a few reasons why machine confidence and salience scores cannot properly estimate the relevance of concepts, which we mainly link to data-centric aspects. Many computer vision and natural language processing datasets are more concerned with reliably detecting and identifying existing concepts, while the reliability of a dataset is most often computed in terms of agreement. We know, however, that relevance is a subjective notion, and that trying to make people agree on a small subset of relevant concepts would not be feasible. We hypothesize, thus, that such datasets do not incorporate or consider human-perceived relevance. The misalignment we identified could potentially come from the way the instructions were framed, the way the data was aggregated, how the reliability of the annotated concepts was computed, or simply how the model used the annotated concepts during training. The training datasets we refer to, however, are typically not accessible, which hinders further analysis.
Video subtitles were typically a rich source for named entities (e.g., Apollo 13) as well as for common entities (e.g., landing), thus providing both specific and generic concepts that are mentioned in a video. Concepts extracted from the video stream, on the other hand, were more likely to refer to common entities and be more general (e.g., interview, event, person). Our analysis shows that participants perceived as relevant concepts extracted from both the video subtitles and the video stream (Figure 3). Since there are very few concepts that appear in both the video subtitles and the video stream, this shows there is a complementarity between what is spoken and what is shown in a video. While this was observed for all concept types, the events extracted from the video stream were less likely to be selected as relevant, which indicates that video intelligence tools do not yet understand events well. This is not surprising, however, given how difficult and vague events are, and how much both experts and crowd annotators disagree when dealing with events.
We also observed that the concepts provided by our human annotators as relevant were unlikely to be found among concepts extracted by machines. Study participants continuously contributed new relevant concepts, which were also discovered in significant proportions in the semantic concept space, such as synonyms, related phrases, and similar concepts. The concept types least often discovered through semantic expansion are people and organizations. Our human-machine gap analysis showed that humans typically infer certain concepts by analyzing the video content, but these concepts are not necessarily present in the video. For example, in a video about the "Rosetta comet landing," humans mention the European Space Agency, and in a video about the TV show Comedians in Cars Getting Coffee, the crowd provides the concept "Netflix," which is the producer of the show. Similarly, in a video about a movie release, people provide the names of actors who appear in the footage but are never named. What we observed is that many of these concepts are named entities, or proper nouns. In such cases, the semantic expansion is less likely to help uncover concepts of type people and organization, just as our analysis indicates.
The new concepts provided by the study participants were typically found in the long tail; not many people mention them as relevant. Among all concept types, events account for the most subjective and diverse human interpretations, as a given event is often provided as relevant by very few people or just one. The given events can be inferred from the shown footage (ramble, interview) or appear as overlay text in the video (climate change, drought). Even with the semantic expansion, events are the most challenging concept type for video understanding. For video analysis tools to better serve the needs of users, they should better capture the human-perceived semantic space of relevant video concepts, in particular those of events.
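To make the notion of the long tail concrete, the following minimal sketch counts how many annotators mention each free-input concept and keeps those mentioned by at most a given number of people. The threshold, the lowercase normalization, and the sample data are assumptions chosen for illustration, not the procedure used in the studies.

```python
from collections import Counter


def long_tail(free_input, max_mentions=1):
    """Concepts mentioned by at most `max_mentions` annotators.

    `free_input` holds one list of concept strings per annotator.
    Concepts are normalized (trimmed, lowercased) before counting, and
    each annotator counts at most once per concept.
    """
    counts = Counter()
    for concepts in free_input:
        counts.update({c.strip().lower() for c in concepts})
    return sorted(c for c, n in counts.items() if n <= max_mentions)


# Hypothetical free-input answers from four annotators.
annotators = [
    ["Netflix", "interview"],
    ["interview", "drought"],
    ["interview", "climate change"],
    ["interview", "netflix "],
]

tail = long_tail(annotators, max_mentions=1)
print(tail)
```

Raising `max_mentions` widens the tail; in practice, the cutoff would depend on the number of annotators per video.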
Our extensive analysis showed that commonly used metrics for reliability of a collected dataset, such as inter-rater reliability metrics, are inadequate when dealing with subjective annotations. The mere fact that humans disagree on the relevance of video concepts is not synonymous with a poorly collected dataset. We showed this by collecting human relevance judgments from highly divergent pools of annotations, at three points in time, at least three months apart. These repetitions showed that concept relevance across repetitions is strongly and positively correlated. Furthermore, these repetitions also suggested that humans' preferences in terms of relevance change over time, and additional, novel concepts are found relevant at any moment in time. This implies that tools relying on user preferences need those preferences to be collected and updated continuously if they are to capture the most meaningful concept space.
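A replication check of this kind can be sketched as follows. The example assumes that each repetition yields a per-concept relevance score between 0 and 1 and uses the Pearson correlation coefficient; the article does not state which correlation measure was used, so both the measure and the data are illustrative assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def compare_repetitions(run_a, run_b):
    """Correlate relevance scores of two repetitions and list new concepts.

    A concept missing from one repetition is treated as having
    relevance 0.0 there (an assumption made for this sketch).
    """
    concepts = sorted(set(run_a) | set(run_b))
    xs = [run_a.get(c, 0.0) for c in concepts]
    ys = [run_b.get(c, 0.0) for c in concepts]
    new_in_b = sorted(set(run_b) - set(run_a))
    return pearson(xs, ys), new_in_b


# Hypothetical relevance scores from two repetitions of the same task.
run_1 = {"comet": 1.0, "landing": 0.5, "interview": 0.25}
run_2 = {"comet": 0.9, "landing": 0.6, "interview": 0.2, "esa": 0.1}

r, new_concepts = compare_repetitions(run_1, run_2)
print(f"correlation={r:.3f}, new concepts={new_concepts}")
```

A high correlation together with a non-empty list of new concepts mirrors the two observations above: relevance is stable across repetitions, yet each repetition still surfaces concepts not seen before.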
In general, people are more likely to agree on concepts that are not found relevant, and in particular on events. A manual inspection of the events extracted by video analysis tools showed that many event concepts are too general, and appeared across videos (e.g., indoor games and sports, games, hair color), while others are completely wrong, potentially due to contextual information—for example, concepts like "protest," "riot," "sports," and "world rally championship" in a video about the Batman movie premiere, where many people had gathered to see the actors and a race car appeared on the red carpet.
An essential outcome of our study is that human-perceived video concept relevance cannot be captured through machine salience and confidence scores. People tend to agree substantially more on concepts that are not relevant, while potentially relevant concepts are hardly ever picked as relevant by all annotators in our experiments, emphasizing the inherent subjectivity of relevance assessment. Even after collecting the preferences of more than 50 annotators, we were still acquiring new relevant concepts. This implies that user preferences in terms of relevance need to be collected and updated over time for tools that utilize user preferences in terms of relevance to capture the most meaningful space. Furthermore, to account for more unique preferences, the focus should be on the long tail—the concepts perceived as relevant by fewer users.
1. Inel, O., Tintarev, N., and Aroyo, L. Eliciting user preferences for personalized explanations for video summaries. Proc. of the 28th ACM Conference on User Modeling, Adaptation and Personalization. ACM, New York, 2020, 98–106.
4. Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., and Aroyo, L. Everyone wants to do the model work, not the data work: Data cascades in high-stakes AI. Proc. of the 2021 CHI Conference on Human Factors in Computing Systems. ACM, New York, 2021, 1–15.
5. Inel, O. and Aroyo, L. Validation methodology for expert-annotated datasets: Event annotation case study. Proc. of 2nd Conference on Language, Data and Knowledge. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
Oana Inel is a postdoctoral researcher in the Dynamic and Distributed Information Systems group at the University of Zurich. Previously, she was a postdoctoral researcher in the Web Information Systems group at Delft University of Technology, working on the responsible data science track in the Digital Society research program. She obtained her Ph.D. at Vrije Universiteit (VU) Amsterdam, the Netherlands.
Lora Aroyo is a research scientist at Google Research's New York office currently working on human-labeled data quality. She is a four-time winner of the IBM Faculty Award for her work on CrowdTruth used in adapting the IBM Watson system to the medical domain and in capturing ambiguity in understanding misinformation. Prior to joining Google, she was a full professor in computer science at VU Amsterdam and was chief scientist at the New York-based start-up Tagasauris.
Copyright 2022 held by owners/authors
The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.