Some research has focused on developing datasets and NLP techniques for video understanding and accessibility [42,47,48,115,131,138]. Other work has built AI-based tools to support accessibility practices [6,11,66,93,110,126,134,135]. Major advances in multi-modal language models such as OpenAI's GPT-4V [90,91] and Google's Gemini [24,116] show that AI can already generate image descriptions, and some video descriptions, that attain high quality [135] and BLV user satisfaction [23,110].