Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems
DOI: 10.1145/3411763.3451810
Automated Video Description for Blind and Low Vision Users

Cited by 7 publications (6 citation statements) · References 20 publications
“…• Level of Detail: Prior work has investigated the potential for visual question answering systems to enable users to query for the details they wish to know [6,58,110]. As AI advances, it may one day be possible to give end users a high degree of flexibility over which details, and what level of detail, automatically generated descriptions provide.…”
Section: Applying Generative AI To Explore the Video Accessibility De... (mentioning)
confidence: 99%
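As a concrete illustration of the query-for-details interaction described in that statement, the sketch below runs an off-the-shelf visual question answering model through the Hugging Face transformers pipeline. The model checkpoint, the frame path, and the questions are all illustrative assumptions, not the systems evaluated in the cited work [6,58,110].

# A minimal sketch of querying a video frame for details, assuming an
# off-the-shelf VQA model; not the method of any paper cited above.
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

# A user could ask follow-up questions about a single extracted frame,
# choosing which details (and how much detail) they want.
frame = "frame_0042.png"  # assumed path to a frame extracted from the video
for question in ["How many people are in the scene?",
                 "What is the person on the left wearing?"]:
    answers = vqa(image=frame, question=question, top_k=1)
    print(question, "->", answers[0]["answer"])

Each call returns a ranked list of candidate answers with scores; a description tool could surface the top answer or let the user drill down further.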
“…Other research has shown that BLV people wish to interact and engage with video content in ways beyond listening to preset, neutral descriptions during the video itself. For example, Stangl et al. [110] and Bodi et al. [6] investigated the viability of providing video access through interactive visual question answering, reinforcing the importance of giving BLV users agency in the process of making videos accessible. Others explored the impact of changing the tone or style of verbal descriptions for select video types, finding that alternative AD styles were engaging for BLV users [30,59,121,125].…”
Section: Video Accessibility (mentioning)
confidence: 99%
“…Automated approaches to video description, currently dominated by deep learning, are usually divided into two stages: (1) visual content extraction, the encoding stage, and (2) text generation, the decoding stage. For encoding, convolutional neural networks (CNNs) [71] are used to learn visual features; for decoding, variants of recurrent neural networks (RNNs), such as long short-term memory (LSTM) [37] and gated recurrent unit (GRU) [19] networks, are used for language modeling and text generation [14,40]. Recent state-of-the-art methods [47,69] have replaced the RNNs with BERT [25], owing to the success of Transformers [72].…”
Section: Toward Automated Video Description (mentioning)
confidence: 99%
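To make that two-stage pipeline concrete, below is a minimal PyTorch sketch pairing a CNN encoder with an LSTM decoder. Every architectural choice here (the tiny convolutional stack, temporal average pooling, the dimensions) is an illustrative assumption rather than the method of any cited paper; real systems use pretrained backbones and, as the statement notes, increasingly Transformer-based decoders.

# A minimal sketch of the encoder-decoder video description pipeline,
# under assumed (illustrative) architecture and dimension choices.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=512, hidden_dim=512, embed_dim=256):
        super().__init__()
        # Stage 1 (encoding): a small CNN stands in for the pretrained
        # backbone that extracts per-frame visual features.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Stage 2 (decoding): an LSTM language model conditioned on the
        # video features generates the description token by token.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, tokens):
        # frames: (batch, num_frames, 3, H, W); tokens: (batch, seq_len)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        video_feat = feats.mean(dim=1)       # temporal average pooling
        emb = self.embed(tokens)             # (batch, seq_len, embed_dim)
        ctx = video_feat.unsqueeze(1).expand(-1, emb.size(1), -1)
        hidden, _ = self.lstm(torch.cat([emb, ctx], dim=-1))
        return self.out(hidden)              # logits over the vocabulary

model = VideoCaptioner(vocab_size=1000)
frames = torch.randn(2, 8, 3, 64, 64)        # 2 clips, 8 frames each
tokens = torch.randint(0, 1000, (2, 12))     # shifted caption tokens
logits = model(frames, tokens)
print(logits.shape)                          # torch.Size([2, 12, 1000])

The snippet prints torch.Size([2, 12, 1000]): one distribution over the vocabulary per output position, from which a description is decoded token by token (greedily or with beam search). Replacing the LSTM with a Transformer decoder, as the cited state-of-the-art methods do, changes only the second stage.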