Captioning Images Taken by People Who Are Blind

Gurari, Danna; Zhao, Yinan; Zhang, Meng; Bhattacharya, Nilavra

doi:10.1007/978-3-030-58520-4_25

Cited by 121 publications

(72 citation statements)

References 47 publications

“…To recognize icons, previous work [50,76,77] trained image classification models from UI design datasets [27]. To describe content in pictures, prior work used deep learning models to generate natural language descriptions of images [44,46], and some accessibility improvement research has also leveraged crowdsourcing to generate image captions [35,39,40]. We use an existing Icon Recognition engine and Image Descriptions feature in iOS [4] to generate alternative text for detected icons and pictures, respectively.…”

Section: Understanding UI Semantics (mentioning)

confidence: 99%

Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels

Zhang, Greef, Swearngin et al. 2021

Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems

Many accessibility features available on mobile platforms require applications (apps) to provide complete and accurate metadata describing user interface (UI) components. Unfortunately, many apps do not provide sufficient metadata for accessibility features to work as expected. In this paper, we explore inferring accessibility metadata for mobile apps from their pixels, as the visual interfaces often best reflect an app's full functionality. We trained a robust, fast, memory-efficient, on-device model to detect UI elements using a dataset of 77,637 screens (from 4,068 iPhone apps) that we collected and annotated. To further improve UI detections and add semantic information, we introduced heuristics (e.g., UI grouping and ordering) and additional models (e.g., recognize UI content, state, interactivity). We built Screen Recognition to generate accessibility metadata to augment iOS VoiceOver. In a study with 9 screen reader users, we validated that our approach improves the accessibility of existing mobile apps, enabling even previously inaccessible apps to be used. CCS Concepts: Human-centered computing → Accessibility technologies.

“…A particular challenge in this area has been the lack of an authentic dataset of photos taken by the blind. To address the issue, Gurari et al (2020) created VizWiz-Captions, a dataset that consists of descriptions of images taken by people who are blind. In addition, they analyzed how the SOTA image captioning algorithms performed on this dataset.…”

Section: Related Work (mentioning)

confidence: 99%

“…The VizWiz-Captions dataset (Gurari et al, 2020) consists of over 39,000 images originating from people who are blind that are each paired with five captions. The dataset consists of 23,431 training images, 7,750 validation images and 8,000 test images.…”

Section: Dataset (mentioning)

confidence: 99%

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

2021

“…Thus, these models perform poorly on images clicked by blind people largely because the images clicked by blind people differ dramatically from the images present in the datasets. To encourage solving this problem, Gurari et al (2020) released the VizWiz dataset, a dataset comprising of images taken by the blind. Current work on captioning images for the blind do not use the text detected in the image when generating captions (Figures 1a and 1b show two images from the VizWiz dataset that contain text).…”

Section: Introduction (mentioning)

confidence: 99%

Multi-Modal Image Captioning for the Visually Impaired

Ahsan, Bhatt, Shah et al. 2021

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Rese

One of the ways blind people understand their surroundings is by clicking images and relying on descriptions generated by image captioning systems. Current work on captioning images for the visually impaired do not use the textual data present in the image when generating captions. This problem is critical as many visual scenes contain text. Moreover, up to 21% of the questions asked by blind people about the images they click pertain to the text present in them (Bigham et al., 2010). In this work, we propose altering AoANet, a state-of-the-art image captioning model, to leverage the text detected in the image as an input feature. In addition, we use a pointer-generator mechanism to copy the detected text to the caption when tokens need to be reproduced accurately. Our model outperforms AoANet on the benchmark dataset VizWiz, giving a 35% and 16.2% performance improvement on CIDEr and SPICE scores, respectively.
