Deep Interactive Region Segmentation and Captioning

Boroujerdi, Ali Sharifi; Khanian, Maryam; Breuß, Michael

doi:10.48550/arxiv.1707.08364

Cited by 2 publications

(4 citation statements)

References 0 publications


“…These solutions, whether they rely on transformers and attention mechanisms [6,7,8], or scene graphs as presented in [9], in which learning is supervised, or on beam search analysis or gated recurrent units (GRUs), in which learning is unsupervised [10,11], generate one single sentence for each input image. Such models are trained on RGB image datasets [12,13].…”

Section: Sentence Captioning (mentioning)

confidence: 99%

Egocentric Scene Description for the Blind and Visually Impaired

Delloul

Larabi

2022

2022 5th International Symposium on Informatics and Its Applications (ISIA)


In recent years, image captioning and segmentation have emerged as crucial tasks in computer vision, with applications ranging from autonomous driving to content analysis. Although multiple solutions have emerged to help blind and visually impaired people move around their environment, few applications help them understand and rebuild a scene in their minds through text. Most existing models focus on helping users move and avoid obstacles, which restricts the number of environments blind and visually impaired people can be in. In this paper, we propose an approach that helps them understand their surroundings using image captioning. The particularity of our research is that we offer them descriptions with the positions of regions and objects relative to them (left, right, front), as well as positional relationships between regions. We also aim to give them access to theatre plays by applying the solution to our TS-RGBD dataset.


“…A fully convolutional network (FCN) is trained to predict the foreground/background from image-user interaction pairs. With similar image-user interaction pairs as input to the network, Boroujerdi et al. [17] use a lyncean fully convolutional network to predict foreground/background. This network replaces the last two convolutional layers in the FCN in [15] with three convolutional layers with gradually decreased kernel size to better capture the geometry of objects.…”

Section: Related Work (mentioning)

confidence: 99%

“…This algorithm leads to improved performance, since it is more closely aligned with the patterns of real users. Essentially, all these networks [15,17,16,18,19,21,28] adopt early fusion structures. They combine the image and the user interaction features from the first layer of DNN.…”

Section: Related Work (mentioning)

confidence: 99%

“…Recently, deep features produced by deep neural networks (DNNs) have shown their power in many computer vision tasks including image classification [8,9,10] and semantic segmentation [11,12,13,14]. Thus, several researchers [15,16,17,18,19,20,21] have used DNNs to extract deep features with higher-level understanding for image and user interactions to improve interactive image segmentation. Most of these DNN-based methods can be viewed as an early fusion of features using DNN.…”

Section: Introduction (mentioning)

confidence: 99%
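The citation statements above describe early fusion only in words: the image and the user interaction are combined before the first layer of the network. A minimal sketch of what such an input might look like follows. It assumes, beyond what the quoted text states, that foreground and background clicks are encoded as truncated Euclidean distance maps and stacked onto the RGB channels; the function names `click_map` and `early_fusion_input` and the cap value of 255 are hypothetical and chosen only for illustration.

```python
import numpy as np


def click_map(shape, clicks, cap=255.0):
    """Distance from every pixel to its nearest click, truncated at `cap`.

    `clicks` is a list of (row, col) positions. The distance-map encoding
    is an assumption for this sketch, not a detail taken from the source.
    """
    h, w = shape
    if not clicks:
        # No clicks of this kind: every pixel is "far away".
        return np.full((h, w), cap, dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.asarray(clicks, dtype=np.float32)  # shape (n, 2)
    # Broadcast to (h, w, n): distance of each pixel to each click.
    dist = np.sqrt((ys[..., None] - pts[:, 0]) ** 2 +
                   (xs[..., None] - pts[:, 1]) ** 2)
    return np.minimum(dist.min(axis=-1), cap).astype(np.float32)


def early_fusion_input(image, fg_clicks, bg_clicks):
    """Stack RGB with foreground/background click maps into one tensor.

    The result has shape (H, W, 5), so the first layer of a network
    would see image and interaction channels together (early fusion).
    """
    h, w, _ = image.shape
    fg = click_map((h, w), fg_clicks)
    bg = click_map((h, w), bg_clicks)
    return np.concatenate(
        [image.astype(np.float32), fg[..., None], bg[..., None]], axis=-1
    )


if __name__ == "__main__":
    img = np.zeros((4, 6, 3), dtype=np.uint8)
    x = early_fusion_input(img, fg_clicks=[(1, 2)], bg_clicks=[])
    print(x.shape)
```

A late-fusion design, by contrast, would pass the image channels and the click-map channels through separate streams and merge their features deeper in the network.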


A fully convolutional two-stream fusion network for interactive image segmentation

Soltoggio

Lock

et al. 2019

Neural Networks


In this paper, we propose a novel fully convolutional two-stream fusion network (FCTSFN) for interactive image segmentation. The proposed network includes two sub-networks: a two-stream late fusion network (TSLFN) that predicts the foreground at a reduced resolution, and a multi-scale refining network (MSRN) that refines the foreground at full resolution. The TSLFN includes two distinct deep streams followed by a fusion network. The intuition is that, since user interactions are more direct information on foreground/background than the image itself, the two-stream structure of the TSLFN reduces the number of layers between the pure user interaction features and the network output, allowing the user interactions to have a more direct impact on the segmentation result. The MSRN fuses the features from different layers of TSLFN with different scales, in order to seek the local to global information on the foreground to refine the segmentation result at full resolution. We conduct comprehensive experiments on four benchmark datasets. The results show that the proposed network achieves competitive performance compared to current state-of-the-art interactive image segmentation methods.
