2017
DOI: 10.48550/arxiv.1707.08364
Preprint
Deep Interactive Region Segmentation and Captioning

Cited by 2 publications (4 citation statements)
References 0 publications
“…These solutions, whether they rely on transformers and attention mechanisms [6,7,8], on scene graphs as presented in [9], in which learning is supervised, or on beam-search analysis or gated recurrent units (GRUs), in which learning is unsupervised [10,11], generate a single sentence for each input image. Such models are trained on RGB image datasets [12,13].…”
Section: Sentence Captioning
confidence: 99%
“…A fully convolutional network (FCN) is trained to predict the foreground/background from image-user interaction pairs. With similar image-user interaction pairs as input to the network, Boroujerdi et al [17] use a lyncean fully convolutional network to predict foreground/background. This network replaces the last two convolutional layers of the FCN in [15] with three convolutional layers of gradually decreasing kernel size to better capture the geometry of objects.…”
Section: Related Work
confidence: 99%
“…This algorithm leads to improved performance, since it is more closely aligned with the patterns of real users. Essentially, all these networks [15,17,16,18,19,21,28] adopt early-fusion structures: they combine the image and the user-interaction features from the first layer of the DNN.…”
Section: Related Work
confidence: 99%
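The early-fusion input format described above can be sketched as follows. This is a minimal illustration, assuming the click-to-distance-map encoding commonly used in this line of interactive-segmentation work; the function name `early_fusion_input` and the 255 clipping value are assumptions for the example, not details taken from the cited papers.

```python
import numpy as np

def early_fusion_input(rgb, fg_clicks, bg_clicks):
    """Build an early-fusion input: the RGB image concatenated with
    foreground/background click distance maps along the channel axis,
    so image and user-interaction features enter the network together
    from the first layer."""
    h, w, _ = rgb.shape
    ys, xs = np.mgrid[0:h, 0:w]

    def dist_map(clicks):
        # Euclidean distance from each pixel to its nearest click,
        # clipped to 255; an empty click set maps to a constant 255 plane.
        if not clicks:
            return np.full((h, w), 255.0)
        d = np.min([np.sqrt((ys - y) ** 2 + (xs - x) ** 2)
                    for y, x in clicks], axis=0)
        return np.minimum(d, 255.0)

    # (h, w, 3) image + two (h, w) maps -> (h, w, 5) fused input.
    return np.dstack([rgb, dist_map(fg_clicks), dist_map(bg_clicks)])
```

A five-channel tensor like this would then be fed to the FCN in place of the plain RGB image; later work cited above varies the fusion point and the interaction encoding rather than this basic idea.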