2016
DOI: 10.48550/arxiv.1604.01729
Preprint

Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text

Cited by 18 publications (25 citation statements)
References 0 publications
“…The general framework used also consists of applying one or several CNNs to the input images, followed by a generative LSTM for sentence generation. Several improvements and variations have been proposed to the main model, some examples being the use of additional CNN representations like optical flow (Venugopalan et al., 2015; Yao et al., 2015), Bidirectional LSTMs (BLSTM) in the encoder (Peris et al., 2016), attention mechanisms in the decoder (Yao et al., 2015), hierarchical information (Pan et al., 2016), external linguistic knowledge (Venugopalan et al., 2016) or multi-modal attention mechanisms (Hori et al., 2017), among others.…”
Section: Related Work
confidence: 99%
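The CNN-encoder plus LSTM-decoder framework described in this excerpt can be sketched in a few lines. The snippet below is a minimal PyTorch illustration, not any of the cited models: the class name, feature and hidden dimensions, and the use of pre-extracted per-frame CNN features are assumptions made for the example.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Encode per-frame CNN features with an LSTM, then decode a sentence word by word."""
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim) -- e.g. pooled CNN activations per frame
        # captions:    (batch, seq_len) token ids of the target sentence (teacher forcing)
        _, (h, c) = self.encoder(frame_feats)      # summarize the video into the final state
        dec_in = self.embed(captions)              # (batch, seq_len, embed_dim)
        dec_out, _ = self.decoder(dec_in, (h, c))  # condition decoding on the video state
        return self.out(dec_out)                   # (batch, seq_len, vocab_size) logits

# Toy usage with random tensors standing in for real CNN features and tokenized captions
model = VideoCaptioner()
feats = torch.randn(2, 16, 2048)          # 2 clips, 16 frames, 2048-d features each
caps = torch.randint(0, 10000, (2, 12))   # 2 captions of 12 tokens
logits = model(feats, caps)
print(logits.shape)  # torch.Size([2, 12, 10000])
```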
“…Similarly to the video description problem, in the egocentric task, we have as input a sequence of frames and we want to output a sentence that describes the input. This problem has already been tackled in the literature on conventional videos with multiple variations (Pan et al., 2016; Venugopalan et al., 2015, 2016; Yao et al., 2015; Peris et al., 2016). In the latter work, the frames were encoded by a CNN and a BLSTM network.…”
Section: Egocentric Captioning
confidence: 99%
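As a rough illustration of the BLSTM-encoder variant mentioned in the excerpt, the sketch below runs a bidirectional LSTM over per-frame features and pools the result into a single video vector. The mean pooling and the projection layer are assumptions for the example, not details taken from Peris et al. (2016).

```python
import torch
import torch.nn as nn

# Bidirectional LSTM over frame features; forward and backward states are
# concatenated, so anything downstream must expect 2 * hidden_dim.
feat_dim, hidden_dim = 2048, 512
blstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
proj = nn.Linear(2 * hidden_dim, hidden_dim)  # fold both directions back to hidden_dim

frame_feats = torch.randn(2, 16, feat_dim)    # (batch, num_frames, feat_dim)
enc_out, _ = blstm(frame_feats)               # (batch, num_frames, 2 * hidden_dim)
video_state = proj(enc_out.mean(dim=1))       # one pooled, projected vector per video
print(video_state.shape)  # torch.Size([2, 512])
```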
“…al. [78] proposed an investigation of how linguistic knowledge can aid video-to-text generation. In this approach, they combined a neural language model and distributional semantic embeddings for text generation in three ways, viz.…”
Section: Recent
confidence: 99%
“…The words in the caption are represented as vectors in a Word Embedding (WE) [34], [35]. The WE is usually learned during the training of the NLVD system [36], or in some cases is a pretrained WE, as in [37]. The decoder is trained to predict the probability of each word in the vocabulary to be the next one in the sentence based on the video encoding and the previous words in the sentence.…”
Section: Related Work
confidence: 99%
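The decoding step described in this excerpt (embed the previously generated words, condition on the video encoding, and predict a probability distribution over the vocabulary for the next word) can be illustrated with a small PyTorch sketch. The tensor sizes and the use of the video encoding as the decoder's initial LSTM state are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Next-word prediction: embed the prefix, run the decoder LSTM conditioned on
# the video encoding, and softmax over the vocabulary.
vocab_size, embed_dim, hidden_dim = 10000, 300, 512
embed = nn.Embedding(vocab_size, embed_dim)       # learned with the rest of the system, or pretrained
decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
out = nn.Linear(hidden_dim, vocab_size)

prev_words = torch.randint(0, vocab_size, (2, 5))  # tokens generated so far (batch of 2)
video_h = torch.randn(1, 2, hidden_dim)            # video encoding used as the initial hidden state
video_c = torch.zeros(1, 2, hidden_dim)
dec_out, _ = decoder(embed(prev_words), (video_h, video_c))
next_word_probs = torch.softmax(out(dec_out[:, -1]), dim=-1)  # P(next word | video, prefix)
print(next_word_probs.shape)  # torch.Size([2, 10000])
```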