2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00054
Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings

Abstract: We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved through learning a shared embedding space that can indifferently embed modalities. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval.
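The abstract describes one multi-modal embedding space per PoS tag whose outputs feed a final integrated space. Below is a minimal PyTorch sketch of that structure; the module names (GatedEmbedding, JPoSELike), layer sizes, plain MLP projections, and the single triplet loss are illustrative assumptions, not the authors' released implementation, and the loss is only a stand-in for the paper's full training objective.

```python
# Sketch of a PoS-disentangled cross-modal embedding (hypothetical sizes/names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedEmbedding(nn.Module):
    """Project one modality into a shared space (simple MLP stand-in)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                                nn.Linear(out_dim, out_dim))

    def forward(self, x):
        return F.normalize(self.fc(x), dim=-1)  # unit norm, so dot product = cosine similarity

class JPoSELike(nn.Module):
    """One shared space per PoS tag (e.g. verbs, nouns), then an integrated space."""
    def __init__(self, video_dim=2048, text_dim=300, pos_tags=("verb", "noun"), emb_dim=256):
        super().__init__()
        self.pos_tags = pos_tags
        self.video_enc = nn.ModuleDict({p: GatedEmbedding(video_dim, emb_dim) for p in pos_tags})
        self.text_enc = nn.ModuleDict({p: GatedEmbedding(text_dim, emb_dim) for p in pos_tags})
        # Per-PoS outputs are concatenated and embedded once more into the joint space.
        self.video_joint = GatedEmbedding(emb_dim * len(pos_tags), emb_dim)
        self.text_joint = GatedEmbedding(emb_dim * len(pos_tags), emb_dim)

    def forward(self, video_feat, text_feat_per_pos):
        # text_feat_per_pos: dict mapping PoS tag -> features of the caption words with that tag
        v_parts = [self.video_enc[p](video_feat) for p in self.pos_tags]
        t_parts = [self.text_enc[p](text_feat_per_pos[p]) for p in self.pos_tags]
        v_joint = self.video_joint(torch.cat(v_parts, dim=-1))
        t_joint = self.text_joint(torch.cat(t_parts, dim=-1))
        return v_parts, t_parts, v_joint, t_joint

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Ranking loss usable on any of the spaces (per-PoS or joint)."""
    pos_sim = (anchor * positive).sum(-1)
    neg_sim = (anchor * negative).sum(-1)
    return F.relu(margin - pos_sim + neg_sim).mean()
```

In this sketch, retrieval in the joint space reduces to ranking videos by the dot product of unit-normalised caption and video embeddings.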

Cited by 147 publications (108 citation statements). References 35 publications.
“…Kazakos et al. [69] proposed to fuse video and audio signals for FPV action recognition. Wray et al. [70] made use of text descriptions of FPV actions for zero-shot learning. In contrast to our work, these prior works did not consider using egocentric gaze for action recognition.…”
Section: First Person Vision (mentioning, confidence: 99%)
“…Specifically, our model achieves an 8.9 R@1 improvement over the original HowTo100M model (Miech et al., 2019) and other recent baselines with pre-training on HowTo100M. Using a smaller set of visual fea…”
[Excerpt truncated by table residue: R@1/R@5/R@10 comparisons covering JSFusion, JPoSE (Wray et al., 2019), VSE (Kiros et al., 2014), VSE++ (Faghri et al., 2018), Dual (Dong et al., 2019), and HGR (Chen et al., 2020a).]
Section: Comparison To Supervised State Of The Art (mentioning, confidence: 88%)
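The excerpt above compares methods by R@1, R@5, and R@10 (Recall@K). As a quick reference, here is a small self-contained sketch of how Recall@K is typically computed for text-to-video retrieval from a caption-by-video similarity matrix; the function name and the random toy data are illustrative and not taken from any of the cited papers.

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j] = similarity of query caption i to video j.
    Ground truth: caption i matches video i. Returns the fraction of
    queries whose correct video appears among the top-k ranked results."""
    ranks = np.argsort(-sim, axis=1)  # best match first
    hits = (ranks[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return hits.mean()

# Toy usage with random scores for 100 caption-video pairs.
rng = np.random.default_rng(0)
sim = rng.standard_normal((100, 100))
print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))
```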
“…Each training epoch is performed on a single GPU and takes no more than 10 minutes. [30], MEE [27], MMEN [43], and JPoSE [43], and (3) other methods: JSFusion [49], CCA (FV HGLMM) [16], and Miech et al. [26]. The experimental results on MSR-VTT and LSMDC are summarized, respectively, in Table 1 and Table 2.…”
Section: Methods (mentioning, confidence: 99%)