2022
DOI: 10.48550/arxiv.2202.10492
Preprint

CaMEL: Mean Teacher Learning for Image Captioning

Abstract: Describing images in natural language is a fundamental step towards the automatic modeling of connections between the visual and textual modalities. In this paper we present CaMEL, a novel Transformer-based architecture for image captioning. Our proposed approach leverages the interaction of two interconnected language models that learn from each other during the training phase. The interplay between the two language models follows a mean teacher learning paradigm with knowledge distillation. Experimentally, w…
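
The abstract names a mean teacher learning paradigm with knowledge distillation between two interconnected language models. The snippet below is a minimal, illustrative sketch of that general paradigm only, not the CaMEL implementation; the function names, EMA decay, and temperature are assumptions.

```python
# Minimal, illustrative sketch of a mean-teacher pair with knowledge
# distillation (the general paradigm named in the abstract, not the actual
# CaMEL code). Names, decay, and temperature values are assumptions.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


def make_teacher(student: nn.Module) -> nn.Module:
    """The teacher starts as a frozen copy of the student."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher


@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, decay: float = 0.999) -> None:
    """Mean-teacher step: teacher weights track an exponential moving average
    of the student weights instead of being updated by backpropagation."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Knowledge distillation: KL divergence between the teacher's soft
    labels and the student's predicted token distribution."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2


# Toy usage with a single linear layer standing in for each language model.
vocab_size, hidden = 100, 32
student = nn.Linear(hidden, vocab_size)
teacher = make_teacher(student)

states = torch.randn(8, hidden)                             # fake decoder states
loss = distillation_loss(student(states), teacher(states))  # student aligns to teacher
loss.backward()
# ... optimizer step on the student would go here, then:
ema_update(teacher, student)
```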

Cited by 1 publication (1 citation statement)
References 44 publications (93 reference statements)

“…These are in the form of soft labels, which the captioning model has to align within the cross-entropy phase, and re-weighting of the caption words to guide the fine-tuning phase. [5] improve the quality with the interaction of two interconnected language models that learn from each other. Additional improvement to the performance of recent self-attention-based image captioning approaches is due to the use of large-scale vision-and-language pre-training [8,33,43,62,65], which can be done on noisy and weakly annotated image-text pairs, also exploiting pre-training losses different from cross-entropy, such as the masked word loss [62].…”
Section: Related Work (mentioning)
confidence: 99%
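
The citation statement above also notes pre-training with a masked word loss [62] as an alternative to cross-entropy. Below is a minimal, generic sketch of such an objective; the toy model, mask token id, and 15% masking rate are illustrative assumptions, not details taken from the cited works.

```python
# Illustrative sketch of a masked word (masked language modeling) loss of the
# kind cited above as a vision-and-language pre-training objective [62].
# The toy model, mask token id, and masking rate are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def masked_word_loss(model: nn.Module, tokens: torch.Tensor,
                     mask_token_id: int, mask_prob: float = 0.15) -> torch.Tensor:
    """Randomly replace input tokens with a mask token and train the model to
    recover the originals; unmasked positions are ignored by the loss."""
    mask = torch.rand(tokens.shape) < mask_prob
    inputs = tokens.clone()
    inputs[mask] = mask_token_id
    targets = tokens.clone()
    targets[~mask] = -100                                # ignored by cross_entropy
    logits = model(inputs)                               # (batch, seq_len, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=-100)


# Toy usage: an embedding plus a linear head standing in for a captioner.
vocab = 50
toy_model = nn.Sequential(nn.Embedding(vocab, 16), nn.Linear(16, vocab))
tokens = torch.randint(0, vocab - 1, (4, 12))            # last id reserved for [MASK]
loss = masked_word_loss(toy_model, tokens, mask_token_id=vocab - 1)
loss.backward()
```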