Proceedings of the 29th International Conference on Advances in Geographic Information Systems 2021
DOI: 10.1145/3474717.3483631

Remote Sensing Image Captioning with Continuous Output Neural Models

Cited by 5 publications (2 citation statements); references 1 publication.
“…Retrieved context or exemplars come from large-scale unlabeled collections or labeled sets for supervised learning [73,38]. Computer vision and multimodal models also benefit from retrieval augmentation [41,51]. A focus in NLP has been methods to adapt a pretrained parametric language model into one that can integrate information from retrieved sequences, and joint training of a retriever and generation model.…”
Section: Background and Related Work
Citation type: mentioning
confidence: 99%
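
The retrieval-augmentation idea in the excerpt above can be made concrete with a small sketch: embed the query image, fetch the most similar labeled exemplars by cosine similarity, and prepend their captions as context for the generator. Everything below (the retrieve_exemplars helper, the toy 4-dimensional embeddings, the prompt format) is an illustrative assumption, not the implementation used in the cited works [73,38,41,51].

```python
import numpy as np

def retrieve_exemplars(query_embedding, corpus_embeddings, corpus_captions, k=3):
    """Return the captions of the k corpus items most similar to the query,
    using cosine similarity between L2-normalised embeddings."""
    q = query_embedding / np.linalg.norm(query_embedding)
    c = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    scores = c @ q                        # cosine similarity of query vs. each item
    top = np.argsort(-scores)[:k]         # indices of the k nearest neighbours
    return [corpus_captions[i] for i in top]

# Toy usage: 4-dimensional random vectors stand in for real image/text features.
rng = np.random.default_rng(0)
corpus_emb = rng.random((5, 4))
corpus_cap = [f"caption {i}" for i in range(5)]
query_emb = rng.random(4)

exemplars = retrieve_exemplars(query_emb, corpus_emb, corpus_cap, k=2)
prompt = " ".join(exemplars) + " ->"      # retrieved exemplars prepended as context
print(prompt)
```

In a jointly trained system, the retriever's embedding function and the generator would be optimised together; here the retrieval step is kept separate purely for clarity.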
“…ClipCap [113] leverages this collaboration without re-training CLIP or GPT-2; instead, a lightweight transformer-based mapping module is trained to match CLIP representations to GPT-2, which eventually generates the caption c. The CLIP-GPT-2 combination was also followed in VC-GPT [114]. Another lightweight improvement, combining a pre-trained CLIP visual encoder and a frozen GPT-2 text decoder, further boosts the performance of low-resource approaches [115]. A cross-modal filter that selects the most relevant visual information, so that captioning errors are reduced, is proposed in [116], still respecting the frozen CLIP-GPT-2 framework.…”
Section: Image Captioning (IC)
Citation type: mentioning
confidence: 99%
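
The frozen-backbone recipe described in this excerpt can be illustrated with a minimal sketch: a small trainable module maps a CLIP image embedding to a fixed-length prefix in the GPT-2 embedding space, while CLIP and GPT-2 themselves stay frozen. The class name ClipPrefixMapper, the dimensions (512 for a CLIP ViT-B/32 image embedding, 768 for GPT-2 base) and the hyperparameters are assumptions for illustration, not the exact ClipCap [113] configuration, and the frozen GPT-2 decoding step itself is omitted.

```python
import torch
import torch.nn as nn

class ClipPrefixMapper(nn.Module):
    """Maps one CLIP image embedding to a sequence of prefix vectors in the
    GPT-2 embedding space; a frozen GPT-2 would then decode the caption from
    this prefix as if it were token embeddings."""

    def __init__(self, clip_dim=512, gpt2_dim=768, prefix_len=10,
                 n_layers=4, n_heads=8):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt2_dim = gpt2_dim
        # Expand the single CLIP vector into prefix_len vectors of GPT-2 width.
        self.proj = nn.Linear(clip_dim, prefix_len * gpt2_dim)
        layer = nn.TransformerEncoderLayer(d_model=gpt2_dim, nhead=n_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, clip_embedding):                       # (batch, clip_dim)
        prefix = self.proj(clip_embedding)                   # (batch, prefix_len * gpt2_dim)
        prefix = prefix.view(-1, self.prefix_len, self.gpt2_dim)
        return self.transformer(prefix)                      # (batch, prefix_len, gpt2_dim)

mapper = ClipPrefixMapper()
dummy_clip_features = torch.randn(2, 512)   # stand-in for frozen CLIP image features
prefix = mapper(dummy_clip_features)
print(prefix.shape)                         # torch.Size([2, 10, 768])
```

Because only the mapping module is trained, the approach remains lightweight, which is the property the low-resource variants [115] build on.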