2021
DOI: 10.1007/978-3-030-92273-3_9

A Joint Representation Learning Approach for Social Media Tag Recommendation

Cited by 4 publications (7 citation statements). References 16 publications.
“…Protein-Text Contrastive Learning (PTC) aims to align the feature spaces of the protein encoder and the text decoder, encouraging parallel protein-text pairs to have higher similarity scores. This objective has been demonstrated to be effective in ALBEF [32] for image-text learning.…”
Section: Methods
confidence: 99%
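
The PTC objective quoted above is an InfoNCE-style contrastive loss over paired embeddings. A minimal PyTorch sketch, assuming batch-aligned protein and text embeddings; the function name, temperature value, and symmetric two-direction form are illustrative assumptions, not the cited paper's code:

import torch
import torch.nn.functional as F

def contrastive_loss(protein_emb, text_emb, temperature=0.07):
    # Normalize so dot products become cosine similarities.
    p = F.normalize(protein_emb, dim=-1)   # (batch, dim)
    t = F.normalize(text_emb, dim=-1)      # (batch, dim)
    logits = p @ t.T / temperature         # (batch, batch) similarity matrix
    targets = torch.arange(p.size(0))      # matched pairs lie on the diagonal
    # Symmetric InfoNCE: protein-to-text and text-to-protein directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2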
“…Protein-Text Matching (PTM) predicts whether a protein-text pair is positive (matched) or negative (not matched). We follow the negative sampling strategy [32], where negative pairs with higher contrastive similarity within a batch have a higher chance of being sampled. The objective is to learn a multimodal protein-text representation for pairs that share similar semantics but differ in fine-grained details.…”
Section: Methods
confidence: 99%
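
The hard-negative strategy quoted above can be sketched as weighted sampling over the batch similarity matrix produced by the contrastive step; the function name and temperature below are assumptions for illustration:

import torch

def sample_hard_negatives(sim, temperature=0.07):
    # sim: (batch, batch) protein-text similarities from the contrastive step.
    with torch.no_grad():
        weights = (sim / temperature).softmax(dim=-1)
        weights.fill_diagonal_(0.0)  # never sample the matched (positive) pair
        # Each row is a sampling distribution: similar mismatches
        # (hard negatives) are drawn more often than easy ones.
        return torch.multinomial(weights, num_samples=1).squeeze(-1)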
“…Four TF blocks are used: one unimodal and one multimodal block per modality. In the multimodal blocks, pairwise cross-attention is added as a means of communication between the modalities, similar to [29,28,65], to ensure that the modalities are coordinated. The model is trained with supervised losses on both the unimodal and multimodal predictions, plus a self-supervised objective that aligns the unimodal representations, following previous work in vision and text [49,29].…”
Section: Our Contribution
confidence: 99%
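
Pairwise cross-attention of the kind described can be sketched with PyTorch's nn.MultiheadAttention, one block per direction; the class name, dimensions, and residual/LayerNorm placement are assumptions, not the paper's exact architecture:

import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    # Tokens of one modality attend to the tokens of the other.
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, other):
        # Queries from this modality; keys and values from the other.
        attended, _ = self.attn(x, other, other)
        return self.norm(x + attended)  # residual connection

# Pairwise use: one block per direction between two modalities.
a_to_b, b_to_a = CrossModalBlock(), CrossModalBlock()
a = torch.randn(2, 30, 256)  # (batch, tokens, dim)
b = torch.randn(2, 30, 256)
a_fused, b_fused = a_to_b(a, b), b_to_a(b, a)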
“…Most works follow the general framework of aggregating information from one part of the network and transmitting it in compressed form to the others. This can be achieved, for example, by summing intermediate CNN channels [20], exchanging intermediate representations [62], Squeeze-and-Excite gates [34], or cross-attention weights [32,59,29,28,65]. The latter approach will also be exploited by CoRe-Sleep.…”
Section: Introduction
confidence: 99%
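
Of the mechanisms listed, a Squeeze-and-Excite-style gate can be sketched as one modality's pooled channel statistics rescaling the other modality's channels; everything below (class name, pooling choice, reduction factor) is an assumed illustration of the general idea, not the cited implementation:

import torch
import torch.nn as nn

class CrossModalSEGate(nn.Module):
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x, other):
        # Squeeze: global-average-pool the other modality over time.
        s = other.mean(dim=-1)           # (batch, channels)
        # Excite: per-channel gates rescale this modality's features.
        gate = self.fc(s).unsqueeze(-1)  # (batch, channels, 1)
        return x * gate                  # x: (batch, channels, time)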
“…Owens et al. [12] studied a self-supervised multi-sensory training method that automatically aligns audio and video. Li et al. [13] used pre-fusion alignment and momentum distillation to maximize mutual information, making the modalities correspond to one another as closely as possible so as to balance resources across them. Wang et al. [14] used a gradient-mixing method to address the problem that different modalities generalize at different speeds.…”
Section: Related Work
confidence: 99%
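
The momentum-distillation idea attributed to Li et al. [13] rests on an exponential-moving-average (EMA) teacher; a minimal sketch of that update, with the helper name and momentum value as assumptions:

import copy
import torch

def ema_update(student, teacher, momentum=0.995):
    # Teacher weights track a slow moving average of the student's,
    # providing stable (pseudo-)targets for distillation.
    with torch.no_grad():
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

student = torch.nn.Linear(16, 16)
teacher = copy.deepcopy(student).requires_grad_(False)  # frozen copy at init
ema_update(student, teacher)  # called after each optimizer step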