2018
DOI: 10.1007/s13735-018-0151-5

End-to-end cross-modality retrieval with CCA projections and pairwise ranking loss

Abstract: Cross-modality retrieval encompasses retrieval tasks where the fetched items are of a different type than the search query, e.g., retrieving pictures relevant to a given text query. The state-of-the-art approach to cross-modality retrieval relies on learning a joint embedding space of the two modalities, where items from either modality are retrieved using nearest-neighbor search. In this work, we introduce a neural network layer based on canonical correlation analysis (CCA) that learns better embedding spaces…
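As the abstract describes, retrieval in this setting reduces to nearest-neighbor search in the learned joint embedding space. A minimal sketch, assuming both modalities have already been projected into that space (the `retrieve` helper, the toy vectors, and cosine similarity as the ranking score are illustrative choices, not the paper's exact setup):

```python
import numpy as np

def retrieve(query_emb, item_embs, k=5):
    """Nearest-neighbor search in a shared embedding space.

    query_emb: (d,) embedding of the query (e.g., a text snippet)
    item_embs: (n, d) embeddings of the other modality (e.g., images)
    Returns the indices of the k most similar items by cosine similarity.
    """
    q = query_emb / np.linalg.norm(query_emb)
    items = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    sims = items @ q              # cosine similarity of each item to the query
    return np.argsort(-sims)[:k]  # best-first

# Toy example: 4 items in a 3-d embedding space.
items = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.05, 0.0])
print(retrieve(query, items, k=2))  # → [0 2]
```

Because both modalities share one space, the same function serves text→image and image→text queries; only which side supplies `query_emb` changes.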

Cited by 43 publications (29 citation statements)
References 16 publications
“…This is exactly the dimension of our retrieval embedding space. At the top of the network we put a canonically correlated embedding layer (Dorfer et al., 2018) combined with the ranking loss described above. The structure of the model is analogous to the one presented in Dorfer et al. (2017a), with the single difference that the sheet-image snippet is downsized by a factor of two (160 × 200 → 80 × 100) before being presented to the network.…”
Section: Embedding Space Learning
confidence: 99%
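The pairwise ranking loss referred to in this statement is, in this line of work, a max-margin (triplet-style) objective that pushes matching cross-modal pairs closer together than mismatched ones. A minimal sketch over a batch of aligned, L2-normalized embeddings (the function name and the margin value 0.7 are illustrative assumptions, not values taken from the paper):

```python
import numpy as np

def pairwise_ranking_loss(x, y, margin=0.7):
    """Max-margin pairwise ranking loss over a batch of aligned pairs.

    x, y: (n, d) L2-normalized embeddings of matching cross-modal pairs
          (row i of x corresponds to row i of y).
    Every mismatched pair within the batch acts as a negative: its
    similarity should stay at least `margin` below the matching pair's.
    """
    sims = x @ y.T       # (n, n) cosine similarities
    pos = np.diag(sims)  # similarities of the matching pairs
    # Hinge on every mismatched pair, in both retrieval directions.
    cost = np.maximum(0, margin - pos[:, None] + sims)   # x -> y retrieval
    cost += np.maximum(0, margin - pos[None, :] + sims)  # y -> x retrieval
    np.fill_diagonal(cost, 0)  # the positives themselves incur no cost
    return cost.sum() / len(x)
```

When matching pairs are already well separated from all negatives by the margin, the loss is zero; when all embeddings collapse onto one point, every negative violates the margin in both directions and the loss is maximal.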
“…This requires a different network architecture that can learn two separate projections, one for embedding the sheet music and one for embedding the audio, which can then be used independently of each other; see, for example, Dorfer et al. (2018), Learning Audio-Sheet Music Correspondences for Cross-Modal Retrieval and Piece Identification.…”
Section: Introduction
confidence: 99%
“…It generalizes the model for uni-modal queries over textual corpora of webpages and documents, which has been the primary concern of IR systems for several decades [2]. Various systems that support non-textual retrieval have been proposed recently [2], [3]. Here, a user searches for information represented as images or audio based on queries that include several modalities beyond textual keywords.…”
Section: Introduction
confidence: 99%
“…Recently, the first IR systems have started to support the specification of multi-modal queries [3], [6]. Yet, they do not genuinely evaluate multi-modal queries, but treat them as a combination of several uni-modal queries, each covering a different modality.…”
Section: Introduction
confidence: 99%
“…Choi et al [8] proposed a face identification method based on multi-class pairwise loss. Dorfer et al [9] proposed a multimodal retrieval method based on pairwise losses. Shi et al [10] proposed a deep ranking hash for histopathological retrieval and classification tasks.…”
Section: Introduction
confidence: 99%