Interspeech 2020
DOI: 10.21437/interspeech.2020-0087

Unsupervised vs. Transfer Learning for Multimodal One-Shot Matching of Speech and Images

Abstract: We consider the task of multimodal one-shot speech-image matching. An agent is shown a picture along with a spoken word describing the object in the picture, e.g. cookie, broccoli and ice-cream. After observing one paired speech-image example per class, it is shown a new set of unseen pictures, and asked to pick the "ice-cream". Previous work attempted to tackle this problem using transfer learning: supervised models are trained on labelled background data not containing any of the one-shot classes. Here we co…
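For concreteness, the task in the abstract can be framed as a sequence of episodes: a support set with one paired speech-image example per class, a spoken query, and a matching set of unseen images from which one must be picked. The Python sketch below only illustrates that protocol; the Episode fields, the matcher interface and one_shot_accuracy are assumed names for illustration, not code from the paper.

from dataclasses import dataclass
from typing import Callable, List

import numpy as np

@dataclass
class Episode:
    # Hedged sketch of one multimodal one-shot episode; field names are illustrative.
    support_speech: List[np.ndarray]   # one spoken word per class, e.g. "cookie"
    support_images: List[np.ndarray]   # the picture paired with each spoken word
    query_speech: np.ndarray           # the spoken query, e.g. "ice-cream"
    matching_images: List[np.ndarray]  # unseen pictures to choose from
    correct_index: int                 # index of the picture matching the query

def one_shot_accuracy(episodes: List[Episode],
                      matcher: Callable[[Episode], int]) -> float:
    """Fraction of episodes in which the matcher picks the correct image."""
    hits = sum(matcher(ep) == ep.correct_index for ep in episodes)
    return hits / len(episodes)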

Cited by 4 publications (17 citation statements)
References 29 publications (50 reference statements)
“…The first row is a naive indirect baseline where matching is performed on the input features (dynamic time warping (DTW) over MFCCs for speech and cosine distance over image pixels). The best overall score of 85.5% is achieved by the direct MTriplet, giving an absolute improvement of more than 25% over the best previous result [8]. This model is followed by the direct MCAE.…”
Section: Results
confidence: 81%
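The naive indirect baseline quoted above compares speech segments with dynamic time warping (DTW) over MFCC frames and compares images with cosine distance over raw pixels. A minimal Python sketch of those two unimodal distances follows; the frame-wise Euclidean cost, the length normalisation of the DTW score and all function names are assumptions for illustration, not the cited implementation.

import numpy as np
from scipy.spatial.distance import cdist

def dtw_distance(mfcc_a, mfcc_b):
    """DTW alignment cost between two MFCC sequences of shape (n_frames, n_coeffs)."""
    cost = cdist(mfcc_a, mfcc_b, metric="euclidean")  # frame-wise distances (assumed metric)
    acc = np.full((cost.shape[0] + 1, cost.shape[1] + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, acc.shape[0]):
        for j in range(1, acc.shape[1]):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    # Length normalisation is an assumption for comparability across segment lengths.
    return acc[-1, -1] / (mfcc_a.shape[0] + mfcc_b.shape[0])

def pixel_cosine_distance(img_a, img_b):
    """Cosine distance between two images compared directly on their raw pixels."""
    a = img_a.ravel().astype(float)
    b = img_b.ravel().astype(float)
    return 1.0 - float(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))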
“…Previous work [7,8] used a two-step indirect approach: a spoken query is compared to the spoken examples in the given support set of speech-image pairs, and the corresponding image is then used to select the closest item in the unseen matching set. The task is therefore reduced to two unimodal comparisons, with the support set acting as a pivot between the modalities.…”
Section: Introduction
confidence: 99%
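This two-step indirect approach reduces cross-modal matching to two unimodal comparisons, with the support set acting as a pivot. Below is a hedged Python sketch of that procedure; indirect_match, speech_dist and image_dist are illustrative names (any unimodal distances, e.g. the DTW and pixel-cosine baselines sketched earlier, could be plugged in), not the authors' code.

import numpy as np

def indirect_match(query_speech, support_set, matching_images,
                   speech_dist, image_dist):
    """Pick an image from the unseen matching set for a spoken query.

    support_set: list of (speech_example, image_example) pairs, one per class.
    speech_dist, image_dist: unimodal distance functions.
    """
    # Step 1: compare the spoken query to the spoken examples in the support set.
    speech_scores = [speech_dist(query_speech, s) for s, _ in support_set]
    pivot_image = support_set[int(np.argmin(speech_scores))][1]

    # Step 2: use the paired support image to select the closest unseen image.
    image_scores = [image_dist(pivot_image, m) for m in matching_images]
    return int(np.argmin(image_scores))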
“…One form of weak supervision is using images paired with spoken captions [7][8][9][10][11][12][13][14][15][16]. Compared to using labelled data, this form of visual supervision is closer to the types of signals that infants would have access to while learning their first language [17][18][19][20][21][22], and to how one would teach new words to robots using spoken language [23,24]. It is also conceivable that this type of visual supervision could be easier to obtain when developing systems for low-resource languages [25], e.g.…”
Section: Introduction
confidence: 99%