2022 IEEE Spoken Language Technology Workshop (SLT), 2023
DOI: 10.1109/slt54892.2023.10023079

Towards Visually Prompted Keyword Localisation for Zero-Resource Spoken Languages


Cited by 5 publications (6 citation statements)
References 29 publications
“…2 and we call it MATTNET (Multimodal ATTention NETwork). We adapt the multimodal localising attention model of [19] that consists of an audio and a vision branch. For the vision branch, we replace ResNet50 [23] with an adaption of AlexNet [24] to encode an image input x_vision into a sequence of embeddings y_vision.…”
Section: A Word-to-image Attention Mechanism (mentioning; confidence: 99%)
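The vision branch described in this excerpt can be pictured with a short sketch. The following PyTorch snippet is a minimal illustration, not the cited implementation: an AlexNet convolutional trunk encodes an image x_vision, and its spatial feature map is flattened into a sequence of embeddings y_vision. The 1x1 projection layer and the 512-dimensional embedding size are assumptions added for the example.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VisionBranch(nn.Module):
    """Sketch of an AlexNet-based image encoder producing a sequence of embeddings.

    Illustrative only: the AlexNet convolutional trunk yields a spatial feature
    map, which is flattened into a sequence y_vision of per-location embeddings.
    The projection layer and embedding dimension are assumptions, not the
    configuration of the cited work.
    """

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Convolutional trunk of AlexNet (weights optional).
        self.trunk = models.alexnet(weights=None).features   # -> (B, 256, H', W')
        # Project each spatial location into the shared embedding space.
        self.proj = nn.Conv2d(256, embed_dim, kernel_size=1)

    def forward(self, x_vision: torch.Tensor) -> torch.Tensor:
        feats = self.proj(self.trunk(x_vision))               # (B, D, H', W')
        y_vision = feats.flatten(2).transpose(1, 2)           # (B, H'*W', D)
        return y_vision


# Usage with a dummy RGB image batch.
if __name__ == "__main__":
    x = torch.randn(2, 3, 224, 224)
    print(VisionBranch()(x).shape)  # torch.Size([2, 36, 512]) for 224x224 inputs
```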
“…Originally [22], additional linear layers were used after the image embeddings, but removing these did not impact performance. For the audio branch, we use the same audio subnetwork as [19] that consists of an acoustic network f_acoustic which extracts speech features from a spoken input x_audio. However, [19] takes an entire spoken utterance as x_audio, whereas we use a single isolated spoken word.…”
Section: A Word-to-image Attention Mechanism (mentioning; confidence: 99%)
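Similarly, the acoustic network f_acoustic mentioned in this excerpt can be sketched as a small convolutional encoder over speech features of a single spoken word. This is a hypothetical illustration under assumed inputs (40-dimensional mel filterbank frames), not the subnetwork used in [19] or the citing work.

```python
import torch
import torch.nn as nn

class AcousticNetwork(nn.Module):
    """Hypothetical acoustic network f_acoustic for the audio branch.

    Maps a spoken-word input x_audio, given here as a mel-spectrogram of shape
    (batch, n_mels, frames), to a sequence of frame-level speech features.
    Layer sizes and input features are illustrative assumptions.
    """

    def __init__(self, n_mels: int = 40, embed_dim: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, embed_dim, kernel_size=3, padding=1),
        )

    def forward(self, x_audio: torch.Tensor) -> torch.Tensor:
        # (B, n_mels, T) -> (B, D, T/2) -> (B, T/2, D)
        return self.encoder(x_audio).transpose(1, 2)


# A single isolated spoken word of roughly 1 s at a 10 ms frame shift (~100 frames).
if __name__ == "__main__":
    x_audio = torch.randn(2, 40, 100)
    print(AcousticNetwork()(x_audio).shape)  # torch.Size([2, 50, 512])
```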