2022 7th International Conference on Machine Learning Technologies (ICMLT)
DOI: 10.1145/3529399.3529443
Few-Shot Keyword Spotting With Prototypical Networks

Cited by 16 publications (9 citation statements)
References 7 publications
“…Some works [15,20] approach the problem of KWS as a detection task. [15] designs a model architecture with multi-head attention layers and introduces soft-triple loss, which is a combination of triplet loss and softmax loss for learning feature representations.…”
Section: Related Work
confidence: 99%
“…[15] designs a model architecture with multi-head attention layers and introduces soft-triple loss, which is a combination of triplet loss and softmax loss for learning feature representations. [20] proposes metric learning-based prototypical network that can effectively extract distinctive features to detect user-defined keywords. The method still requires an additional incremental training process to adapt the model to the target user-defined keywords.…”
Section: Related Work
confidence: 99%
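The statement above describes the cited paper's metric learning-based prototypical network for user-defined keyword detection. As a rough illustration of the prototypical-network idea (not the paper's actual architecture or code), each keyword class is represented by the mean of its few enrollment embeddings, and a query is assigned to the nearest prototype; the embeddings and distance metric here are simplified assumptions:

```python
import numpy as np

def prototypes(support_embeddings, support_labels):
    """Compute one prototype per class as the mean of its support embeddings."""
    classes = sorted(set(support_labels))
    labels = np.asarray(support_labels)
    protos = np.stack([support_embeddings[labels == c].mean(axis=0)
                       for c in classes])
    return classes, protos

def classify(query_embedding, classes, protos):
    """Assign the query to the class with the nearest prototype
    (squared Euclidean distance, as in the original prototypical-network paper)."""
    dists = ((protos - query_embedding) ** 2).sum(axis=1)
    return classes[int(np.argmin(dists))]

# Toy 2-D "embeddings" standing in for keyword audio features.
support = np.array([[0.0, 0.0], [0.2, 0.0],   # enrollment samples for "yes"
                    [5.0, 5.0], [5.2, 5.0]])  # enrollment samples for "no"
labels = ["yes", "yes", "no", "no"]
classes, protos = prototypes(support, labels)
print(classify(np.array([0.1, 0.1]), classes, protos))  # → yes
```

In the few-shot KWS setting, the embedding function would be a trained acoustic encoder; the incremental training step mentioned in the statement adapts that encoder to the target keywords, which this sketch omits.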
“…Therefore, reducing the impact of noise on performance is the focus of many speech-related tasks, including KWS. For the most typical speech task, Audio Speech Recognition (ASR), audio-visual fusion has been proven to be a promising technique to tackle the noise problem, since the visual information is not affected by acoustic distortions [13][14][15][16][17][18][19][20][21]. As the most common forms of perception in human communication, hearing and watching have received increasing attention from researchers in the multi-modal field.…”
Section: Introduction
confidence: 99%
“…Although existing predefined KWS models show high detection performance [1,2,3], the necessity of a large dataset containing target keywords and inflexibility of changing target keywords hinder KWS models from expanding to various applications. When it comes to user-defined KWS, users can customize the target keywords with only a few enrollment samples [4,5,6,7] or in the form of string [8,9]. Few-shot KWS (FS-KWS) especially has shown its feasibility through meta learning [4], transfer learning [5], and metric learning [6,7], operating on the few-shot detection scenario.…”
Section: Introduction
confidence: 99%
“…When it comes to user-defined KWS, users can customize the target keywords with only a few enrollment samples [4,5,6,7] or in the form of string [8,9]. Few-shot KWS (FS-KWS) especially has shown its feasibility through meta learning [4], transfer learning [5], and metric learning [6,7], operating on the few-shot detection scenario. These approaches typically require learning from a large corpus with lots of different keywords to secure generalization on unseen keywords with few samples.…”
Section: Introduction
confidence: 99%