Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d17-1096

Sound-Word2Vec: Learning Word Representations Grounded in Sounds

Abstract: To be able to interact better with humans, it is crucial for machines to understand sound, a primary modality of human perception. Previous works have used sound to learn embeddings for improved generic semantic similarity assessment. In this work, we treat sound as a first-class citizen, studying downstream textual tasks which require aural grounding. To this end, we propose sound-word2vec, a new embedding scheme that learns specialized word embeddings grounded in sounds. For example, we learn that two seemi…

Cited by 19 publications (13 citation statements)
References 19 publications
“…In natural language processing, images have also been used to capture aspects of meaning (semantics) of written language; see [31,32] for reviews. Other studies have considered multimodal modelling of sounds (not speech) with text and images [33][34][35], and phonemes with images [36]. classifier (disregarding the order and quantity of words).…”
Section: Related Work
confidence: 99%
“…In their work, a 'bag-of-audio-words' approach is used, in which auditory grounding is achieved by dividing sound files into frames, clustering these frames as "audio words" and subsequently quantizing them into representations by comparing frame descriptors with the centroids. Recently, Vijayakumar et al [30] proposed an embedding scheme that learns specialized word embeddings grounded in sounds by using a variety of audio features. These techniques were found to work well for modeling human similarity and relatedness judgments and related experiments.…”
Section: Auditory Representations
confidence: 99%
“…Auditory grounding was achieved by dividing sound files into frames, clustering these as "audio words" and subsequently quantizing them into representations by comparing frame descriptors with the centroids. More recently, Vijayakumar, Vedantam, and Parikh (2017) proposed an embedding scheme that learns specialized word embeddings grounded in sounds, using a variety of audio features. These techniques were found to work well for modeling human similarity and relatedness judgments and related experiments.…”
Section: Auditory Representations
confidence: 99%
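The 'bag-of-audio-words' quantization described in the snippets above — assign each frame descriptor to its nearest cluster centroid, then pool the assignments into a histogram — can be sketched roughly as follows. This is an illustrative sketch only: the descriptor dimensionality, the vocabulary size `K`, and the random data stand in for real audio features and pre-trained centroids from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Pretend each sound file yields 13-dim frame descriptors
#    (e.g. MFCC-like features); values here are random placeholders.
frames = rng.normal(size=(200, 13))

# 2. "Audio words": K centroids that would normally come from
#    clustering a large corpus of frames offline (e.g. k-means).
K = 8
centroids = rng.normal(size=(K, 13))

# 3. Quantize: assign each frame to its nearest centroid
#    by Euclidean distance between descriptor and centroid.
dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=2)
assignments = dists.argmin(axis=1)          # shape (200,)

# 4. Pool assignments into a normalized bag-of-audio-words histogram,
#    which serves as the fixed-length representation of the sound file.
bow = np.bincount(assignments, minlength=K).astype(float)
bow /= bow.sum()

print(bow.shape)  # (8,)
```

The resulting `bow` vector discards frame order entirely, mirroring the "disregarding the order and quantity of words" framing in the Related Work snippet above.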