Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)
DOI: 10.18653/v1/k17-1037
Encoding of phonology in a recurrent neural model of grounded speech

Abstract: We study the representation and encoding of phonemes in a recurrent neural network model of grounded speech. We use a model which processes images and their spoken descriptions, and projects the visual and auditory representations into the same semantic space. We perform a number of analyses on how information about individual phonemes is encoded in the MFCC features extracted from the speech signal, and the activations of the layers of the model. Via experiments with phoneme decoding and phoneme discriminatio…
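The phoneme-decoding analysis mentioned in the abstract can be sketched as training a simple supervised classifier on frame-level features (MFCCs or layer activations) with phoneme labels, then measuring held-out accuracy. The sketch below is illustrative only: the data is synthetic and the nearest-class-mean decoder is a stand-in for whatever classifier the paper actually uses.

```python
import numpy as np

# Hypothetical setup: 13-dimensional MFCC-like frame vectors for three
# phoneme classes, drawn from distinct Gaussians (stand-ins for real
# speech frames aligned to phoneme labels).
rng = np.random.default_rng(0)
n_per_class, dim = 200, 13
means = rng.normal(0.0, 3.0, size=(3, dim))          # one mean per phoneme
X = np.concatenate([means[c] + rng.normal(size=(n_per_class, dim))
                    for c in range(3)])
y = np.repeat(np.arange(3), n_per_class)

# Shuffle and split into train/test frames.
idx = rng.permutation(len(X))
X, y = X[idx], y[idx]
split = len(X) // 2
Xtr, ytr, Xte, yte = X[:split], y[:split], X[split:], y[split:]

# Nearest-class-mean decoder: assign each test frame to the phoneme
# whose training-set mean is closest in Euclidean distance.
class_means = np.stack([Xtr[ytr == c].mean(axis=0) for c in range(3)])
dists = np.linalg.norm(Xte[:, None, :] - class_means[None, :, :], axis=2)
pred = dists.argmin(axis=1)

accuracy = (pred == yte).mean()
print(f"phoneme decoding accuracy: {accuracy:.2f}")
```

Decoding accuracy well above chance (1/3 here) is then read as evidence that the features encode phoneme identity; running the same probe on each layer's activations shows where in the network that information is preserved.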

Cited by 46 publications (72 citation statements). References 31 publications.
“…Various phonetic and speaker features were investigated in speaker embeddings [9,10], and properties like style and accent were analyzed in a convolutional ASR performance prediction model [11]. Another line of work is concerned with developing and analyzing joint audio-visual models [12,13,14,15].…”
Section: Analysis Of Representations
confidence: 99%
“…This paper follows the same general theme, but with a different focus. While [21] and [22] examined the utility of the intermediate representations of visually-grounded speech models for tasks such as speaker, phoneme, and word discrimination, they did not investigate whether and how discrete sub-word units may be emerging within the models. Visually-grounded, self-supervised models such as DAVEnet make relatively few assumptions about how sub-word units should be represented.…”
Section: Introduction and Prior Work
confidence: 99%
“…[4] found that final layers tend to encode semantic information whereas lower layers tend to encode form-related information. [7] showed that a non-trivial amount of phonological information is preserved in higher layers, and suggested that the attention layer focuses on semantic information.…”
Section: Introduction
confidence: 99%
“…Such computational models can be used to emulate child language acquisition and could shed light on the inner cognitive processes at work in humans, as suggested by [15]. (This work was supported by grants from NeuroCoG IDEX UGA as part of the "Investissements d'avenir" program, ANR-15-IDEX-02.) While [11,7,4] focused on analyzing speech representations learnt by speech-image neural models from a phonological and semantic point of view, the present work focuses on lexical acquisition and the way speech utterances are segmented into lexical units and processed by a computational model of visually grounded speech. We analyze a key component of the neural model, the attention mechanism, and we observe its behaviour and draw parallels between artificial neural attention and human attention.…”
Section: Introduction
confidence: 99%
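The attention mechanism analyzed in the statement above can be illustrated with a minimal sketch: dot-product attention pools a sequence of recurrent states into one utterance embedding, and the resulting weights show which time steps (e.g., word-like segments) the model attends to. All names, dimensions, and values below are hypothetical, not taken from the cited models.

```python
import numpy as np

def attention_pool(states, query):
    """Dot-product attention over time: states is (T, d), query is (d,).
    Returns the pooled (d,) vector and the (T,) attention weights."""
    scores = states @ query                          # similarity per time step
    scores -= scores.max()                           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time
    pooled = weights @ states                        # weighted sum of states
    return pooled, weights

# Hypothetical recurrent states for a 6-step utterance, 4 hidden units,
# with a learned query vector standing in for the model's attention params.
rng = np.random.default_rng(1)
states = rng.normal(size=(6, 4))
query = rng.normal(size=4)

pooled, weights = attention_pool(states, query)
print("attention weights:", np.round(weights, 3))
print("most attended step:", int(weights.argmax()))
```

Inspecting where `weights` peaks relative to a time-aligned transcription is one simple way to compare artificial attention with human attention patterns, in the spirit of the analysis described above.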