Generative Adversarial Phonology: Modeling Unsupervised Phonetic and Phonological Learning With Neural Networks

Beguš, Gašper

doi:10.3389/frai.2020.00044

Cited by 25 publications

(113 citation statements)

References 98 publications

Supporting

Mentioning

111

Contrasting

“…The result of the training in the architecture outlined in Figure 1 is a Generator network that outputs raw acoustic data that resemble real data from the TIMIT database, such that the Discriminator becomes unsuccessful in assigning "realness" scores (Brownlee, 2019). Crucially, unlike in other architectures, the Generator's outputs are never a full replication of the input: the Generator outputs innovative data that resemble input data, but also violate many of the distributions in a linguistically interpretable manner (Beguš, 2020). In addition to outputting innovative data that resemble speech in the input, the Generator also learns to associate each lexical item with a unique code in its latent space.…”

Section: Model (mentioning)

confidence: 99%

“…Language acquisition has, to the author's knowledge, not been modeled with the GAN architecture prior to Beguš (2020), despite several aspects of the architecture that can be paralleled to language acquisition. Beguš (2020) proposes that phonetic and phonological learning can simultaneously be modeled as a dependency between latent space and output data in Deep Convolutional Generative Adversarial Networks (Goodfellow et al, 2014;Radford et al, 2015;Donahue et al, 2019). Unlike in the autoencoder architectures, the outputs of the GAN models are innovative, not directly connected to the inputs, and violate training data distributions in highly informative ways.…”

Section: Introduction (mentioning)

confidence: 99%

“…Despite their several advantages, to our knowledge, lexical learning has not yet been modeled with unsupervised generative deep convolutional neural network models. In this paper, we follow the proposal in Beguš (2020) that phonetic and phonological acquisition can be modeled as a dependency between latent space and generated data in the GAN architecture and add lexical learning component to the model. We modify the WaveGAN architecture and add the InfoGAN's Q-network (based partially on implementation in Rodionov 2018) to computationally simulate lexical learning from raw acoustic data.…”

Section: Introduction (mentioning)

confidence: 99%

“…One of the advantages of the proposal in Beguš (2020) is that the model learns phonological alternations, i.e. context-dependent changes in realization of speech sounds, simultaneously with learning of acoustic properties of human speech.…”

Section: Introduction (mentioning)

confidence: 99%

“…The latent variables that correspond to features can be actively manipulated to generate data with or without some phonetic/phonological properties. These representations, however, are limited to the phonetic/phonological level exclusively in Beguš (2020) and contain no lexical information.…”

Section: Introduction (mentioning)

confidence: 99%

CiwGAN and fiwGAN: Encoding information in acoustic data to model lexical learning with Generative Adversarial Networks

Beguš

2021

Neural Networks

Self Cite

How can deep neural networks encode information that corresponds to words in human speech into raw acoustic data? This paper proposes two neural network architectures for modeling unsupervised lexical learning from raw acoustic inputs, ciwGAN (Categorical InfoWaveGAN) and fiwGAN (Featural InfoWaveGAN), that combine a Deep Convolutional GAN architecture for audio data (WaveGAN; Donahue et al. 2019) with an information-theoretic extension of GAN, InfoGAN (Chen et al., 2016), and propose a new latent space structure that can model featural learning simultaneously with a higher level classification. In addition to the Generator and the Discriminator networks, the architectures introduce a network that learns to retrieve latent codes from generated audio outputs. Lexical learning is thus modeled as emergent from an architecture that forces a deep neural network to output data such that unique information is retrievable from its acoustic outputs. The networks trained on lexical items from TIMIT learn to encode unique information corresponding to lexical items in the form of categorical variables in their latent space. By manipulating these variables, the network outputs specific lexical items. The network occasionally outputs innovative lexical items that violate training data, but are linguistically interpretable and highly informative for cognitive modeling and neural network interpretability. Innovative outputs suggest that phonetic and phonological representations learned by the network can be productively recombined and directly paralleled to productivity in human speech: a fiwGAN network trained on "suit" and "dark" outputs innovative "start", even though it never saw "start" or even a [st] sequence in the training data. We also argue that setting latent featural codes to values well beyond training range results in almost categorical generation of prototypical lexical items and reveals underlying values of each latent code.
Probing deep neural networks trained on well-understood dependencies in speech bears implications for latent space interpretability, understanding how deep neural networks learn meaningful representations, as well as a potential for unsupervised text-to-speech generation in the GAN framework.
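
The latent space structure described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper's code: the function names, the `code_value` parameter, and the latent size of 100 are assumptions made for illustration, as is the choice to draw the non-code latent variables uniformly from [-1, 1]. The sketch only builds the Generator's input vector; it shows a one-hot categorical code (ciwGAN-style) next to a binary featural code (fiwGAN-style), and how a code can be scaled beyond the range seen in training.

```python
import random

LATENT_SIZE = 100  # assumed total number of latent variables


def ciwgan_latent(class_index, n_classes, code_value=1.0, rng=random):
    """Categorical code: one latent variable per class, one-hot."""
    code = [code_value if i == class_index else 0.0 for i in range(n_classes)]
    # Remaining variables are random noise (assumed uniform in [-1, 1]).
    noise = [rng.uniform(-1.0, 1.0) for _ in range(LATENT_SIZE - n_classes)]
    return code + noise


def fiwgan_latent(class_index, n_classes, code_value=1.0, rng=random):
    """Featural code: classes are spelled out as binary feature bundles."""
    # Number of binary features needed to distinguish n_classes classes.
    n_bits = max(1, (n_classes - 1).bit_length())
    bits = [(class_index >> b) & 1 for b in reversed(range(n_bits))]
    code = [code_value * bit for bit in bits]
    noise = [rng.uniform(-1.0, 1.0) for _ in range(LATENT_SIZE - n_bits)]
    return code + noise


# Class 2 of 5 as a one-hot code, class 5 of 8 as three binary features.
z_categorical = ciwgan_latent(2, 5)
z_featural = fiwgan_latent(5, 8)

# Same featural code, but pushed well beyond the training value of 1.
z_extreme = fiwgan_latent(3, 4, code_value=5.0)
```

With binary features, eight classes need three code variables instead of eight, and classes that share a feature value share part of their code, which is what lets featural learning and classification be modeled in the same latent space.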

Section: Model (mentioning)

confidence: 99%