This paper introduces active listening as a unified framework for synthesising and recognising speech. The notion of active listening inherits from active inference, which considers perception and action under one universal imperative: to maximise the evidence for our (generative) models of the world. First, we describe a generative model of spoken words that simulates (i) how discrete lexical, prosodic, and speaker attributes give rise to continuous acoustic signals; and conversely (ii) how continuous acoustic signals are recognised as words. The 'active' aspect involves (covertly) segmenting spoken sentences and borrows ideas from active vision. It casts speech segmentation as the selection of internal actions, corresponding to the placement of word boundaries. Practically, word boundaries are selected that maximise the evidence for an internal model of how individual words are generated. We establish face validity by simulating speech recognition and showing how the inferred content of a sentence depends on prior beliefs and background noise. Finally, we consider predictive validity by associating neuronal or physiological responses, such as the mismatch negativity and P300, with belief updating under active listening, which is greatest in the absence of accurate prior beliefs about what will be heard next.

Key words: speech recognition, voice, active inference, active listening, segmentation, variational Bayes, audition.

…'invariant' (Liberman, Cooper et al. 1967): words are never heard out of a particular context. When considering how words are generated, there is wide variability in the pronunciation of the same word among different speakers (Hillenbrand, Getty et al. 1995; Remez 2010), and even when spoken by the same speaker, pronunciation depends on prosody (Bänziger and Scherer 2005).
From the perspective of recognition, two signals that are acoustically identical can be perceived as different words or phonemes by human listeners, depending on their context: for example, the preceding words or phonemes (Mann 1980; Miller, Green et al. 1984), preceding spectral content (Holt, Lotto et al. 2000), or the duration of a vowel that follows a consonant (Miller and Liberman 1979). The current approach considers the processes …

The idea that speech segmentation and lexical inference operate together did not figure in early accounts of speech recognition. For example, the Fuzzy Logic Model of Perception (FLMP) (Oden and Massaro 1978; Massaro 1987; Massaro 1989) matches acoustic features with prototype representations to recognise phonemes, even when considered in the context of words and sentences. Similarly, the Neighbourhood Activation Model (NAM) (Luce 1986; Luce and Pisoni 1998) considers individual word recognition; it accounts for effects of word frequency, but does not address the segmentation problem. Later connectionist accounts, such as TRACE (McClelland and Elman 1986), assumed that competition between lexical nodes …
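The abstract's central mechanism, casting segmentation as the selection of internal actions (boundary placements) that maximise the evidence for a model of how individual words are generated, can be illustrated with a minimal sketch. This is a toy illustration only, not the paper's variational scheme: the discrete-symbol "lexicon", the match/mismatch likelihood, and exhaustive enumeration of boundaries are all simplifying assumptions introduced here.

```python
import itertools
import math

# Toy "lexicon": each word is a sequence of discrete acoustic symbols.
# (A hypothetical stand-in for the paper's generative model of spoken words.)
LEXICON = {"no": ["n", "o"], "on": ["o", "n"], "noon": ["n", "u", "n"]}

def log_evidence(segment):
    """Toy log evidence for one segment: log-probability of the
    best-matching lexical template under a symbol match/mismatch model."""
    best = -math.inf
    for template in LEXICON.values():
        if len(template) != len(segment):
            continue  # this word cannot generate a segment of this length
        lp = sum(math.log(0.9) if a == b else math.log(0.05)
                 for a, b in zip(template, segment))
        best = max(best, lp)
    return best

def segmentations(signal):
    """Enumerate every placement of word boundaries (the internal actions)."""
    n = len(signal)
    for k in range(n):  # choose k of the n-1 gaps to hold a boundary
        for cuts in itertools.combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            yield [signal[i:j] for i, j in zip(bounds, bounds[1:])]

def active_segmentation(signal):
    """Select the boundary placement maximising total log evidence."""
    return max(segmentations(signal),
               key=lambda segs: sum(log_evidence(s) for s in segs))

print(active_segmentation(["n", "o", "o", "n"]))  # → [['n', 'o'], ['o', 'n']]
```

Here the listener never observes boundaries directly; it entertains every placement and retains the one under which its word model best explains the signal, which is the sense in which segmentation is an "active", covert choice rather than a property of the acoustics.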