Sensory processing is increasingly conceived in a predictive framework in which neurons would constantly process the error signal resulting from the comparison of expected and observed stimuli. Surprisingly, few data exist on how much prediction can actually be computed in real sensory scenes. Here, we focus on the sensory processing of auditory and audiovisual speech. We propose a set of computational models based on artificial neural networks (mixing deep feed-forward and convolutional networks) which are trained to predict future audio observations from 25 ms to 250 ms of past audio or audiovisual observations (i.e. including lip movements). Experiments are conducted on the multispeaker NTCD-TIMIT audiovisual speech database. Predictions are efficient in a short temporal range (25-50 ms), predicting 40 to 60% of the variance of the incoming stimulus, which could potentially save up to 2/3 of the processing power. They then quickly decrease, vanishing after 100 ms. Adding information on the lips slightly improves predictions, with a 5 to 10% increase in explained variance. Interestingly, the visual gain vanishes more slowly, and the gain is maximum for a delay of 75 ms between image and predicted sound.
experimental developments, the predictive brain has been mathematically encapsulated by Friston and colleagues into a powerful framework based on Bayesian modeling [3], associating such concepts as perceptual inference [4], reinforcement learning [5] and optimal control [6]. In this framework, it has been proposed that the minimization of free energy (a concept coming from thermodynamics) could provide a general principle associating perception and action in interaction with the environment in a coherent predictive process [7,8]. A number of recent neurophysiological studies confirm the accuracy of the predictive coding paradigm for analyzing sensory processing in the human brain (e.g. [9]).
Actually, predictive coding is a general methodological paradigm in information processing that consists in analyzing the local regularities in an input data stream in order to extract the predictable part of these input data. The information processing system can then focus on the difference between input data and their prediction. In a very general manner, whatever the processing system, there are two main advantages to processing the difference signal over directly processing the input signal. First, if the prediction is efficient, the difference signal is generally of (much) lower energy than the original signal, which leads to energy consumption saving in subsequent processes and resource saving for representing the signal with a given accuracy (e.g. bitrate saving in an audio or a video coder). In short, this reduces the "cost" of information processing. Second, there is a concentration of novelty / unpredictable information in the difference signal, which is exploitable for, e.g., the detection of new events. Because of these advantages, predi...