Speech-based continuous emotion prediction systems have predominantly relied on complex non-linear back-ends, with increasing attention on long short-term memory (LSTM) recurrent neural networks. While this has led to accurate predictions, complex models may suffer from issues with interpretability, model selection and overfitting. In this paper, we demonstrate that a linear model can capture most of the relationship between speech features and emotion labels in the continuous arousal-valence space. Specifically, an autoregressive exogenous (ARX) model is shown to be an effective back-end. This approach is validated on three commonly used databases, namely RECOLA, SEWA and USC CreativeIT, and shown to be comparable in performance to state-of-the-art LSTM systems. More importantly, it allows the use of well-established linear system theory to aid model interpretability.
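For context, a minimal sketch of the standard ARX formulation, assuming the emotion label $y$ (e.g. arousal or valence) is regressed on its own past values and lagged speech features $u$; the model orders $n_a$, $n_b$ here are illustrative placeholders, not the values used in the paper:

```latex
% Standard ARX(n_a, n_b) form: the autoregressive part (past outputs)
% plus lagged exogenous inputs (speech features) and a noise term e(t).
\begin{equation}
  y(t) + \sum_{i=1}^{n_a} a_i \, y(t-i)
       = \sum_{j=1}^{n_b} b_j \, u(t-j) + e(t)
\end{equation}
```

Because the model is linear in the coefficients $a_i$ and $b_j$, it can be estimated by least squares and analysed with standard linear system tools (e.g. poles, frequency response), which is the interpretability advantage highlighted above.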