Computational paralinguistics aims to infer human emotions, personality traits and behavioural patterns from speech signals. Verbal conflict, in particular, is an important example of human-interaction behaviour whose detection would enable monitoring and feedback in a variety of applications. Most methods for detecting verbal conflict and estimating its intensity apply off-the-shelf classifiers or regressors to generic handcrafted acoustic features. Generating conflict-specific features requires refinement steps and the availability of metadata, such as the number of speakers and the duration of their speech overlap. Moreover, most techniques treat feature extraction and regression as independent modules, which require separate training and parameter tuning. To address these limitations, we propose the first end-to-end convolutional-recurrent neural network architecture that learns conflict-specific features directly from raw speech waveforms, without explicit domain knowledge or metadata. Additionally, to selectively focus the model on portions of speech containing verbal conflict, we include a global attention interface that learns the alignment between layers of the recurrent network. Experimental results on the SSPNet Conflict Corpus show that our end-to-end architecture achieves state-of-the-art performance in terms of Pearson Correlation Coefficient.
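To make the described pipeline concrete, the following is a minimal PyTorch sketch of an end-to-end conflict-intensity regressor: strided 1-D convolutions over the raw waveform, a recurrent layer, and a global attention pooling feeding a scalar regression head. All layer sizes, the single-GRU simplification, and the dot-product attention pooling are illustrative assumptions, not the paper's configuration (the paper's attention aligns layers of the recurrent network, which this sketch does not reproduce).

```python
import torch
import torch.nn as nn

class ConflictNet(nn.Module):
    """Illustrative conv-recurrent sketch with global attention pooling.
    Hyperparameters are arbitrary assumptions, not the paper's values."""

    def __init__(self, hidden=64):
        super().__init__()
        # Strided 1-D convolutions act as a learned front end on raw audio,
        # replacing handcrafted acoustic features.
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=10), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
        )
        self.rnn = nn.GRU(64, hidden, batch_first=True)
        # Learned query vector for global attention over recurrent states.
        self.query = nn.Parameter(torch.randn(hidden))
        self.head = nn.Linear(hidden, 1)

    def forward(self, wave):                    # wave: (batch, samples)
        x = self.frontend(wave.unsqueeze(1))    # (batch, 64, frames)
        h, _ = self.rnn(x.transpose(1, 2))      # (batch, frames, hidden)
        # Attention weights let the model focus on frames that carry
        # conflict cues (e.g., overlapping speech).
        scores = h @ self.query                 # (batch, frames)
        alpha = torch.softmax(scores, dim=1).unsqueeze(-1)
        context = (alpha * h).sum(dim=1)        # (batch, hidden)
        return self.head(context).squeeze(-1)   # conflict-intensity score

# Example: a batch of two 3-second clips at 8 kHz (an arbitrary choice).
model = ConflictNet()
score = model(torch.randn(2, 24000))
print(score.shape)  # torch.Size([2])
```

Because the convolutional front end, the recurrent layer and the attention pooling are composed in a single differentiable graph, the whole model can be trained jointly with one regression loss, avoiding the separate training and tuning of feature-extraction and regression modules noted above.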