Understanding how external stimuli are transformed into meaningful impressions that guide human actions has been an enduring challenge. Automatic models of the perception of multimedia productions stand out as a path towards characterising our interaction with a type of content that now floods social media platforms and people's digital time. Aided by well-established theories of perception, we identify memorability, attention, judgement and emotional state as cognitive and affective variables that provide complementary views on the comprehension of our perception of multimedia content.

Intrinsic media memorability is defined as an inherent property of the visual features of a video that determines the percentage of people who remember watching the clip on a second viewing. Our approaches are based on the extraction of video-level, topic-oriented features using pre-trained Transformers. We find that linear models trained on these features as inputs can reach prediction rates comparable to those of other state-of-the-art models across several datasets.

Secondly, we characterise group-level attention to short movies using electrodermal activity recordings as ground truth. We develop a binary classification system whose predictions, based on a semantically driven representation of the acoustic signal of the videos, denote whether group-level attention increases or diminishes.

Next, we address the judgement of images according to a fitness criterion for tourism attractiveness. Our proposal builds upon a Mixture-of-Experts system that leverages information from geolocation tags, which implicitly point to specific semantics and contents, seeking to incorporate into the model design the knowledge about the role of context during the annotation process.
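The memorability approach above can be illustrated with a minimal sketch: a regularised linear regressor trained on video-level embeddings. The synthetic data, feature dimensionality and hyperparameters here are purely illustrative stand-ins, not the thesis's actual features or configuration.

```python
# Sketch: linear (ridge) regression on video-level Transformer-style
# embeddings to predict memorability scores. All data here is synthetic;
# in practice the features would come from a pre-trained Transformer.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for 768-d video-level embeddings and memorability scores in (0, 1)
# (the fraction of viewers who remember the clip on a second viewing).
X = rng.normal(size=(500, 768))
w_true = rng.normal(size=768)
y = 1 / (1 + np.exp(-X @ w_true / 30))  # synthetic scores

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = Ridge(alpha=10.0).fit(X_tr, y_tr)
preds = model.predict(X_te)
print("held-out R^2:", model.score(X_te, y_te))
```

A linear head on frozen pre-trained features keeps the model small and easy to fit on the modestly sized memorability datasets, which is consistent with the competitive prediction rates reported above.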
In order to predict the emotions elicited by historical artworks, we employ vision-language cross-modal models that seek to exploit the subjective and figurative nature of the artistic domain. We introduce a methodology to adapt systems pre-trained on realistic content to the art domain, finding that following it leads to significant improvements (up to 27%) in predicting emotions.

Finally, given the difficulty of understanding the rationale behind "black-box" visual prediction models, our last contribution targets the enhancement of the interpretability of these systems. We explore how to improve the explanations provided by LIME, a popular surrogate-based, post-hoc explanatory technique, by indirectly adding information about the statistics of the data distribution on which the "black-box" model is trained.

We believe this thesis contributes to comprehending the human perception of multimedia content by addressing, from a computational perspective, several cognitive and affective variables that constitute it. In particular, our approaches seek to combine information from multiple modalities, presenting models that extract patterns from low-level features of the inputs and relate them to human actions and responses spanning variou...
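For context, the core LIME mechanism that the last contribution builds on can be sketched as follows: perturb the instance to be explained, weight the perturbations by their proximity to it, and fit a weighted linear surrogate whose coefficients act as the local explanation. The black-box function, kernel width and sampling scale below are illustrative assumptions, not the thesis's configuration.

```python
# Minimal LIME-style local surrogate for a "black-box" predictor.
# The thesis's refinement would act where perturbations Z are drawn,
# injecting statistics of the model's training distribution; here we
# use a plain isotropic Gaussian for simplicity.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def black_box(X):
    """Stand-in for a trained model's prediction function."""
    return 1 / (1 + np.exp(-(2 * X[:, 0] - X[:, 1])))

x0 = np.array([0.5, -0.3, 0.8])                    # instance to explain
Z = x0 + rng.normal(scale=0.5, size=(1000, 3))     # local perturbations
# Exponential kernel: nearby perturbations count more in the surrogate fit.
weights = np.exp(-np.sum((Z - x0) ** 2, axis=1) / (2 * 0.75 ** 2))

surrogate = Ridge(alpha=1.0).fit(Z, black_box(Z), sample_weight=weights)
explanation = surrogate.coef_                      # per-feature local importance
```

Because the surrogate only sees perturbations, the realism of the sampling distribution directly conditions the quality of the explanation, which motivates incorporating the training-data statistics as described above.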