Video content is present in an ever-increasing number of fields, both scientific and commercial. Sports, and soccer in particular, is one of the industries that has invested the most in video analytics, owing to the massive popularity of the game and the emergence of new markets (such as sports betting). Previous state-of-the-art methods for soccer match video summarization rely on handcrafted heuristics to generate summaries and therefore generalize poorly, but these works have nevertheless shown that multiple modalities help detect the best actions of the game. On the other hand, machine learning models with higher generalization potential have entered the field of general-purpose video summarization, offering several deep learning approaches. However, most of them exploit content specificities that are not appropriate for whole-match sports videos. Although video content has for many years been the main source for automating knowledge extraction in soccer, the data that records all the events happening on the field has recently become very important in sports analytics, since this event data provides richer contextual information and requires less processing. Considering that in automatic sports summarization the goal is not only to show the most important actions of the game, but also to reproduce the storytelling of the whole match with as much emotion as that evoked by human editors, we propose a method to generate the summary of a soccer match video exploiting both the audio and the event metadata of the entire match. The results show that our method can detect the actions of the match, identify which of these actions should belong to the summary, and then propose multiple candidate summaries that are similar enough yet sufficiently varied to offer different options to the final editor.
Furthermore, we demonstrate the generalization capability of our method: it transfers knowledge between datasets from different broadcasting companies and different competitions, acquired under different conditions, and corresponding to summaries of different lengths.