With the ease of obtaining portable devices such as cameras and smartphones, recording first-person videos has become a common habit. These videos are usually very long and tiring to watch, requiring manual editing. Consequently, fast-forward methods have emerged to reduce the length of these videos, maximizing visual quality and producing an accelerated video that is pleasant to watch without losing the relevant information. Despite recent progress, fast-forward methods do not consider inserting background music into the videos. Background music can make accelerated videos even more pleasant, since the user can watch the accelerated video combined with their music of interest. This thesis presents a new methodology that creates accelerated videos and automatically inserts background music, combining the emotions induced by the visual and acoustic modalities. Our method uses artificial neural networks to recognize the emotions induced by the video and the music over time, creating emotion curves for both, represented in Russell's model, an emotion representation model widely used in psychology. Our method also includes an optimization algorithm that computes the similarities between video frames and music segments, builds a dynamic cost matrix, and finds the optimal path that aligns the video's emotion curve with the music's emotion curve, while also preserving the visual quality and temporal continuity of the accelerated video. We evaluated our method on a set of videos and songs with varied content and styles, comparing it quantitatively and qualitatively with other fast-forward methods from the literature. The results show that our method achieves the best performance in maximizing the similarity of emotions, increasing it significantly in most cases, while also maintaining the visual quality of the accelerated videos compared to other methods in the literature.
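
To illustrate the alignment step described above, the following is a minimal sketch, assuming the emotion curves are sequences of (valence, arousal) points in Russell's model and that the optimal path over the cost matrix is found with a dynamic-programming procedure in the style of dynamic time warping. The function and variable names are hypothetical illustrations, not the exact formulation used in the thesis.

```python
# Hypothetical sketch: aligning a video emotion curve with a music emotion
# curve via a dynamic-programming cost matrix (DTW-style). The cost definition
# and names are illustrative assumptions, not the thesis method itself.
import numpy as np

def emotion_distance(v, m):
    """Euclidean distance between two (valence, arousal) points in Russell's model."""
    return np.linalg.norm(np.asarray(v) - np.asarray(m))

def align_emotion_curves(video_curve, music_curve):
    """Return the accumulated cost matrix and the optimal alignment path."""
    n, k = len(video_curve), len(music_curve)
    cost = np.full((n + 1, k + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            d = emotion_distance(video_curve[i - 1], music_curve[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1],  # advance both together
                                 cost[i - 1, j],      # advance only the video
                                 cost[i, j - 1])      # advance only the music
    # Backtrack from the end of both curves to recover the optimal path.
    path, i, j = [], n, k
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[1:, 1:], path[::-1]

# Example usage with short synthetic (valence, arousal) curves.
video_curve = [(0.2, 0.5), (0.4, 0.6), (0.8, 0.3)]
music_curve = [(0.1, 0.4), (0.5, 0.5), (0.7, 0.2), (0.9, 0.1)]
_, path = align_emotion_curves(video_curve, music_curve)
print(path)
```

In the full method, such a cost would additionally account for visual quality and temporal continuity of the accelerated video, rather than emotional similarity alone.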