Multimodal emotion recognition implies the use of different resources and techniques for identifying and recognizing human emotions. A variety of data sources such as faces, speeches, voices, texts and others have to be processed simultaneously for this recognition task. However, most of the techniques, which are based mainly on Deep Learning, are trained using datasets designed and built in controlled conditions, making their applicability in real contexts with real conditions more difficult. For this reason, the aim of this work is to assess a set of in-the-wild datasets to show their strengths and weaknesses for multimodal emotion recognition. Four in-the-wild datasets are evaluated: AFEW, SFEW, MELD and AffWild2. A multimodal architecture previously designed is used to perform the evaluation and classical metrics such as accuracy and F1-Score are used to measure performance in training and to validate quantitative results. However, strengths and weaknesses of these datasets for various uses indicate that by themselves they are not appropriate for multimodal recognition due to their original purpose, e.g., face or speech recognition. Therefore, we recommend a combination of multiple datasets in order to obtain better results when new samples are being processed and a good balance in the number of samples by class.