Multimodal sentiment analysis (MSA) is one of the core research topics of natural language processing (NLP). MSA has become a challenge for scholars and is equally complicated for an appliance to comprehend. One study that supports MSdifficulties is the MSA, which is learning opinions, emotions, and attitudes in an audio-visual format. In order words, using such diverse modalities to obtain opinions and identify emotions is necessary. Such utilization can be achieved via modality datafusion;such as feature fusion. In handling the data fusion of such diverse modalities while obtaining high performance, a typical machine learning algorithm is Deep Learning (DL), particularly the Convolutional Neutral Network (CNN), which has the capacity to handle tasks of great intricacy and difficulty. In this paper, we present a CNN architecture with an integrated layer via fuzzy methodologies for MSA, a task yet to be explored in improving the accuracy performance of CNN for diverse inputs. Experiments conducted on a benchmark multimodal dataset, MOSI, obtaining 37.5% and 81% on seven (7) class and binary classification respectively, reveals an improved accuracy performance compared with the typical CNN, which acquired 28.9% and 78%, respectively.