Age estimation and gender detection are essential tasks in speech analysis and understanding, with applications in various domains. Traditional approaches rely primarily on acoustic features extracted from speech signals, which may be limited by environmental noise and recording conditions. To address these challenges, we propose an improved approach that leverages multimodal speech data, combining audio, visual, and textual features for age estimation and gender detection. Our methodology includes a comprehensive analysis of multimodal features, a novel fusion strategy for integrating these features, and an evaluation on a large-scale multimodal speech dataset. Experimental results demonstrate the effectiveness and superiority of our approach compared with state-of-the-art methods in terms of accuracy, robustness, and generalization. This work contributes to the advancement of speech analysis techniques and enhances the performance of speech-based applications. This study applies four methods: Decision Trees (DT), Random Forests (RF), Convolutional Neural Networks (CNN), and CNN with cross-validation. The accuracies of the DT, RF, CNN, and CNN with cross-validation algorithms are 0.9317, 0.8341, 0.8000, and 0.8537, respectively, on the male dataset; 0.8563, 0.6571, 0.7433, and 0.7682, respectively, on the female dataset; and 0.8563, 0.6839, 0.7241, and 0.7452, respectively, on the combined dataset.
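The evaluation protocol described above can be sketched as follows. This is a minimal illustration only: the feature matrix and labels here are synthetic stand-ins (the paper's actual multimodal features are not reproduced), and the DT/RF classifiers and 5-fold cross-validation come from scikit-learn rather than the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
# Hypothetical stand-in for the multimodal feature matrix:
# 500 samples x 20 features, with binary labels (e.g., gender).
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    holdout = model.score(X_te, y_te)               # accuracy on one held-out split
    cv = cross_val_score(model, X, y, cv=5).mean()  # mean 5-fold cross-validation accuracy
    print(f"{name}: holdout={holdout:.4f}  5-fold CV={cv:.4f}")
```

Reporting both a single hold-out accuracy and a cross-validated mean, as above, mirrors the paper's comparison of CNN against CNN with cross-validation: the cross-validated figure is typically the more stable estimate of generalization.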