Alzheimer’s Disease is a neurodegenerative disorder, and language impairment is one of its most common and prominent early symptoms. Early diagnosis of Alzheimer’s Disease from speech and text information is therefore of significant importance. However, multimodal data are often complex and inconsistent, which leads to inadequate feature extraction. To address this problem, we propose a model for early diagnosis of Alzheimer’s Disease based on multimodal attention (EDAMM). Specifically, we first evaluate and select three optimal feature extraction methods, Wav2Vec2.0, TF-IDF, and Word2Vec, to extract acoustic and linguistic features. Next, by leveraging self-attention and cross-modal attention mechanisms, we generate fused features that capture inter-modal correlations. Finally, we concatenate the multimodal features into a composite feature vector and employ a Neural Network (NN) classifier to diagnose Alzheimer’s Disease. To evaluate EDAMM, we conduct experiments on two public datasets, i.e., NCMMSC2021 and ADReSSo. The results show that EDAMM outperforms state-of-the-art baseline approaches on both datasets.
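For illustration, the sketch below shows one plausible way to realize the attention-based fusion described above in PyTorch: intra-modal self-attention on each modality, cross-modal attention in both directions, and a small NN classifier over the concatenated features. The class name `CrossModalFusion`, all dimensions, the mean-pooling step, and the layer counts are assumptions for exposition, not the authors' exact configuration.

```python
# A minimal sketch of the fusion pipeline (assumed configuration, PyTorch).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, acoustic_dim=768, text_dim=300, d_model=256,
                 n_heads=4, n_classes=2):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.acoustic_proj = nn.Linear(acoustic_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        # Intra-modal self-attention.
        self.acoustic_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-modal attention: each modality queries the other.
        self.a2t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.t2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # NN classifier over the concatenated composite feature vector.
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, n_classes),
        )

    def forward(self, acoustic, text):
        # acoustic: (B, Ta, acoustic_dim), e.g. Wav2Vec2.0 frame features
        # text:     (B, Tt, text_dim),     e.g. Word2Vec token embeddings
        a = self.acoustic_proj(acoustic)
        t = self.text_proj(text)
        a, _ = self.acoustic_self(a, a, a)  # self-attention within each modality
        t, _ = self.text_self(t, t, t)
        # Cross-modal attention: acoustic queries text, and vice versa.
        a_fused, _ = self.a2t(a, t, t)
        t_fused, _ = self.t2a(t, a, a)
        # Pool over time and concatenate into a composite feature vector.
        fused = torch.cat([a_fused.mean(dim=1), t_fused.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Example: batch of 2 samples, 50 acoustic frames, 30 text tokens.
model = CrossModalFusion()
logits = model(torch.randn(2, 50, 768), torch.randn(2, 30, 300))
print(logits.shape)  # torch.Size([2, 2])
```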