English teaching materials are a critical vehicle for instruction, and well-designed materials foster positive learning habits and interest among students. This study applies an ecological philosophy perspective and multimodal discourse analysis to examine modal shifts in college English textbooks. A BiFPN network model captures image features in the materials, the TF-IDF method extracts key terms from the textbook text, and a CNN-GRU model classifies those terms. Formulas from text readability theory are then used to evaluate the difficulty of the textbooks. The analysis covers the “New Vision College English Textbook” series, volumes Compulsory 1 through Compulsory 4, and examines the semantic relationships between text and graphics, chapter-specific reading difficulty, and overall text readability indices. The findings indicate that graphic-text equality relations account for an average of 58.30% of relations, and that images depicting detailed relationships occur most often, 217 in total. The Grade Level index for Compulsory 4 reaches 1.61, signifying high complexity, whereas the Flesch Reading Ease (FRE) score for Compulsory 1 peaks at 75.42, suggesting easier comprehension. In contrast, Compulsory 2 and Compulsory 4 exhibit lower readability scores. Through multimodal discourse analysis, the study delineates the varying difficulty levels across college English textbooks and advocates a graded approach to textbook development that aligns with students’ evolving competencies. Such an approach is expected to improve student engagement and support more effective learning.
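
As an illustration of the readability indices named above, the following minimal sketch computes the standard Flesch Reading Ease score and the Flesch-Kincaid Grade Level for a passage. It rests on assumptions not stated in the abstract: that the "Grade Level index" refers to the Flesch-Kincaid formula, that the study uses the published coefficients unchanged, and that syllables can be approximated by counting vowel groups. The names `count_syllables` and `readability` are hypothetical, and the study's own scores may be computed or scaled differently.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups (a heuristic, not a dictionary lookup)."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing silent "e" as non-syllabic, except in "-le" endings.
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def readability(text: str) -> dict:
    """Return Flesch Reading Ease and Flesch-Kincaid Grade Level for a passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Standard published coefficients (assumed to match the study's formulas).
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return {"fre": fre, "grade_level": grade}


if __name__ == "__main__":
    easy = readability("The cat sat. The dog ran.")
    hard = readability(
        "Multimodal discourse analysis evaluates institutional communication."
    )
    print(easy)
    print(hard)
```

A higher FRE score indicates easier text and a higher grade level indicates harder text, so the two indices move in opposite directions for the same passage. Scores computed this way can differ slightly from those produced by other tools because of the syllable heuristic.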