With the development of the Internet and new media, multimedia and audio learning resources have come into wide use in teaching and learning, which makes their classification and retrieval an urgent problem. This study investigated the classification system, construction, and retrieval of multimedia audio learning resources, with the aim of addressing several shortcomings of existing methods: manual labeling is time-consuming, labels are inconsistent, and traditional retrieval methods neglect the correlation between audio and metadata. First, a classification model for audio learning resources was constructed. It processed single-modality data from the audio and its annotated text and abstracted each modality into high-level feature vectors. The complementarity between modalities was then used to fuse the abstract features or the decision-making results and to eliminate redundant information across modalities, thereby learning a better feature representation of multimedia audio learning resources. Second, a retrieval method based on self-similarity matrix filtering was proposed to improve the accuracy and efficiency of retrieval. This study provides a new theoretical and practical perspective on classifying and retrieving multimedia audio learning resources.
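As a purely illustrative aid, the two building blocks named above, feature-level fusion and a self-similarity matrix, can be sketched as follows. The abstract does not specify how the features are fused or how the matrix is built or filtered, so everything here is an assumption: the function names `l2_normalize`, `fuse_features`, and `self_similarity_matrix` are hypothetical, fusion is shown as simple concatenation of normalized vectors, and similarity is taken to be cosine similarity between feature frames.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(audio_vec, text_vec):
    """Feature-level fusion (assumed form): normalize each modality so that
    neither dominates by scale, then concatenate into one joint vector."""
    return l2_normalize(audio_vec) + l2_normalize(text_vec)


def self_similarity_matrix(frames):
    """Cosine self-similarity of a sequence of feature frames.

    Entry [i][j] is the cosine similarity between frame i and frame j,
    so the matrix is symmetric and non-zero frames have 1.0 on the diagonal.
    """
    unit = [l2_normalize(f) for f in frames]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


if __name__ == "__main__":
    # Toy inputs, not data from the study.
    print(fuse_features([3.0, 4.0], [0.0, 2.0]))
    print(self_similarity_matrix([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]))
```

In this sketch, frames that point in the same direction score close to 1 regardless of magnitude, which is why cosine similarity is a common choice for such matrices; the filtering step applied to the matrix in the study is not reproduced here.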