With the development of the Internet and social media, a large amount of multimedia data is generated and uploaded every day. Although these data may come in different modalities, such as text, images, video, and audio, they are semantically correlated. Effective cross-modal and multi-modal learning offers great opportunities for many practical applications, such as cross-modal retrieval, matching, recommendation, and classification, which play important roles in public security, social media, entertainment, healthcare, etc. However, because of the inherent heterogeneity of cross-modal data, it is very challenging to model the correlation among data of different modalities for practical tasks.

This special issue aims to assemble recent advances in cross-modal retrieval and analysis to address these problems and benefit researchers in the field. It is a joint special issue organized in cooperation with the China Multimedia Conference 2022. We received 36 submissions, and seven papers were selected for publication after at least double peer review. We are pleased to present them below.

To investigate the precise inter-modality relationship for cross-modal retrieval tasks, the paper "Prototype Local-Global Alignment Network for Image-Text Retrieval" by L. Meng, F. Zhang, X. Zhang, and C. Xu presents a novel framework that jointly performs fine-grained local alignment and high-level global alignment. On the one hand, prototype-based local alignment divides region-word alignment into region-prototype and word-prototype alignment, which can bridge the modality gap well and avoid

Richang Hong