Cross-modal retrieval takes data of one modality as the query to search for related data of other modalities (e.g., images vs. texts). Because a heterogeneous gap exists between different media, mainstream methods focus on reducing the modality gap through common space learning. However, this gap is large and hard to eliminate completely. Moreover, representations within the same modality are diverse, an issue that is important yet ignored by most existing methods. In this paper, we propose a novel cross-modal retrieval method via Similarity-preserving Learning and Semantic Average Embedding (SLSAE). Our method rests on two key ideas: reducing the modality gap through similarity-preserving learning, and using semantic average embeddings to weaken the impact of the diversity that remains in the common space. Similarity-preserving learning pushes embeddings of the same category together and pulls embeddings of different categories apart. Reducing the influence of embedding diversity improves both performance and robustness, making the method better suited to real-world cross-modal retrieval applications. The proposed model is concise and can be flexibly extended to multimodal retrieval. Comprehensive experimental results show that our method significantly outperforms state-of-the-art methods in bimodal cross-modal retrieval and also achieves excellent performance in multimodal retrieval scenarios.

INDEX TERMS Common space learning, cross-modal retrieval, multimodal retrieval, similarity-preserving learning.
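To make the two key ideas concrete, the following is a minimal NumPy sketch, not the paper's actual implementation: a toy similarity-preserving objective that pushes cosine similarities of same-category image–text pairs toward 1 and different-category pairs toward 0 (the squared-error form and all function names are assumptions for illustration), and a semantic average embedding step that averages common-space vectors per category to suppress intra-class diversity at retrieval time.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Row-wise L2 normalisation so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def similarity_preserving_loss(img_emb, txt_emb, labels):
    """Toy similarity-preserving objective (assumed squared-error form):
    same-category image-text pairs are pushed toward similarity 1,
    different-category pairs toward 0."""
    sim = l2_normalize(img_emb) @ l2_normalize(txt_emb).T   # pairwise cosine similarities
    target = (labels[:, None] == labels[None, :]).astype(float)
    return np.mean((sim - target) ** 2)

def semantic_average_embeddings(emb, labels):
    """Average the common-space embeddings of each category to obtain one
    'semantic average' vector per class, weakening intra-class diversity."""
    classes = np.unique(labels)
    return classes, np.stack([emb[labels == c].mean(axis=0) for c in classes])

# Tiny usage example with random data (4-dim common space, 3 categories).
rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 2, 2])
img_emb = rng.normal(size=(5, 4))
txt_emb = rng.normal(size=(5, 4))
print("loss:", similarity_preserving_loss(img_emb, txt_emb, labels))
classes, centers = semantic_average_embeddings(l2_normalize(txt_emb), labels)
query = l2_normalize(img_emb)[0]                              # image query against text class averages
print("best class for query:", classes[np.argmax(centers @ query)])
```

In this sketch, retrieval matches a query against per-class average embeddings rather than individual samples, which is one simple way to reduce the effect of diverse representations within a modality; the paper's formulation of both components may differ.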