Massively generated single-cell multi-omics datasets are revolutionizing biological studies of heterogeneous tissues and organisms, and they necessitate powerful computational methods to unlock the full potential of these data. Here, we present Concerto, which stands for self-distillation contrastive learning of cell representations, a self-supervised representation learning framework optimized with an asymmetric teacher-student configuration to analyze single-cell multi-omics datasets, with scalability up to building a 10-million-cell reference within 1.5 hours and querying 10k cells within 8 seconds. Concerto leverages a dropout layer as minimal data augmentation to learn meaningful cell representations in a contrastive manner. The teacher module uses an attention mechanism to aggregate contextualized gene embeddings within the cellular context, while the student module uses a simpler dense structure with discrete input. The learned task-agnostic representations can be adapted to a broad range of single-cell computational tasks: 1) via supervised fine-tuning, Concerto enables automatic cell classification as well as novel cell-type discovery; 2) attention weights provide model interpretability by automatically extracting specific molecular signatures at single-cell resolution without the need for clustering; 3) via source-aware training, Concerto supports efficient data integration by projecting all cells across multiple batches into a joint embedding space; 4) via batch-aware inference or unsupervised fine-tuning, Concerto enables mapping query cells onto a reference and accurately transferring annotations. Concerto can flexibly extend to multi-omics datasets simply through a cross-modality summation operation to obtain unified cell embeddings. Using examples from human peripheral blood, human thymus, human pancreas, and a mouse tissue atlas, Concerto shows superior performance when benchmarked against other top-performing methods.
We also demonstrate that Concerto recapitulates detailed COVID-19 disease variation through query-to-reference mapping. Concerto can operate on all genes and represents a fully data-driven approach with minimal prior distribution assumptions, eliminating the need for PCA-like or autoencoder-like dimensionality reduction, which departs significantly from current best practice. Concerto is a simple, straightforward, robust, and scalable framework that offers a new perspective for deriving cell representations and can effectively satisfy the emerging paradigm of query-to-reference mapping in the era of atlas-level single-cell multimodal analysis.
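
To make the idea of dropout as minimal data augmentation concrete, the following is a minimal sketch, not the Concerto implementation. It assumes that two independently dropped-out copies of the same cell vector stand in for the teacher and student outputs (the actual framework passes cells through an attention-based teacher and a dense student network), and it uses an InfoNCE-style contrastive loss with a hypothetical temperature of 0.1. All function names and parameter values here are illustrative assumptions.

```python
import math
import random


def dropout(vec, rate, rng):
    """Zero each entry with probability `rate` and rescale the survivors."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vec]


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(teacher_views, student_views, temperature=0.1):
    """InfoNCE-style loss: teacher view i should match student view i
    and differ from the student views of all other cells in the batch."""
    t = [l2_normalize(v) for v in teacher_views]
    s = [l2_normalize(v) for v in student_views]
    total = 0.0
    for i, ti in enumerate(t):
        logits = [dot(ti, sj) / temperature for sj in s]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_denominator - logits[i]
    return total / len(t)


rng = random.Random(0)
# Toy "cells": 8 random vectors with 32 features each (assumed sizes).
cells = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(8)]

# Two independent dropout masks give two views of every cell.
teacher_views = [dropout(c, 0.2, rng) for c in cells]
student_views = [dropout(c, 0.2, rng) for c in cells]

# Correctly paired views should give a lower loss than mispaired ones.
matched = contrastive_loss(teacher_views, student_views)
shuffled = contrastive_loss(teacher_views, student_views[1:] + student_views[:1])
print(matched, shuffled)
```

In this toy setting the loss for correctly paired views is lower than the loss after the student views are rotated by one position, which is the signal a contrastive objective exploits during training.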