Most current multimodal analysis methods focus on single-cell and spatial transcriptomic data and neglect joint analysis with phenotypes, so they lack biological interpretability at the phenotypic level. Because bulk RNA-seq data carry a wealth of clinical phenotype information, we developed Single-Cell and Tissue Phenotype prediction (SCTP), a multimodal fusion framework based on deep learning. SCTP simultaneously detects phenotype-specific cells and characterizes the tumor microenvironment of pathological tissue by integrating the essential information from the bulk sample phenotype, the composition of individual cells and the spatial distribution of cells. After evaluating the efficiency and robustness of SCTP against traditional approaches, we constructed a disease-specific model, SCTP-CRC, using bulk RNA-seq, scRNA-seq and spatial transcriptomic data from colorectal cancer (CRC). SCTP-CRC unveils tumor-associated cells and clusters, continuously defines boundary regions and the spatial organization of the entire tumor microenvironment, and delineates cellular communication networks along the dynamics of tumor transition. Moreover, SCTP-CRC extends to the identification of abnormal sub-regions in the early state of CRC and uncovers potential early-warning signature genes such as MMP2, IGKC and PIGR. Most of these new CRC signatures may also be relevant to liver metastases arising from CRC and may help distinguish them from primary liver cancer. SCTP promises a deeper understanding of the tumor microenvironment, a quantitative characterization of cancer hallmarks, and insight into the underlying complex molecular and cellular interplay. SCTP therefore has the potential to support early diagnosis and personalized treatment of colorectal cancer and other complex diseases.