Spatial transcriptomics is a powerful technology for high-resolution mapping of gene expression in tissue samples, enabling a molecular-level understanding of tissue architecture. Data acquisition entails dissecting and profiling micron-thick tissue slices, and a comprehensive study often requires multiple slices. However, variations in slicing and slice displacement leave the slices without a common coordinate framework (CCF), which makes data comparison and integration challenging and can compromise the accuracy of downstream analysis. Here we present STaCker, a deep learning algorithm that unifies the coordinates of transcriptomic slices via image registration. STaCker derives a composite image representation by integrating the tissue image with gene expression data that have been transformed to be resilient to noise and batch effects. Trained exclusively on diverse synthetic data, STaCker overcomes the scarcity of training data and is applicable to any tissue type. On various benchmarking datasets, it significantly increases the spatial concordance of aligned slices, surpassing existing methods. STaCker also successfully harmonizes multiple real spatial transcriptome datasets. These results indicate that STaCker is a valuable computational tool for constructing a CCF from spatial transcriptome data.
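
To make the idea of a composite image representation concrete, the following is a minimal sketch in Python with NumPy. It is not STaCker's implementation: the function names, the rank-based normalization (chosen here only because ranks are unchanged by monotone intensity shifts between batches), and the two-channel layout are all assumptions made for illustration, and the registration network itself is omitted.

```python
import numpy as np


def rasterize_spots(coords, values, shape):
    """Average per-spot values onto an (H, W) pixel grid; empty pixels stay 0.

    coords: (n, 2) array of (row, col) spot positions in pixel units.
    """
    total = np.zeros(shape, dtype=float)
    count = np.zeros(shape, dtype=float)
    rows = np.clip(np.round(coords[:, 0]).astype(int), 0, shape[0] - 1)
    cols = np.clip(np.round(coords[:, 1]).astype(int), 0, shape[1] - 1)
    # np.add.at accumulates correctly when several spots share a pixel.
    np.add.at(total, (rows, cols), values)
    np.add.at(count, (rows, cols), 1.0)
    return np.divide(total, count, out=np.zeros(shape), where=count > 0)


def rank_normalize(values):
    """Map values to (0, 1] by rank (hypothetical stand-in transform).

    Ties are broken by input order.
    """
    order = np.argsort(values, kind="stable")
    ranks = np.argsort(order, kind="stable")  # rank 0 .. n-1 of each value
    return (ranks + 1) / len(values)


def composite_image(tissue_gray, coords, expression):
    """Stack a grayscale tissue image and a rasterized expression channel."""
    expr_img = rasterize_spots(coords, rank_normalize(expression),
                               tissue_gray.shape)
    return np.stack([tissue_gray, expr_img], axis=-1)  # shape (H, W, 2)


# Toy usage: a 4x4 tissue image and three spots measuring one gene.
tissue = np.full((4, 4), 0.5)
coords = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 3.0]])
expression = np.array([10.0, 200.0, 50.0])
image = composite_image(tissue, coords, expression)
print(image.shape)  # (4, 4, 2)
```

In this sketch, two slices converted this way could be passed to any image registration routine, and the resulting transform applied back to the spot coordinates to place both slices in a shared frame.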