In this rapidly evolving era of multimodal generation, diffusion models exhibit impressive generative capabilities, significantly enhancing creative image synthesis from intricate textual prompts. Yet their effectiveness remains limited in certain niche domains, such as depicting Chinese ancient architecture. This limitation stems primarily from insufficient data covering the unique architectural features and the corresponding textual descriptions. Hence, we build an extensive multimodal dataset capturing the essence of Chinese architecture, mostly from the Tang to the Yuan Dynasties. The dataset is organized by data type, including image-text pairs, videos, and style models. Specifically, images and videos are methodically categorized by location. All images are annotated at two levels: initial annotations and descriptive terms derived from distinctive characteristics and official information. Moreover, seven fine-tuning models for distinct artistic styles are provided in our dataset to support further creative work. Notably, this is the first Chinese ancient architecture dataset and the first instance of using the Pinyin system to annotate unique terms related to Chinese architectural styles.
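As a rough illustration of the two-level annotation scheme and the Pinyin-based terms described above, the following is a minimal, hypothetical sketch of how one image record might be represented; all field names, terms, and values here are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical two-level annotation record for a single image.
# Field names and values are illustrative assumptions, not the dataset's actual schema.
annotation = {
    "image_id": "example_0001",
    "location": "Shanxi",            # images and videos are categorized by location
    "dynasty": "Tang",               # the dataset mainly covers the Tang to Yuan Dynasties
    # Level 1: initial annotation (a general caption)
    "level_1": "A timber hall with a hipped roof and deep eaves",
    # Level 2: descriptive terms, with Pinyin used for unique architectural concepts
    "level_2": {
        "roof_type": "wudian",       # Pinyin term for a hipped roof
        "bracket_set": "dougong",    # Pinyin term for interlocking bracket sets
    },
}

# A text-to-image prompt could then be assembled from both annotation levels.
caption = annotation["level_1"] + ", featuring " + ", ".join(annotation["level_2"].values())
print(caption)
```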