In recent years, artworks have been increasingly digitized and organized into databases, and such databases have become convenient tools for researchers. Those who retrieve artwork come not only from the humanities but also from materials science, physics, art, and other fields. For researchers across these fields whose studies focus on the colors of artwork, finding the required records in existing databases is difficult, because those databases can only be queried by metadata rather than by color. Moreover, although some image retrieval engines can retrieve artwork from a text description, existing systems mainly match on the dominant colors of an image, so rare cases of color use are difficult to find. This makes such search engines of limited use to the many researchers who focus on toning, colors, or pigments. To address these two problems, we propose a cross-modal multi-task fine-tuning method based on CLIP (Contrastive Language-Image Pre-Training), which exploits the human perceptual characteristics of color captured in the language space and the geometric characteristics of an artwork's sketch to obtain better representations of the artwork. The experimental results show that the proposed retrieval framework supports intuitive searches for rare colors, and that a small amount of data is sufficient to improve the correspondence between text descriptions and color information.
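To make the multi-task objective concrete, the following is a minimal sketch, not the authors' released code, of how such a fine-tuning loss could be composed on top of CLIP. It assumes the Hugging Face `transformers` CLIP implementation and the `openai/clip-vit-base-patch32` checkpoint; the function name `multitask_loss`, the sketch-alignment auxiliary task, and the weighting `lambda_sketch` are illustrative assumptions rather than the paper's exact formulation.

```python
# Hypothetical sketch of a cross-modal multi-task fine-tuning objective on
# CLIP: (1) a contrastive loss between color descriptions and artwork images,
# plus (2) an assumed auxiliary loss aligning each image with its sketch
# (line/edge rendering) to preserve geometric structure.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def multitask_loss(images, sketches, color_texts, lambda_sketch=0.5):
    """images/sketches: aligned lists of PIL images; color_texts: strings
    describing color use (e.g. 'a print with a rare smalt-blue pigment')."""
    img_in = processor(images=images, return_tensors="pt")
    skt_in = processor(images=sketches, return_tensors="pt")
    txt_in = processor(text=color_texts, return_tensors="pt", padding=True)

    img_emb = F.normalize(model.get_image_features(**img_in), dim=-1)
    skt_emb = F.normalize(model.get_image_features(**skt_in), dim=-1)
    txt_emb = F.normalize(model.get_text_features(**txt_in), dim=-1)

    labels = torch.arange(len(images))

    # Task 1: symmetric InfoNCE between color descriptions and artworks,
    # pulling color vocabulary in the language space toward the images.
    logits = img_emb @ txt_emb.t() * model.logit_scale.exp()
    loss_text = (F.cross_entropy(logits, labels)
                 + F.cross_entropy(logits.t(), labels)) / 2

    # Task 2 (assumed design): match each artwork with its own sketch so the
    # image encoder retains geometry while the text task reshapes color axes.
    logits_s = img_emb @ skt_emb.t() * model.logit_scale.exp()
    loss_sketch = F.cross_entropy(logits_s, labels)

    return loss_text + lambda_sketch * loss_sketch
```

At retrieval time, the same text encoder would embed a free-form color query (e.g. a rare pigment name) and rank artworks by cosine similarity, which is how a fine-tuned CLIP model of this kind is typically used.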