Visual search is convenient and efficient, and major tourism e-commerce platforms widely use it in their product search functions. This study introduces CLIP-ItP, a visual search engine model designed to explore the potential of visual search in tourism e-commerce. The model extends the CLIP (Contrastive Language-Image Pre-training) framework and is developed in three stages. First, an image feature extractor and a linear model are trained to label images, establishing an experimental visual search engine. Second, CLIP-ItP jointly trains multiple text and image encoders to integrate multimodal data, including product image labels, categories, names, and attributes. Finally, combining user-uploaded images with the product attributes selected alongside them, CLIP-ItP provides personalized top-k product recommendations.
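The final stage, top-k recommendation from an uploaded image plus selected attributes, can be illustrated with a minimal sketch. This is not the CLIP-ItP implementation: it assumes that image and attribute embeddings already exist in a shared space, that the two are fused by a simple weighted average (a hypothetical choice), and that products are ranked by cosine similarity. The names `fuse_query`, `top_k_products`, `image_weight`, and the toy catalog are illustrative only.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fuse_query(image_emb, attribute_emb, image_weight=0.5):
    """Blend an image embedding with an attribute embedding.

    Assumption: a weighted average of unit vectors; the paper's actual
    fusion mechanism is not described in this section.
    """
    if len(image_emb) != len(attribute_emb):
        raise ValueError("embeddings must have the same dimension")
    img = l2_normalize(image_emb)
    attr = l2_normalize(attribute_emb)
    mixed = [image_weight * a + (1.0 - image_weight) * b
             for a, b in zip(img, attr)]
    return l2_normalize(mixed)


def top_k_products(query, catalog, k=2):
    """Return the ids of the k products most similar to the query."""
    scored = (
        (sum(q * p for q, p in zip(query, l2_normalize(emb))), product_id)
        for product_id, emb in catalog.items()
    )
    return [product_id for _, product_id in heapq.nlargest(k, scored)]


# Toy 3-dimensional embeddings standing in for encoder outputs.
catalog = {
    "beach_resort": [0.9, 0.1, 0.0],
    "city_tour": [0.1, 0.9, 0.1],
    "mountain_hike": [0.0, 0.2, 0.9],
}
query = fuse_query(image_emb=[1.0, 0.2, 0.0], attribute_emb=[0.8, 0.0, 0.3])
recommendations = top_k_products(query, catalog, k=2)
```

In a real system the embeddings would come from the trained encoders, and an approximate nearest-neighbor index would likely replace the linear scan over the catalog.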