To overcome the limitations of existing methods for sentiment analysis of tourism reviews, the authors propose TBGAV, an image-text multimodal sentiment analysis method. It consists of three modules: image sentiment extraction, text sentiment extraction, and image-text fusion. The image sentiment extraction module uses a pre-trained VGG19 model to capture visual sentiment features. The text sentiment extraction module uses the tiny bidirectional encoder representations from transformers (TinyBERT) model, combined with a bidirectional gated recurrent unit and attention (BiGRU-Attention) module, to extract deeper sentiment semantics. The image-text fusion module uses a dual linear fusion approach to model the correlations between image and text features, and a maximum decision-making approach to produce high-precision sentiment predictions. On the Yelp dataset, TBGAV achieves an accuracy of 77.51%, a recall of 78.01%, and an F1 score of 78.34%, outperforming the existing methods it is compared against. TBGAV is therefore expected to help improve travel-related recommender systems and marketing strategies.
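The abstract does not say exactly how the dual linear fusion or the maximum decision-making step is computed. Below is a minimal, stdlib-only Python sketch that rests on two assumptions. First, it treats "dual linear fusion" as a standard bilinear layer, z_k = v^T W_k t + b_k, where v is the image feature vector and t is the text feature vector. Second, it treats "maximum decision-making" as selecting the prediction of whichever branch (image, text, or fused) has the highest top-class probability. All function names, weights, and inputs are hypothetical and for illustration only. This is not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def softmax(logits: Vector) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_fusion(img: Vector, txt: Vector,
                    weights: Sequence[Sequence[Vector]],
                    bias: Vector) -> List[float]:
    """Assumed bilinear fusion: z_k = img^T W_k txt + b_k for each class k.

    weights[k] is a len(img) x len(txt) matrix given as a list of rows.
    """
    logits = []
    for w_k, b_k in zip(weights, bias):
        z = b_k
        for i, x in enumerate(img):
            for j, y in enumerate(txt):
                z += x * w_k[i][j] * y  # pairwise image-text interaction
        logits.append(z)
    return logits


def max_decision(branch_probs: Sequence[Vector]) -> Tuple[int, int, float]:
    """Assumed max-decision rule: trust the most confident branch.

    Returns (branch_index, predicted_class, confidence).
    """
    best = (-1, -1, -1.0)
    for b, probs in enumerate(branch_probs):
        cls = max(range(len(probs)), key=lambda c: probs[c])
        if probs[cls] > best[2]:
            best = (b, cls, probs[cls])
    return best


if __name__ == "__main__":
    # Toy 2-D features and 2 sentiment classes (0 = negative, 1 = positive).
    image_feat = [1.0, 0.0]
    text_feat = [0.0, 1.0]
    W = [[[0.0, 0.0], [0.0, 0.0]],   # class 0 interaction matrix
         [[0.0, 2.0], [0.0, 0.0]]]   # class 1 interaction matrix
    fused_probs = softmax(bilinear_fusion(image_feat, text_feat, W, [0.0, 0.0]))
    image_probs = [0.6, 0.4]  # hypothetical image-branch output
    text_probs = [0.3, 0.7]   # hypothetical text-branch output
    print(max_decision([image_probs, text_probs, fused_probs]))
```

In this toy run the fused branch yields logits [0.0, 2.0], so its positive-class probability (about 0.88) exceeds both single-modality branches and the rule returns the fused prediction. A learned model would instead use trained weight tensors and the VGG19 and TinyBERT/BiGRU-Attention features as inputs.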