As the Internet of Things (IoT) expands, the number of connected devices has reached unprecedented levels, heightening concerns about IoT security. Intrusion detection based on deep learning has become a crucial approach for safeguarding IoT ecosystems. However, challenges remain in IoT intrusion detection research, including inadequate feature representation at the classifier level and poor correlation among extracted traffic features, both of which reduce classification accuracy. To address these issues, we propose a novel transformer-based IoT intrusion detection model, MBConv-ViT (MobileNet Convolution and Vision Transformer), which enhances the correlation of extracted features by fusing local and global features. By exploiting the strong correlations within traffic flows, the model can identify subtle differences in IoT traffic and thereby classify attack traffic precisely. Experiments on the open datasets TON-IoT and Bot-IoT show that MBConv-ViT achieves accuracies of 97.14% and 99.99%, respectively, outperforming several typical existing models.