The growing variety of geophysical data sets strengthens the constraints available for inversion, but it also challenges the precision and efficiency of conventional methods because of factors such as non‐linearity and limited observation coverage. Deep learning has the potential to address these challenges, yet issues of data representation and domain transformation have limited the development of deep learning‐based inversion methods that apply across geophysical scenarios. To address these issues, we propose G(eophysics)‐Query, a novel Transformer‐based inversion framework for multimodal geophysical data that expands the vector representation concept and incorporates query adjusting in the attention mechanism. A single network can adapt to diverse inversion settings (different observation systems, targets, and prior information), which eliminates the need for setting‐specific design adjustments and repeated training. The framework also incorporates a novel semi‐supervised training strategy that explores empirical relationships for estimating prediction uncertainty, ultimately providing more comprehensive inversion results. Applied to the joint inversion of surface waves and receiver functions, the framework successfully recovers the lithospheric structure of the conterminous U.S., demonstrating high precision, efficiency, and broad applicability.
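
As a rough illustration of why a query-driven attention mechanism can serve different observation systems and targets with one set of weights, the following minimal sketch implements generic scaled dot-product cross-attention in plain Python. It is not the G(eophysics)‐Query architecture itself: the token values, the three-dimensional embedding, and the names `cross_attention`, `surface_wave_tokens`, `receiver_function_tokens`, and `query` are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention for one query over any number of tokens.

    The number of tokens is not fixed, so the same function accepts
    observations from one modality or from several.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    out_dim = len(values[0])
    output = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(out_dim)
    ]
    return output, weights


# Hypothetical vector representations of observations (assumed values).
surface_wave_tokens = [[1.0, 0.0, 0.5], [0.8, 0.1, 0.4]]
receiver_function_tokens = [[0.0, 1.0, 0.2]]

# Hypothetical query describing the inversion target (assumed values).
query = [1.0, 0.0, 0.0]

# Joint setting: tokens from both modalities are simply concatenated.
tokens = surface_wave_tokens + receiver_function_tokens
joint_output, joint_weights = cross_attention(query, tokens, tokens)

# Single-modality setting: the same function runs without any redesign.
single_output, single_weights = cross_attention(
    query, surface_wave_tokens, surface_wave_tokens
)
```

In this sketch, changing the observation system only changes how many tokens are passed in, and changing the target only changes the query vector; the attention routine itself is untouched. How the actual framework builds its tokens and adjusts its queries is not specified here.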