Despite significant progress on the mesh-based Best Viewpoint Selection (BVS) problem using multiple views, the current state-of-the-art BVS method requires 20-30 rendered views and is limited to selecting from a predefined set of viewpoint samples, which may miss optimal viewpoints and precludes its use when response time is critical. To address these limitations, we present a fast dual-branch regression model for best viewpoint selection that significantly reduces the number of input views required, enables continuous viewpoint prediction, and improves interactive response speed. Our approach combines a geometry-enhanced multi-view feature extractor with a learnable token and employs a cross-modal distillation method to enrich the model's understanding of 3D structure. Specifically, view features are embedded together with a dimensionally matched learnable token and processed through three cascaded self-attention layers, so that the resulting token encapsulates fused features better suited to the viewpoint selection task. In addition, to reduce the number of views required, we incorporate cross-modal distillation into the BVS solution by imposing alignment constraints between 3D geometry descriptors and fused multi-view representations, avoiding the computational cost of rendering dozens of views. Experimental results on public benchmarks show that our method is approximately 35 times faster than the state-of-the-art method when only six views are used, while also achieving the best quantitative metrics.
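The token-based fusion and the distillation constraint outlined above can be illustrated with a minimal PyTorch sketch. This is not the released implementation: the feature dimension, the use of standard Transformer encoder layers for the three cascaded self-attention layers, the 3-D unit-direction output of the regression head, and the cosine form of the alignment (distillation) loss are all assumptions made for illustration.

```python
# Minimal sketch (illustrative, not the authors' code) of token-based multi-view
# fusion with three cascaded self-attention layers and a cross-modal alignment
# loss against a precomputed 3D geometry descriptor.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenFusionBVS(nn.Module):
    def __init__(self, feat_dim: int = 512, num_layers: int = 3, num_heads: int = 8):
        super().__init__()
        # Learnable token, dimensionally matched to the per-view features.
        self.token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        nn.init.trunc_normal_(self.token, std=0.02)
        # Three cascaded self-attention (Transformer encoder) layers.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Regression head mapping the fused token to a continuous viewpoint
        # (here assumed to be a unit viewing direction).
        self.head = nn.Linear(feat_dim, 3)

    def forward(self, view_feats: torch.Tensor):
        # view_feats: (B, V, D) features extracted from V rendered views (e.g. V = 6).
        b = view_feats.size(0)
        token = self.token.expand(b, -1, -1)               # (B, 1, D)
        x = torch.cat([token, view_feats], dim=1)          # (B, 1 + V, D)
        x = self.encoder(x)
        fused = x[:, 0]                                    # fused multi-view token (B, D)
        viewpoint = F.normalize(self.head(fused), dim=-1)  # continuous view direction
        return viewpoint, fused


def alignment_loss(fused: torch.Tensor, geom_desc: torch.Tensor) -> torch.Tensor:
    # Cross-modal distillation constraint: align the fused multi-view token with
    # a 3D geometry descriptor (e.g. from a point-cloud encoder). A cosine
    # alignment form is assumed here.
    return 1.0 - F.cosine_similarity(fused, geom_desc, dim=-1).mean()


if __name__ == "__main__":
    model = TokenFusionBVS()
    feats = torch.randn(2, 6, 512)   # two meshes, six views each
    geom = torch.randn(2, 512)       # geometry-branch descriptors (teacher side)
    vp, fused = model(feats)
    loss = alignment_loss(fused, geom)
    print(vp.shape, loss.item())
```

At inference time only the multi-view branch above would be needed, which is consistent with the abstract's claim that the geometry descriptor serves as a training-time alignment target rather than a runtime input.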