The growing use of digital products in smart cities produces ever-increasing volumes of 3D model data, so retrieving the relevant 3D model has become a crucial problem. In this paper, we propose Multi-View Tree Structure (MVTS) learning for 3D model retrieval and recognition. MVTS contains three key consecutive modules. First, the visual feature learning module extracts visual features from multiple views. Next, we design a score matrix to estimate the value of the contextual information between view pairs and, based on this matrix, construct a maximum spanning tree to further explore the contextual information within the views. We then use a bidirectional Tree-LSTM to encode the contextual information among views together with the spatial information of the tree structure, and to optimize the tree parameters. Finally, a tree attention strategy is adopted to estimate the importance of each view. Compared with existing methods, our method explores the spatial information of a 3D model without requiring specific camera settings, which makes it more suitable for real applications. Moreover, it jointly performs feature learning, the encoding of view-wise contextual information and tree spatial information, and view importance estimation, which enhances the discriminability of the 3D model representation. Extensive experiments on ModelNet40 and ShapeNetCore55 demonstrate the superiority of our method.
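
To make the tree construction step concrete, the following is a minimal sketch of building a maximum spanning tree over views from a pairwise score matrix. It is an illustration under stated assumptions, not the implementation used in this work: the abstract does not say which spanning-tree algorithm is used or how the scores are computed, so Prim's algorithm is assumed here, the score matrix is assumed to be symmetric, and the function name `maximum_spanning_tree`, the `root` parameter, and the toy score values are all hypothetical.

```python
from typing import List, Tuple


def maximum_spanning_tree(
    score: List[List[float]], root: int = 0
) -> List[Tuple[int, int]]:
    """Return (parent, child) edges of a maximum spanning tree.

    Assumption: `score` is a symmetric n x n matrix where score[i][j]
    measures the contextual value between view i and view j.
    Prim's algorithm is used here for illustration only.
    """
    n = len(score)
    if n == 0:
        return []

    in_tree = [False] * n
    in_tree[root] = True
    # Best known score linking each outside view to the current tree,
    # and the tree node that provides it.
    best_score = list(score[root])
    best_parent = [root] * n

    edges: List[Tuple[int, int]] = []
    for _ in range(n - 1):
        # Pick the view outside the tree with the strongest link to it.
        child = max(
            (v for v in range(n) if not in_tree[v]),
            key=lambda v: best_score[v],
        )
        edges.append((best_parent[child], child))
        in_tree[child] = True
        # The newly added view may offer stronger links to remaining views.
        for v in range(n):
            if not in_tree[v] and score[child][v] > best_score[v]:
                best_score[v] = score[child][v]
                best_parent[v] = child
    return edges


# Toy symmetric scores for four views (made-up values, illustration only).
toy_scores = [
    [0.0, 0.9, 0.2, 0.4],
    [0.9, 0.0, 0.7, 0.1],
    [0.2, 0.7, 0.0, 0.8],
    [0.4, 0.1, 0.8, 0.0],
]
tree_edges = maximum_spanning_tree(toy_scores)
print(tree_edges)  # [(0, 1), (1, 2), (2, 3)]
```

With these toy scores the tree links view 0 to 1, 1 to 2, and 2 to 3, keeping the three highest-scoring edges that do not form a cycle. The resulting parent-child edges are the kind of structure a Tree-LSTM could then traverse bottom-up and top-down.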