At present, most graphic design methods for 3D animation scenes capture only a small part of the scene, and all of them operate in an observer-centered 3D coordinate system, which makes it difficult to express the depth information of the scene. This paper studies a graphic design method for 3D animation scenes based on deep learning. A convolutional neural network (CNN) is used to build a multi-view 3D animation scene generation network, and the 3D geometry and structure of objects are reconstructed from a group of multiple images. On this basis, a feature extraction method for 3D animation scenes is studied, and a collaborative learning model of multiple networks is established to improve the modeling accuracy of 3D animation scenes. Experimental results show that the network model outperforms methods based on multi-view images and on 0-1 voxels in retrieval performance, with accuracy reaching 91.725%. The proposed multi-view 3D animation scene generation method achieves better results than current advanced methods, which indicates that the proposed multi-view feature fusion network is a more reasonable way to fuse multi-view image features.
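To make the idea of multi-view feature fusion concrete, the following is a minimal sketch, not the network proposed in this paper. The abstract does not specify the fusion operator, so the sketch assumes a common baseline: element-wise max (or mean) pooling across per-view feature vectors. The function name `fuse_views` and its `mode` parameter are hypothetical, and the plain lists of floats stand in for features that a CNN would extract from each rendered view.

```python
from typing import List, Sequence


def fuse_views(view_features: Sequence[Sequence[float]], mode: str = "max") -> List[float]:
    """Fuse per-view feature vectors into a single scene descriptor.

    Assumption: fusion is an element-wise pooling over views. This is an
    illustrative baseline, not the fusion network described in the paper.
    """
    if not view_features:
        raise ValueError("at least one view is required")
    dim = len(view_features[0])
    if any(len(v) != dim for v in view_features):
        raise ValueError("all views must share the same feature dimension")

    # zip(*view_features) groups the i-th feature of every view together.
    columns = list(zip(*view_features))
    if mode == "max":
        # Keep the strongest response to each feature across all views.
        return [max(col) for col in columns]
    if mode == "mean":
        # Average the response to each feature across all views.
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown fusion mode: {mode!r}")


# Two hypothetical views of the same object, three features each.
fused = fuse_views([[0.1, 0.9, 0.3], [0.4, 0.2, 0.8]])
```

Max pooling keeps, for each feature, the view in which it responds most strongly, so the fused descriptor does not depend on the order of the views. A learned fusion network, as proposed here, would replace this fixed rule with trainable weights.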