Because a single image carries limited information, generating a high-precision 3D model from one image is very difficult. Existing 3D voxel generation methods also suffer from problems such as information loss in the upper layers of a network. To address these problems, we design a 3D model generation network based on multi-modal data constraints and multi-level feature fusion, named 3DMGNet. 3DMGNet is trained in a self-supervised manner to generate a 3D voxel model from a single image. An image feature extraction network (2DNet) and a 3D feature extraction network (3D auxiliary network) extract features from the image and the 3D voxel model, respectively. Feature fusion then integrates the low-level features into the high-level features in the 3D auxiliary network, and each feature map in the feature extraction networks is processed by an attention network to extract more effective features. Finally, a 3D deconvolution network generates the 3D model from the extracted features. The 3D feature extraction and voxel generation branches play an auxiliary role in training the whole network for image-based 3D model generation. In addition, a multi-view contour constraint is proposed to further improve the quality of the generated 3D models. Experiments on the ShapeNet dataset verify the effectiveness and robustness of the proposed method.
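To make the pipeline concrete, the following is a minimal PyTorch sketch of the image-to-voxel path described above: a 2D encoder whose every stage is followed by an attention module, and a 3D deconvolution decoder that outputs an occupancy grid. All layer counts, channel widths, the ChannelAttention module, and the 32³ output resolution are illustrative assumptions, not the paper's actual architecture; the 3D auxiliary network, feature fusion, and multi-view contour constraint are omitted.

```python
# Illustrative sketch (not the authors' code) of the image-to-voxel pipeline:
# a 2D encoder with per-layer attention, then a 3D deconvolution decoder.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style attention over a 2D feature map (assumed form)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(x.mean(dim=(2, 3)))  # global average pool -> channel weights
        return x * w[:, :, None, None]   # reweight channels


class ImageToVoxel(nn.Module):
    """2D encoder (2DNet analogue) + 3D deconvolution decoder; sizes are assumptions."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True)),
        ])
        self.attention = nn.ModuleList([ChannelAttention(c) for c in (32, 64, 128)])
        self.to_latent = nn.Linear(128 * 16 * 16, 256 * 4 * 4 * 4)
        self.decoder = nn.Sequential(  # upsample 4^3 latent grid to a 32^3 voxel grid
            nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(64, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        x = image
        for conv, attn in zip(self.encoder, self.attention):
            x = attn(conv(x))  # attention applied to every encoder feature map
        z = self.to_latent(x.flatten(1)).view(-1, 256, 4, 4, 4)
        return self.decoder(z)  # per-voxel occupancy probabilities in [0, 1]


if __name__ == "__main__":
    voxels = ImageToVoxel()(torch.randn(2, 3, 128, 128))
    print(voxels.shape)  # torch.Size([2, 1, 32, 32, 32])
```

In this sketch the auxiliary 3D branch would share the same latent space during training, and a reconstruction loss on the occupancy grid (plus the proposed contour constraint) would supervise the whole network.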