The effective use of multimodal data to obtain accurate land cover information has become an interesting and challenging research topic in the field of remote sensing. In this paper, we propose a new method, the multi-scale learning and attention enhancement network (MSLAENet), to perform hyperspectral image (HSI) and light detection and ranging (LiDAR) data fusion classification in an end-to-end manner. Specifically, our model consists of three main modules. First, we design the composite attention (CA) module, which adopts self-attention to enhance the feature representations of the HSI and LiDAR data, respectively, and cross-attention to achieve cross-modal information enhancement. Second, the proposed multi-scale learning (MSL) module combines self-calibrated convolutions and a hierarchical residual structure to extract information at different scales and further improve the representation capability of the model. Finally, the attention-based feature fusion (FF) module fully exploits the complementary information between the two modalities and adaptively fuses their heterogeneous features. To evaluate the performance of MSLAENet, we conduct experiments on three multimodal remote sensing datasets and compare it with state-of-the-art fusion models; the results demonstrate the effectiveness and superiority of the proposed model.
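To make the two-branch design described above more concrete, the following is a minimal PyTorch sketch of how a CA-style attention stage, a multi-scale stage, and an attention-based fusion head could be wired together. It is not the paper's implementation: all module names, layer sizes, token layouts, head counts, and class counts are assumptions for illustration, and the multi-scale block uses plain parallel convolutions as a stand-in for the self-calibrated convolutions and hierarchical residual structure.

```python
# Illustrative sketch only; sizes, names, and wiring are assumptions, not the paper's code.
import torch
import torch.nn as nn


class CompositeAttention(nn.Module):
    """Self-attention per modality plus HSI->LiDAR cross-attention (assumed design)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.self_hsi = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, hsi, lidar):                       # (B, N, dim) token sequences
        hsi = hsi + self.self_hsi(hsi, hsi, hsi)[0]       # intra-modal enhancement
        lidar = lidar + self.self_lidar(lidar, lidar, lidar)[0]
        hsi = hsi + self.cross(hsi, lidar, lidar)[0]      # cross-modal enhancement
        return hsi, lidar


class MultiScaleLearning(nn.Module):
    """Parallel convolutions with different receptive fields, standing in for the
    self-calibrated / hierarchical-residual design of the MSL module."""
    def __init__(self, dim):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in (1, 3, 5)
        )

    def forward(self, x):                                 # (B, N, dim)
        x = x.transpose(1, 2)                             # (B, dim, N) for Conv1d
        x = x + sum(b(x) for b in self.branches) / len(self.branches)
        return x.transpose(1, 2)


class MSLAENetSketch(nn.Module):
    def __init__(self, hsi_bands, lidar_bands, dim=64, num_classes=15):
        super().__init__()
        self.embed_hsi = nn.Linear(hsi_bands, dim)
        self.embed_lidar = nn.Linear(lidar_bands, dim)
        self.ca = CompositeAttention(dim)
        self.msl = MultiScaleLearning(dim)                # shared across branches for brevity
        self.fuse = nn.MultiheadAttention(dim, 4, batch_first=True)  # attention-based fusion
        self.head = nn.Linear(dim, num_classes)

    def forward(self, hsi_tokens, lidar_tokens):
        h = self.embed_hsi(hsi_tokens)
        l = self.embed_lidar(lidar_tokens)
        h, l = self.ca(h, l)
        h, l = self.msl(h), self.msl(l)
        fused = self.fuse(h, l, l)[0] + h                 # adaptively weight LiDAR features w.r.t. HSI
        return self.head(fused.mean(dim=1))               # pool tokens, then classify


# Usage with random tensors standing in for HSI/LiDAR patch tokens.
logits = MSLAENetSketch(hsi_bands=144, lidar_bands=1)(
    torch.randn(8, 49, 144), torch.randn(8, 49, 1)
)
print(logits.shape)  # torch.Size([8, 15])
```

The sketch keeps the ordering implied by the abstract (attention enhancement, then multi-scale feature extraction, then fusion and classification); in practice the actual MSLAENet modules, losses, and patch handling would follow the paper's specification rather than this simplified layout.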