Multimodal biometric systems have received increasing interest because they offer more secure and accurate authentication than unimodal systems. However, existing biometric fusion methods remain inadequate at handling the correlation and redundancy of multimodal features simultaneously, which limits further performance improvement. To overcome this problem, this paper proposes an end-to-end multimodal finger recognition model that incorporates attention mechanisms into a similarity-aware encoder to achieve accurate recognition results. First, because fingerprint and finger vein images follow different distributions, we propose a finger asymmetric backbone network (FAB-Net) for extracting discriminative intra-modal features, which reduces the network width through efficient utilization of feature maps. Then, a novel attention-based encoder fusion network (AEF-Net) with fused similarity performs dimensionality reduction-based fusion on multimodal multilevel features to alleviate the performance degradation caused by information redundancy. We also introduce channel attention in AEF-Net, which differs from the traditional attention mechanism in that it considers interdependencies between modalities to further improve performance. Extensive recognition experiments are conducted on three multimodal finger databases to verify the effectiveness of our method against state-of-the-art methods. Detailed ablation studies demonstrate that encoder-based reconstruction of redundant information can improve recognition performance.

INDEX TERMS Multimodal biometric recognition, feature fusion, autoencoder, deep learning.
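To make the idea of channel attention across modalities followed by dimensionality-reducing fusion more concrete, the following is a minimal NumPy sketch. It is not the AEF-Net architecture itself: it uses a squeeze-and-excitation style gate as a stand-in, computed from the pooled channel descriptors of both modalities jointly, followed by a single linear encoder layer. All function names, shapes, and the random weights are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cross_modal_channel_attention(fp, fv, w1, w2):
    """Reweight channels of both modalities with one shared gate.

    fp: fingerprint feature maps, shape (C1, H, W)
    fv: finger vein feature maps, shape (C2, H, W)
    w1, w2: weights of a small bottleneck MLP (assumed, untrained here)
    """
    # Squeeze: global average pooling per channel, concatenated across
    # modalities so the gate sees both fingerprint and vein statistics.
    desc = np.concatenate([fp.mean(axis=(1, 2)), fv.mean(axis=(1, 2))])
    # Excite: bottleneck MLP with ReLU, then a sigmoid gate per channel.
    hidden = np.maximum(w1 @ desc, 0.0)
    weights = sigmoid(w2 @ hidden)
    c1 = fp.shape[0]
    fp_att = fp * weights[:c1, None, None]
    fv_att = fv * weights[c1:, None, None]
    return fp_att, fv_att, weights


def encode_fused(fp_att, fv_att, w_enc):
    """Fuse by concatenating pooled features, then project to a lower dimension."""
    fused = np.concatenate([fp_att.mean(axis=(1, 2)), fv_att.mean(axis=(1, 2))])
    return np.tanh(w_enc @ fused)


# Toy example with random inputs and weights (illustrative values only).
rng = np.random.default_rng(0)
fp = rng.standard_normal((8, 4, 4))     # 8 fingerprint channels
fv = rng.standard_normal((6, 4, 4))     # 6 finger vein channels
w1 = rng.standard_normal((4, 14))       # 14 channels -> 4 hidden units
w2 = rng.standard_normal((14, 4))       # 4 hidden units -> 14 channel gates
w_enc = rng.standard_normal((5, 14))    # 14-dim fused vector -> 5-dim code

fp_att, fv_att, weights = cross_modal_channel_attention(fp, fv, w1, w2)
code = encode_fused(fp_att, fv_att, w_enc)
```

Because the gate is computed from the concatenated descriptors, the weight assigned to a fingerprint channel depends on the finger vein features as well, which is the sense in which such attention considers interdependencies between modalities. The encoder step then compresses the 14 fused channel responses into a 5-dimensional code.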