Biodiversity conservation is a critical environmental challenge, and accurate assessment is essential for conservation efforts. This study addresses the limitations of current plant diversity assessment methods, particularly in recognizing mixed and stunted grass species, by developing an enhanced species recognition approach using unmanned aerial vehicle (UAV) hyperspectral data and deep learning models in the steppe region of Xilinhot, Inner Mongolia. We compared five models—support vector machine (SVM), two-dimensional convolutional neural network (2D-CNN), three-dimensional convolutional neural network (3D-CNN), hybrid spectral CNN (HybridSN), and the improved HybridSN+—for grass species identification. The results show that the SVM and 2D-CNN models performed poorly on mixed-distribution and stunted individuals, whereas the HybridSN and HybridSN+ models effectively identified the important grass species in the region, with the HybridSN+ model reaching a recognition accuracy of 96.45% (p < 0.05). Notably, the 3D-CNN model's recognition performance was inferior to that of the HybridSN model, especially for densely populated and smaller grass species. The HybridSN+ model, optimized from the HybridSN model, demonstrated improved recognition of smaller grass species individuals under equivalent conditions, leading to a discernible enhancement in overall accuracy (OA). Diversity indices (Shannon–Wiener diversity, Simpson diversity, and Pielou evenness) were calculated using the identification results from the HybridSN+ model, and spatial distribution maps were generated for each index. A comparative analysis with diversity indices derived from ground survey data revealed a strong correlation and consistency, with minimal differences between the two methods.
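The three diversity indices named above follow their standard ecological definitions. As a minimal sketch (assuming per-species abundances are obtained by counting classified pixels or individuals per species in a plot; the function name is illustrative, not from the study), they can be computed as:

```python
import math

def diversity_indices(counts):
    """Compute Shannon-Wiener diversity (H'), Simpson diversity (1 - D),
    and Pielou evenness (J) from per-species abundances.

    counts: list of non-negative abundances, one entry per species
            (e.g. classified pixel counts from the species map).
    """
    total = sum(counts)
    # Relative abundance p_i of each species present in the plot
    props = [c / total for c in counts if c > 0]
    # Shannon-Wiener: H' = -sum(p_i * ln p_i)
    shannon = -sum(p * math.log(p) for p in props)
    # Simpson (as 1 - D): 1 - sum(p_i^2)
    simpson = 1.0 - sum(p * p for p in props)
    # Pielou evenness: J = H' / ln(S), with S the number of species observed
    s = len(props)
    pielou = shannon / math.log(s) if s > 1 else 0.0
    return shannon, simpson, pielou
```

For example, four species in equal abundance give H' = ln 4, Simpson diversity 0.75, and perfect evenness J = 1.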
This study provides a feasible technical approach for efficient, fine-grained biodiversity assessment, offering a valuable scientific reference for regional biodiversity conservation, management, and restoration.