The growing global energy shortage necessitates a swift transition to renewable energy sources, with wind power emerging as a cost-effective and environmentally friendly option. However, the non-stationary, random, and intermittent nature of wind complicates power grid management, leading to inefficiency and imbalances between energy supply and demand. Various computational methods have been developed to address these issues, including multi-modal machine learning methods. Yet existing methods based on artificial neural networks (ANNs) or long short-term memory (LSTM) networks fail to learn complex patterns from heterogeneous features efficiently, especially when numerical weather prediction (NWP) forecast map data is considered. This paper therefore presents a multi-modal transformer-based deep learning approach for wind power prediction, using a transformer model and a vision transformer model to handle both historical turbine-level time-series data and NWP forecast map data. Extensive experiments demonstrate that the model handles heterogeneous features effectively and outperforms benchmark algorithms in wind power prediction.