Deep Learning models have achieved promising results in forecasting geomagnetic storms. However, their evaluation relies mainly on generic regression metrics (such as the Root Mean Squared Error or the Coefficient of Determination), which do not capture the particularities of geomagnetic storm forecasting. In particular, they give little insight into model behavior during periods of high activity. To overcome this issue, we introduce the Binned Forecasting Error, which assesses a model's performance separately across the different intensity levels of a geomagnetic storm. This metric enables a robust comparison of forecasting models and gives a more faithful picture of a model's predictive capabilities, while remaining resilient to differences in storm duration. Fair comparison among models also requires standardized sets of geomagnetic storms for training, validation, and testing. To this end, we start from the sets currently used in the literature for forecasting the SYM‐H index and enrich them with more recent storms not previously considered, covering not only disturbances caused by Coronal Mass Ejections but also those caused by High‐Speed Streams. To demonstrate the evaluation framework, we conduct a comparative study between a baseline neural network model and a persistence model, showing how effectively the new metric evaluates forecasting performance during intense geomagnetic storms. Finally, we propose using preliminary measurements from ACE to evaluate model performance in settings closer to the operational real‐time scenario in which forecasting models are expected to run.
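To make the idea of a binned error concrete, the following minimal sketch shows one plausible reading of such a metric. It is not the definition used in this work: the function names, the choice of RMSE inside each bin, the unweighted average over bins, and the example bin edges are all assumptions made purely for illustration.

```python
import numpy as np


def per_bin_rmse(observed, predicted, bin_edges):
    """RMSE computed separately within each intensity bin.

    Bins are defined on the OBSERVED values (an assumption of this sketch).
    `bin_edges` must be increasing; bin 0 holds values below the first edge,
    and the last bin holds values at or above the last edge.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    bin_index = np.digitize(observed, bin_edges)

    errors = {}
    for b in range(len(bin_edges) + 1):
        mask = bin_index == b
        if mask.any():  # skip bins with no samples
            diff = observed[mask] - predicted[mask]
            errors[b] = float(np.sqrt(np.mean(diff ** 2)))
    return errors


def binned_error(observed, predicted, bin_edges):
    """Unweighted mean of the per-bin RMSE values (hypothetical aggregation).

    Each non-empty bin counts equally, so long quiet intervals cannot
    dominate the score the way they do in a global RMSE.
    """
    errors = per_bin_rmse(observed, predicted, bin_edges)
    return float(np.mean(list(errors.values())))
```

With hypothetical edges such as `[-100, -50]` (in nT), a storm with many quiet samples and few intense ones contributes exactly one value per bin, which is the sense in which a score of this kind would be insensitive to storm duration.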