Designing routing schedules is a pivotal aspect of smart delivery systems. The field has therefore been flourishing for decades, and numerous algorithms have been proposed for various formulations of rich vehicle routing problems. There is, however, an important gap in the state of the art: the lack of an established and widely adopted approach to the thorough verification and validation of such algorithms in practical scenarios. We tackle this issue and propose a comprehensive validation approach that sheds more light on the functional and non-functional abilities of the solvers. Additionally, we propose novel similarity metrics that measure the distance between routing schedules and can be used to verify the convergence abilities of randomized techniques. To reflect practical aspects of intelligent transportation systems, we introduce an algorithm for generating solvable benchmark instances for any vehicle routing formulation, alongside a set of quality metrics that help quantify real-life characteristics of delivery systems, such as their profitability. The experiments demonstrate the flexibility of our approach by applying it to the NP-hard pickup and delivery problem with time windows, and present qualitative, quantitative, and statistical analysis scenarios that help understand the capabilities of the investigated techniques. We believe that our efforts will be a step toward a more critical and consistent evaluation of emerging vehicle routing (and other) solvers, and will allow the community to compare them more easily and thus ultimately focus on the most promising research avenues, determined in a quantifiable and traceable manner.