Traffic congestion is a pervasive problem worldwide, affecting both large cities and smaller communities. In congested scenes, emergency vehicles tend to cluster tightly together and often occlude one another, which poses serious difficulties for traffic surveillance systems tasked with maintaining order and enforcing the law. Recent advances in machine learning for image processing have significantly improved the accuracy and effectiveness of emergency vehicle classification (EVC) systems, especially when combined with specialized hardware accelerators, and the widespread adoption of these technologies in safety and traffic management applications has supported more sustainable transportation infrastructure management. Vehicle classification has traditionally been carried out manually by specialists, a laborious and subjective procedure that depends heavily on the expertise available. Moreover, erroneous EVC can cause major operational problems, underscoring the need for a more dependable, precise, and efficient classification method. Although a variety of machine learning techniques have been applied to image-based EVC, the process remains labor intensive and time consuming because existing techniques frequently fail to capture each vehicle type adequately. To improve the sustainability of transportation infrastructure management, this article focuses on the development of a reliable and accurate hardware system for identifying emergency vehicles in complex scenes. The proposed system extracts features with a ResNet50 model implemented on a Field Programmable Gate Array (FPGA), refines those features with a multi-objective genetic algorithm (MOGA), and classifies vehicles with a CatBoost (CB) classifier. Surpassing the previous state-of-the-art accuracy of 98%, the ResNet50-MOP-CB network achieved a classification accuracy of 99.87% across four primary categories of emergency vehicles. In tests conducted on tablets, laptops, and smartphones, it demonstrated high accuracy, fast classification times, and robustness for real-world applications, classifying each image in an average of 0.9 nanoseconds with a 96.65% accuracy rate.
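
The pipeline described above can be summarized as three stages: ResNet50 feature extraction, MOGA-based feature selection, and CatBoost classification. The following minimal sketch illustrates that flow in Python under stated assumptions: it is not the authors' FPGA implementation, the library choices (torchvision, catboost) are illustrative, and the multi-objective genetic algorithm is replaced by a placeholder boolean mask standing in for the GA-selected feature subset.

```python
# Illustrative sketch of a ResNet50 -> feature selection -> CatBoost pipeline.
# The FPGA feature extractor and the MOGA optimizer from the paper are replaced
# by simple CPU stand-ins; all names and hyperparameters here are assumptions.

import numpy as np
import torch
from torchvision import models, transforms
from PIL import Image
from catboost import CatBoostClassifier

# 1) Feature extraction: ResNet50 with its final classification layer removed,
#    yielding a 2048-dimensional feature vector per image.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()  # keep the pooled 2048-d features
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_features(image_paths):
    """Return an (N, 2048) array of ResNet50 features for a list of image files."""
    feats = []
    with torch.no_grad():
        for path in image_paths:
            img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            feats.append(resnet(img).squeeze(0).numpy())
    return np.stack(feats)

# 2) Feature selection: the paper uses a multi-objective genetic algorithm (MOGA);
#    here a fixed boolean mask stands in for the GA-selected subset.
def apply_selection(features, mask):
    return features[:, mask]

# 3) Classification: CatBoost over the selected features, with the four
#    emergency-vehicle categories as labels.
def train_classifier(train_feats, train_labels):
    clf = CatBoostClassifier(iterations=500, depth=6, verbose=False)
    clf.fit(train_feats, train_labels)
    return clf
```

In this sketch the feature-selection mask would be produced offline by the MOGA (trading off, for example, subset size against validation accuracy), and only the selected feature columns are passed to the CatBoost classifier at training and inference time.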