Computer vision has enabled real-time traffic flow optimization, vehicle counting, and anomaly detection in transportation engineering, and has improved transportation safety. Most vision systems, however, are developed through supervised learning, which is data-hungry and costly because it requires manual annotation of objects from a variety of sources. The general rule of thumb for building accurate and transferable vision models has been to increase the quality, diversity, and quantity of the annotated datasets used in model training. This paper presents a simple yet efficient active learning framework that significantly reduces the number of annotations needed to build a state-of-the-art vehicle detection and classification model. To achieve this, we first leverage a vision transformer to generate embeddings that capture the information needed to quantify the similarity and diversity between images in a two-dimensional embedding space. To select which images from the embedding space should be annotated, we propose a scoring and sampling strategy that iteratively minimizes class imbalance and model uncertainty. The latest iteration of the You Only Look Once (YOLO) model, YOLOv8, is used as the active learner. Using mean average precision, we compare the efficacy of our proposed active learning methods against models developed at much higher sampling rates. The resulting models were also integrated with tracking algorithms to evaluate differences in vehicle-count accuracy and their practical implications for directional counts.
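
To make the selection step concrete, the following is a minimal sketch of one round of scoring and sampling over a two-dimensional embedding space. It is not the scoring function proposed in this paper: the names `class_rarity` and `select_for_annotation`, the weight `alpha`, the inverse-frequency rarity term, and the greedy farthest-point spread term are all assumptions introduced for illustration. In practice the uncertainty values would come from the current detector and the embeddings from a vision transformer; here both are plain arrays.

```python
import numpy as np


def class_rarity(predicted_class, labeled_class_counts):
    """Weight each image by how under-represented its predicted class is.

    Hypothetical helper: inverse frequency, scaled so the rarest class gets 1.0.
    """
    counts = np.asarray(labeled_class_counts, dtype=float)
    inv = 1.0 / (counts + 1.0)  # +1 avoids division by zero for unseen classes
    inv = inv / inv.max()
    return inv[np.asarray(predicted_class)]


def select_for_annotation(embeddings, uncertainty, predicted_class,
                          labeled_class_counts, budget, alpha=0.5):
    """Greedily pick `budget` image indices to send for annotation.

    Assumed score: alpha * uncertainty + (1 - alpha) * class rarity.
    Each pick is also scaled by its distance to the nearest image already
    picked, so the selection spreads across the embedding space.
    """
    emb = np.asarray(embeddings, dtype=float)
    unc = np.asarray(uncertainty, dtype=float)
    n = len(emb)
    score = alpha * unc + (1.0 - alpha) * class_rarity(
        predicted_class, labeled_class_counts)

    selected = []
    taken = np.zeros(n, dtype=bool)
    min_dist = np.full(n, np.inf)  # distance to the nearest selected image
    for _ in range(min(budget, n)):
        if taken.any():
            top = min_dist.max()
            spread = min_dist / top if top > 0 else np.ones(n)
        else:
            spread = np.ones(n)  # first pick: score alone decides
        gain = score * spread
        gain[taken] = -np.inf  # never pick the same image twice
        pick = int(np.argmax(gain))
        selected.append(pick)
        taken[pick] = True
        dist = np.linalg.norm(emb - emb[pick], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return selected


# Toy data: two tight clusters; class 1 has no labeled examples yet.
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]
uncertainty = [0.9, 0.8, 0.2, 0.1]
predicted_class = [0, 0, 1, 1]
labeled_class_counts = [9, 0]

picked = select_for_annotation(embeddings, uncertainty, predicted_class,
                               labeled_class_counts, budget=2)
print(picked)  # [2, 0]
```

With this toy data the first pick is an image from the unlabeled class, and the second is the most uncertain image in the other cluster, so the two annotations cover both classes and both regions of the embedding space. In an iterative setting, the picked images would be annotated, the detector retrained, and the uncertainty values and class counts recomputed before the next round.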