Online Object Detection (OOD) algorithms play a crucial role in dynamic, real-world computer vision applications. In these scenarios, models are trained on a data stream in which samples of previously seen classes are revisited, a phenomenon known as Natural Replay (NR). During training, NR occurs unevenly across object categories, which biases evaluation metrics towards the most frequently revisited classes. Existing benchmarks do not quantify NR and depict only short-term training scenarios on a single domain. As a result, evaluating the generalization capabilities and forgetting rates of models becomes challenging in OOD. In this paper, we address these evaluation challenges with two contributions. First, we define a metric to quantify NR in an OOD scenario and show how NR is related to class-specific forgetting. Second, we introduce a novel benchmark, EgOAK, which provides a long-term training scenario with frequent domain shifts. It enables the evaluation of models' generalization capabilities and of their forgetting of knowledge from past domains. Our results in this OOD setting reveal that Experience Replay, a memory-based method, is particularly effective both for generalizing to new domains and for preserving past knowledge. Leveraging replay from memory helps to compensate for the low natural replay rate of rarely revisited classes, resulting in improved adaptability and reliability of models in dynamic environments.