“…From left to right, the columns respectively denote the architecture, the number of trainable parameters, the maximum GPU memory usage during training (data augmentation included), the minimum transcription level required, the minimum segmentation level required, the use of Pre-Training (PT) on subimages, the use of a specific Curriculum Learning (CL) strategy and, finally, the Hyperparameter Adaptation (HA) required from one dataset to another. As one can see, the models from [4,6,21] require transcription and segmentation labels at the word or line level to be trained, which implies more costly annotations. The models from [1,2,7] and the SPAN are pretrained on text line images to speed up convergence and to reach better results; they thus also use line-level segmentation and transcription labels, even though this is not strictly necessary.…”