It has become mainstream in computer vision and other machine learning domains to reuse backbone networks pre-trained on large datasets as preprocessors. Typically, the last layer is replaced by a shallow learning machine; the newly-added classification head and (optionally) deeper layers are then fine-tuned on a new task. Due to its strong performance and simplicity, ResNet152 is a common choice of pre-trained backbone. However, ResNet152 is relatively large and incurs substantial inference latency. In many cases, a compact and efficient backbone with similar performance would be preferable to a larger, slower one. This paper investigates techniques for reusing a pre-trained backbone with the objective of creating a smaller and faster model. Starting from a large ResNet152 backbone pre-trained on ImageNet, we first reduce it from 51 blocks to 5 blocks, cutting its parameter count and FLOPs by more than a factor of 6, without significant performance degradation. Then, we split the model after 3 blocks into several branches, while preserving the same number of parameters and FLOPs, to create an ensemble of sub-networks that improves performance. Our experiments on a large benchmark of 40 image classification datasets from various domains suggest that our techniques match, if not exceed, the performance of classical backbone fine-tuning while achieving a smaller model size and faster inference.
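To make the reuse recipe concrete, the following is a minimal PyTorch/torchvision sketch of truncating an ImageNet pre-trained ResNet152 to a handful of residual blocks, freezing it, and attaching a new shallow classification head. The cut point, the number of kept blocks, and the 10-class head are illustrative assumptions, not the exact configuration studied in this paper.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet pre-trained ResNet152 backbone.
# (Depending on the torchvision version, `weights="IMAGENET1K_V1"`
# or `pretrained=True` is the expected argument.)
backbone = models.resnet152(weights="IMAGENET1K_V1")

# Keep only the stem and a few bottleneck blocks; this particular
# cut is a hypothetical example of reducing the backbone's depth.
truncated = nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    *list(backbone.layer1.children()),        # first residual stage (3 blocks)
    *list(backbone.layer2.children())[:2],    # first 2 blocks of the second stage
)

# Freeze the reused backbone; only the new head is trained
# (last-layer fine-tuning). Deeper layers could also be unfrozen.
for p in truncated.parameters():
    p.requires_grad = False

# New shallow classification head for the target task
# (10 classes here is an arbitrary placeholder).
num_classes = 10
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(512, num_classes),  # layer2 bottlenecks output 512 channels
)

model = nn.Sequential(truncated, head)

# Sanity check with a dummy batch of ImageNet-sized inputs.
x = torch.randn(2, 3, 224, 224)
print(model(x).shape)  # -> torch.Size([2, 10])
```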
Background and motivations

Over the last decade, Deep Learning has set new standards in computer vision. Tasks in this area include the recognition of street signs, placards, and living beings. While it has achieved state-of-the-art results in various academic and industrial fields, training deep networks from scratch requires massive amounts of data and hours of GPU training, which prevents deployment in data-scarce and resource-scarce scenarios.

This limitation has mainly been addressed through the notion of transfer learning [1]. Here, knowledge is transferred from a source domain (typically learned from a large dataset) to one or several target domains (typically with less available data). A common transfer learning approach is last-layer fine-tuning [2], in which a considerable part (the backbone) of a pre-trained deep network is reused; only the last layer is replaced with a new classifier and trained on the new task at hand. Depending on the distribution shift between the source and target domains, more layers may be fine-tuned. Pre-trained networks used for fine-tuning range from the historical AlexNet [3] to various ResNets [4,5].

Modern neural networks are often regarded as "the bigger, the better", as large networks keep topping large benchmarks (such as ImageNet [6]). However, they are considerably over-parameterized when applied to smaller tasks, and there is evidence that low-complexity models can, under some conditions, achieve comparable or better performance [7]. Our goal is to elaborate on the basic "Reuse" methodology described above (replacing the last layer with a new classifier) by app...