Transfer learning is a cornerstone of computer vision, yet little work has been done to evaluate the relationship between architecture and transfer. An implicit hypothesis in modern computer vision research is that models that perform better on ImageNet necessarily perform better on other vision tasks. However, this hypothesis has never been systematically tested. Here, we compare the performance of 16 classification networks on 12 image classification datasets. We find that, when networks are used as fixed feature extractors or fine-tuned, there is a strong correlation between ImageNet accuracy and transfer accuracy (r = 0.99 and 0.96, respectively). In the former setting, we find that this relationship is very sensitive to the way in which networks are trained on ImageNet; many common forms of regularization slightly improve ImageNet accuracy but yield penultimate layer features that are much worse for transfer learning. Additionally, we find that, on two small fine-grained image classification datasets, pretraining on ImageNet provides minimal benefits, indicating the learned features from ImageNet do not transfer well to fine-grained tasks. Together, our results show that ImageNet architectures generalize well across datasets, but ImageNet features are less general than previously suggested.
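To make the two transfer settings concrete, here is a minimal PyTorch sketch contrasting a fixed feature extractor with end-to-end fine-tuning; the choice of torchvision's ResNet-50, the downstream class count, and the optimizer settings are illustrative assumptions, not the configurations used in the study.

```python
# Minimal sketch of the two transfer settings described above.
# Assumptions: torchvision's resnet50 stands in for one of the 16 networks,
# and `num_classes` is a placeholder for the downstream dataset's label count.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 100  # illustrative downstream dataset size

# Setting 1: fixed feature extractor -- freeze the backbone, train only a new head.
feature_extractor = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
for param in feature_extractor.parameters():
    param.requires_grad = False
feature_extractor.fc = nn.Linear(feature_extractor.fc.in_features, num_classes)
head_optimizer = torch.optim.SGD(feature_extractor.fc.parameters(), lr=0.01, momentum=0.9)

# Setting 2: fine-tuning -- initialize from ImageNet weights and update all parameters.
finetuned = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
finetuned.fc = nn.Linear(finetuned.fc.in_features, num_classes)
full_optimizer = torch.optim.SGD(finetuned.parameters(), lr=0.001, momentum=0.9)
```

In the first setting only the new classification head is updated, so the penultimate-layer features are exactly those learned on ImageNet; in the second, all weights move, which is why the two settings can respond differently to how the ImageNet model was regularized.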
Neurons in the temporal lobe of both monkeys and humans show selective responses to classes of visual stimuli and even to specific individuals. In this study, we investigate the latency and selectivity of visually responsive neurons recorded from microelectrodes in the parahippocampal cortex, entorhinal cortex, hippocampus, and amygdala of human subjects during a visual object presentation task. During 96 experimental sessions in 35 subjects, we recorded from a total of 3278 neurons. Of these units, 398 responded selectively to one or more of the presented stimuli. Mean response latencies were substantially longer than those reported in monkeys. We observed a highly significant correlation between the latency and the selectivity of these neurons: the longer the latency, the greater the selectivity. In particular, parahippocampal neurons responded significantly earlier and less selectively than those in the other three regions. Regional analysis showed significant correlations between latency and selectivity within the parahippocampal cortex, entorhinal cortex, and hippocampus, but not within the amygdala. The later and more selective responses tended to be generated by cells with sparse baseline firing rates, and vice versa. Our results provide direct evidence for hierarchical processing of sensory information at the interface between the visual pathway and the limbic system, by which increasingly refined and specific representations of stimulus identity are generated over time along the anatomical pathways of the medial temporal lobe.
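As a rough illustration of the latency-selectivity analysis described above, the sketch below computes a rank correlation across neurons; the placeholder arrays, the particular selectivity measure, and the choice of Spearman correlation are assumptions for illustration, not the study's actual data or statistics.

```python
# Illustrative sketch: correlating response latency with selectivity across neurons.
# `latencies_ms` and `selectivity_index` are hypothetical per-neuron arrays; the
# study's actual selectivity measure and statistical test may differ.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_neurons = 398  # number of selectively responding units reported above

latencies_ms = rng.uniform(200, 600, n_neurons)                            # placeholder latencies
selectivity_index = 0.001 * latencies_ms + rng.normal(0, 0.1, n_neurons)   # placeholder selectivity

rho, p_value = stats.spearmanr(latencies_ms, selectivity_index)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3g}")
```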
Significance: The anatomy and dynamics of different layers of the cerebral cortex are distinct. Physiological work in the sensory cortex has investigated how different layers process sensory inputs, and how they are engaged during attention tasks. In the frontal and prefrontal cortices, where lamination is present, very few studies have investigated the role of distinct layers in cognition. We studied frontal cortex laminar neuronal activity as monkeys performed working memory tasks. Spiking and gamma-band activity (50–150 Hz) in the superficial layers reflected active maintenance of working memories. Alpha/beta frequencies (4–22 Hz) in the deep layers modulated the gamma activity in the superficial layers. This might serve a control function, allowing information to enter or exit active storage in superficial layers.
We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) Towards accelerating training, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13–25% while matching the performance of bfloat16 training within 0.1 percentage points for the 1B-parameter CLIP ViT-Huge, the largest int8 training to date. Our main focus is int8, as GPU support for float8 is rare, though we also analyze float8 training through simulation. While SwitchBack proves effective for float8, we show that standard techniques are also successful if the network is trained and initialized so that large feature magnitudes are discouraged, which we accomplish via layer-scale initialized with zeros. 2) Towards stable training, we analyze loss spikes and find they consistently occur 1–8 iterations after the squared gradients become underestimated by their AdamW second-moment estimator. As a result, we recommend an AdamW-Adafactor hybrid, which we refer to as StableAdamW because it avoids loss spikes when training a CLIP ViT-Huge model and outperforms gradient clipping.
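The sketch below shows one way to realize the AdamW-Adafactor hybrid idea described above: a standard AdamW step whose learning rate is scaled down, Adafactor-style, when the current squared gradients are large relative to the second-moment estimate (the regime the abstract identifies as preceding loss spikes). The function name, the per-tensor RMS rule, and the threshold of 1.0 are assumptions for illustration, not necessarily the exact StableAdamW recipe.

```python
# Illustrative sketch of an AdamW step with Adafactor-style update clipping.
# The caller maintains `exp_avg` and `exp_avg_sq` (zero-initialized, same shape as
# `param`) and an integer `step` starting at 1; names and defaults are assumptions.
import torch

def stable_adamw_like_step(param, grad, exp_avg, exp_avg_sq, step,
                           lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """One in-place update for a single parameter tensor."""
    beta1, beta2 = betas
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)               # first-moment EMA
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second-moment EMA

    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    v_hat = exp_avg_sq / bias_correction2
    denom = v_hat.sqrt().add_(eps)

    # Adafactor-style update clipping: shrink this tensor's step when the squared
    # gradients are underestimated by the second-moment estimator.
    rms = (grad.pow(2) / v_hat.clamp_min(eps ** 2)).mean().sqrt()
    scaled_lr = lr / max(1.0, rms.item())

    param.mul_(1 - scaled_lr * weight_decay)                      # decoupled weight decay
    param.addcdiv_(exp_avg, denom, value=-scaled_lr / bias_correction1)
```

Unlike ordinary gradient clipping, which bounds the raw gradient norm, this scales the optimizer step only when the gradient outruns its own second-moment estimate.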