An increasing number of applications in computer vision, especially in medical imaging and remote sensing, involve the challenging task of classifying very large images that contain tiny objects. More specifically, these classification tasks face two key challenges: i) the input images in the target dataset are usually on the order of megapixels, yet existing deep architectures cannot easily operate on such large images due to memory constraints, so a memory-efficient method is needed to process them; and ii) only a small fraction of the input image is informative of the label of interest, resulting in a low region-of-interest (ROI) to image ratio. However, most current convolutional neural networks (CNNs) are designed for image classification datasets with relatively large ROIs and small (sub-megapixel) images. Existing approaches have addressed these two challenges only in isolation. We present an end-to-end CNN model, termed Zoom-In network, that leverages hierarchical attention sampling to classify large images with tiny objects using a single GPU. We evaluate our method on two large-image datasets and one gigapixel dataset. Experimental results show that our model achieves higher accuracy than existing methods while requiring fewer computing resources.
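To make the core idea concrete before the formal presentation, the following is a minimal sketch of attention-based patch sampling, assuming a PyTorch-style setup. It shows a single zoom level, in which an attention map computed on a downsampled view guides which full-resolution patches are processed; the hierarchical version would repeat this at successively finer scales. All names and sizes here (AttentionSampler, patch, n_patches, the small CNNs) are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionSampler(nn.Module):
    """One level of attention-guided patch sampling (illustrative sketch)."""
    def __init__(self, patch=32, n_patches=8):
        super().__init__()
        self.patch, self.n_patches = patch, n_patches
        # Small CNN scoring a cheap, low-resolution view of the image.
        self.attn = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 1, 3, padding=1),
        )
        # Feature extractor applied only to the sampled high-res patches.
        self.feat = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16, 2)  # e.g., metastasis vs. benign

    def forward(self, full_img):
        # 1) Score a downsampled view instead of the full image.
        low = F.interpolate(full_img, scale_factor=0.125, mode="bilinear")
        probs = torch.softmax(self.attn(low).flatten(1), dim=1)
        # 2) Sample patch locations proportional to attention.
        #    (Training the attention map itself requires a Monte Carlo
        #    gradient estimator, omitted here for brevity.)
        idx = torch.multinomial(probs, self.n_patches, replacement=True)
        B, _, H, W = full_img.shape
        Hl, Wl = low.shape[-2:]
        feats = []
        for b in range(B):
            for i in idx[b]:
                # Map the low-res cell back to full-res coordinates.
                y = int(i) // Wl * (H // Hl)
                x = int(i) % Wl * (W // Wl)
                p = full_img[b:b + 1, :, y:y + self.patch, x:x + self.patch]
                # Zero-pad patches clipped at the image border.
                p = F.pad(p, (0, self.patch - p.shape[-1],
                              0, self.patch - p.shape[-2]))
                feats.append(self.feat(p))
        # 3) Aggregate patch features and classify.
        feats = torch.stack(feats).view(B, self.n_patches, -1).mean(1)
        return self.head(feats)

model = AttentionSampler()
x = torch.randn(1, 3, 1024, 1024)  # stand-in for a large image
print(model(x).shape)              # torch.Size([1, 2])
```

The memory saving comes from step 2: only n_patches × patch² pixels are ever pushed through the feature extractor at full resolution, regardless of how large the input image is.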
Introduction

Neural networks have achieved state-of-the-art performance in many image classification tasks [1]. However, there remain many scenarios in which they can be improved. Applying modern deep neural networks to image inputs of very high resolution is a non-trivial problem due to the challenges of scaling model architectures [2]. Such images are common, for instance, in satellite and medical imaging. Moreover, these images tend to grow even larger due to the rapid growth in computational and memory availability, as well as advances in camera sensor technology. Especially challenging are the so-called tiny-object image classification tasks, where the goal is to classify images based on the information in very small objects or regions of interest (ROIs), in the presence of a much larger background that is uncorrelated with, or uninformative of, the label; consequently, the input image has a very low ROI-to-image ratio.

Recent work [3] showed that, with a dataset of limited size, convolutional neural networks (CNNs) perform poorly on very low ROI-to-image ratio problems. In these settings, the input resolution grows from typical image sizes, e.g., 224 × 224 pixels, to gigapixel images ranging from 45,056 × 35,840 to 217,088 × 111,104 pixels [4], which not only requires significantly more computation per image for a fixed deep architecture, but in some cases becomes prohibitive for current GPU-memory standards (see the back-of-the-envelope calculation below). Figure 1 shows an example of a gigapixel image, from which we see that manually annotated ROIs (with cancer metastases), not usually available for model training, constitute a tiny proportion of ...
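To put the cited image sizes in perspective, the following is a back-of-the-envelope memory calculation. It is our own illustration, not a figure from the paper; it assumes only float32 storage and the largest image size quoted above.

```python
# Illustrative memory-footprint check for gigapixel inputs (float32).
bytes_per_float = 4  # float32

typical = 224 * 224 * 3 * bytes_per_float       # standard CNN input
giga = 217_088 * 111_104 * 3 * bytes_per_float  # largest size cited above

print(f"224 x 224 input tensor:   {typical / 2**20:.2f} MiB")  # ~0.57 MiB
print(f"217,088 x 111,104 input:  {giga / 2**30:.1f} GiB")     # ~269.6 GiB

# A single 64-channel float32 feature map at full resolution would take
# 217_088 * 111_104 * 64 * 4 bytes, roughly 6.2 TB -- orders of magnitude
# beyond the memory of any single GPU.
```

Even before a single convolution is applied, the raw input alone exceeds current GPU memory by an order of magnitude, which is why full-resolution processing must be restricted to a small set of sampled sub-regions.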