Recently, vector-quantized image modeling has demonstrated impressive performance on generation tasks such as text-to-image generation. However, we find that current image quantizers do not satisfy translation equivariance in the quantized space due to aliasing, which degrades performance on downstream text-to-image and image-to-text generation, even in simple experimental setups. Instead of focusing on anti-aliasing, we take a direct approach and encourage translation equivariance in the quantized space. In particular, we explore a desirable property of image quantizers, called 'Translation Equivariance in the Quantized Space', and propose a simple but effective way to achieve it by regularizing orthogonality among the codebook embedding vectors. With this method, we improve accuracy by +22% in text-to-image generation and +26% in image-to-text generation, outperforming VQGAN.
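As a rough illustration of the orthogonality regularization named above, the following PyTorch sketch penalizes the deviation of the codebook's Gram matrix from the identity; the function name, the row normalization, and the 1/K^2 scaling are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def codebook_orthogonality_loss(codebook: torch.Tensor) -> torch.Tensor:
    """Penalize deviation of the codebook's Gram matrix from the
    identity, pushing the embedding vectors toward mutual
    orthogonality. `codebook` is a (K, D) matrix of K code vectors."""
    normed = F.normalize(codebook, dim=1)                  # unit-norm rows
    gram = normed @ normed.t()                             # (K, K) cosine similarities
    identity = torch.eye(codebook.size(0), device=codebook.device)
    # squared Frobenius distance, scaled so the loss does not grow with K
    return ((gram - identity) ** 2).sum() / codebook.size(0) ** 2

# Hypothetical usage inside a VQGAN-style training step:
# loss = recon_loss + vq_loss + lambda_ortho * codebook_orthogonality_loss(quantizer.embedding.weight)
```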
Semiconductor wafer defects severely affect product development. To reduce the occurrence of defects, it is necessary to identify why they occur, which can be inferred by analyzing defect patterns. Automatic defect classification (ADC) is used to analyze large numbers of samples; it can reduce the human resources required for defect inspection and improve inspection quality. Although several ADC systems have been developed to identify and classify wafer surface defects, conventional machine-learning-based ADC methods rely on numerous image recognition features for defect classification and tend to be costly, inefficient, and time-consuming. Here, an ADC technique based on a deep ensemble feature framework (DEFF) is proposed that automatically classifies different kinds of wafer surface damage. DEFF consists of an ensemble feature network and a final decision network layer. The feature network learns representations of wafer defects using multiple pre-trained convolutional neural network (CNN) models, and the ensemble features are computed by concatenating the per-model features. The decision network layer then predicts the classification labels from the ensemble features. Classification performance is further enhanced by combining the deep ensemble features with a voting-based ensemble learning strategy. We show the efficacy of the proposed strategy using real-world data from SK Hynix.
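A minimal sketch of the concatenation-based ensemble feature idea, assuming frozen torchvision backbones (ResNet-18 and VGG-16 are placeholder choices here, not necessarily the models used in DEFF) feeding a single linear decision layer:

```python
import torch
import torch.nn as nn
from torchvision import models

class DeepEnsembleFeatureClassifier(nn.Module):
    """Sketch: extract features with several frozen pre-trained CNNs,
    concatenate them into one ensemble feature, and classify with a
    small decision network layer."""

    def __init__(self, num_classes: int):
        super().__init__()
        resnet = models.resnet18(weights="DEFAULT")
        vgg = models.vgg16(weights="DEFAULT")
        self.backbones = nn.ModuleList([
            nn.Sequential(*list(resnet.children())[:-1]),           # -> (N, 512, 1, 1)
            nn.Sequential(vgg.features, nn.AdaptiveAvgPool2d(1)),   # -> (N, 512, 1, 1)
        ])
        for p in self.backbones.parameters():
            p.requires_grad = False                                 # frozen feature extractors
        self.decision = nn.Linear(512 + 512, num_classes)           # decision network layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [b(x).flatten(1) for b in self.backbones]           # per-backbone features
        ensemble = torch.cat(feats, dim=1)                          # concatenated ensemble feature
        return self.decision(ensemble)
```

The voting-based refinement could then, for example, take a majority vote over the predictions of this head and those of the individual backbones' own classifiers.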
Recently, a number of studies have demonstrated impressive performance on diverse vision-language multi-modal tasks, such as image captioning and visual question answering, by extending the BERT architecture with multi-modal pre-training objectives. In this work, we explore a broad set of multi-modal representation learning tasks in the medical domain, specifically using radiology images and their unstructured reports. We propose Medical Vision Language Learner (MedViLL), which adopts a Transformer-based architecture combined with a novel multi-modal attention masking scheme to maximize generalization performance for both vision-language understanding tasks (image-report retrieval, disease classification, medical visual question answering) and a vision-language generation task (report generation). By rigorously evaluating the proposed model on four downstream tasks with two chest X-ray image datasets (MIMIC-CXR and Open-I), we empirically demonstrate the superior downstream task performance of MedViLL against various baselines, including task-specific architectures.
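To make the attention-masking idea concrete, here is a hypothetical sketch of a joint image-text attention mask that is fully bidirectional for understanding tasks and causal over the text segment for generation; MedViLL's actual masking scheme may differ in its details.

```python
import torch

def joint_attention_mask(num_img: int, num_txt: int, causal_text: bool) -> torch.Tensor:
    """Build an (num_img + num_txt)-square boolean mask over a joint
    [image tokens | text tokens] sequence. True = attention allowed.
    With causal_text=True, text tokens see all image tokens but only
    preceding text tokens (generation); otherwise attention is fully
    bidirectional (understanding)."""
    n = num_img + num_txt
    mask = torch.ones(n, n, dtype=torch.bool)
    if causal_text:
        # causal (lower-triangular) attention within the text segment
        mask[num_img:, num_img:] = torch.tril(
            torch.ones(num_txt, num_txt, dtype=torch.bool))
        # image tokens do not peek at text tokens in this variant
        mask[:num_img, num_img:] = False
    return mask

# e.g. joint_attention_mask(num_img=49, num_txt=128, causal_text=True)
```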