Reproducible definition and identification of cell types are essential for investigating their biological function and for understanding their relevance in the context of development, disease, and evolution. Current approaches either model variability in the data as continuous latent factors and then apply clustering as a separate step, or apply clustering to the data directly. Clusters obtained in this manner are treated as putative cell types in atlas-scale efforts such as those for mammalian brains. We show that such approaches can suffer from qualitative mistakes, failing to identify cell types robustly, particularly when the number of cell types is in the hundreds or even thousands. Here, we propose an unsupervised method, MMIDAS (Mixture Model Inference with Discrete-coupled AutoencoderS), which combines a generalized mixture model with a multi-armed deep neural network to jointly infer the discrete cell type and the continuous, type-specific variability. We develop this framework so that it can be applied to the analysis of both uni-modal and multi-modal datasets. Using four recent datasets of brain cells spanning different technologies, species, and conditions, we demonstrate that MMIDAS significantly outperforms state-of-the-art models in inferring interpretable discrete and continuous representations of cellular identity, and that it uncovers novel biological insights. Our unsupervised framework can thus help researchers identify more robust cell types, study cell-type-dependent continuous variability, interpret such latent factors in the feature domain, and analyze multi-modal datasets.
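To make the architectural idea concrete, the following is a minimal, illustrative sketch (not the published implementation) of a two-arm autoencoder in which each arm infers a discrete cell-type variable alongside a continuous within-type latent, with the arms coupled by a penalty on disagreement between their categorical posteriors. The Gumbel-Softmax relaxation, the agreement term, and all layer sizes and loss weights are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Arm(nn.Module):
    """One autoencoder arm with a discrete (type) head and a continuous head."""
    def __init__(self, n_genes, n_types, n_cont, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU())
        self.type_logits = nn.Linear(hidden, n_types)    # discrete cell-type head
        self.cont_mu = nn.Linear(hidden, n_cont)         # continuous within-type head
        self.cont_logvar = nn.Linear(hidden, n_cont)
        self.decoder = nn.Sequential(
            nn.Linear(n_types + n_cont, hidden), nn.ReLU(),
            nn.Linear(hidden, n_genes))

    def forward(self, x, tau=1.0):
        h = self.encoder(x)
        logits = self.type_logits(h)
        c = F.gumbel_softmax(logits, tau=tau, hard=False)          # relaxed one-hot type
        mu, logvar = self.cont_mu(h), self.cont_logvar(h)
        s = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterized sample
        x_hat = self.decoder(torch.cat([c, s], dim=-1))
        return x_hat, logits, mu, logvar

def loss(x, arms_out, beta=1.0, gamma=1.0):
    # reconstruction + Gaussian KL per arm, plus a coupling term that
    # encourages the two arms' categorical posteriors to agree
    recon = sum(F.mse_loss(x_hat, x) for x_hat, *_ in arms_out)
    kl = sum((-0.5 * (1 + lv - mu.pow(2) - lv.exp())).sum(-1).mean()
             for _, _, mu, lv in arms_out)
    p = [F.softmax(logits, dim=-1) for _, logits, *_ in arms_out]
    agree = F.mse_loss(p[0], p[1])
    return recon + beta * kl + gamma * agree

# toy usage: 32 cells x 5000 genes, 100 putative types, 10 continuous factors
x = torch.randn(32, 5000)
arms = [Arm(5000, n_types=100, n_cont=10) for _ in range(2)]
print(loss(x, [arm(x) for arm in arms]))
```

The key design point this sketch illustrates is that type assignment and continuous variability are inferred jointly within one model, rather than fitting continuous factors first and clustering afterwards.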