Large databases such as the National Health and Nutrition Examination Survey (NHANES) are valuable for epidemiological studies but pose challenges for machine learning: data heterogeneity, varied sample sizes, missing values and outliers, and variations in data collection and interpretation all require thorough data-quality assessment and cleaning. In addition, complex disease outcomes often display a high degree of clinical heterogeneity, necessitating deeper phenotypic subtyping. Here, we develop an integrated data-cleaning and subtype-discovery pipeline that uses unsupervised learning algorithms for comprehensive analysis and for network-based and clustering-based visualization of data patterns and outcomes. We apply this pipeline to NHANES, one of the largest curated repositories of population-level health-related indicators, which includes physical examination, blood biochemistry, self-reported survey, and dietary intake data. We focus on dental caries, which remains the most prevalent chronic disease, affecting more than 3.5 billion people worldwide. Our multidimensional pipeline declutters and optimizes the NHANES data, including redundant variable types, to streamline data integration and create a ‘machine learning-ready’ version of the dataset. In addition, this approach reveals data patterns that led to the discovery of previously unrecognized subtypes and variables associated with the clinical phenotype heterogeneity of dental caries. We observed diverging patterns of similarity across age groups and variable subsets, and derived unexpected associations of sleep deprivation and specific laboratory markers with the disease. Altogether, we report a comprehensive data processing approach that can guide the development of more precise and robust machine learning predictive models for dental caries and other health conditions from NHANES.
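
As a rough illustration of the general idea of cleaning followed by unsupervised grouping, the sketch below drops sparse variables, median-imputes and standardizes the remainder, and then applies a plain k-means. It is not the pipeline reported here: the missingness threshold, the choice of k-means, the function names (`clean_columns`, `kmeans`), the variable names, and the toy values are all assumptions made for illustration, and only the Python standard library is used to keep the example self-contained.

```python
import math
import statistics


def clean_columns(rows, max_missing=0.5):
    """Drop sparse or constant variables, median-impute, then z-score.

    `rows` is a list of dicts mapping variable name -> value (None = missing).
    Returns the kept variable names and a row-major matrix of z-scores.
    The 0.5 missingness cut-off is an arbitrary illustrative choice.
    """
    names = sorted({key for row in rows for key in row})
    kept, columns = [], []
    for name in names:
        values = [row.get(name) for row in rows]
        observed = [v for v in values if v is not None]
        if not observed or 1 - len(observed) / len(values) > max_missing:
            continue  # too sparse to keep
        median = statistics.median(observed)
        filled = [median if v is None else v for v in values]
        mean = statistics.fmean(filled)
        sd = statistics.pstdev(filled)
        if sd == 0:
            continue  # constant variable, no information for clustering
        kept.append(name)
        columns.append([(v - mean) / sd for v in filled])
    matrix = [list(point) for point in zip(*columns)]
    return kept, matrix


def kmeans(points, k, iterations=50):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [statistics.fmean(dim) for dim in zip(*members)]
    return labels


# Toy records with hypothetical variable names and made-up values.
rows = [
    {"exam_a": 8.0, "lab_b": 20.0, "survey_sparse": None},
    {"exam_a": 7.5, "lab_b": 25.0, "survey_sparse": None},
    {"exam_a": None, "lab_b": 22.0, "survey_sparse": 1.0},
    {"exam_a": 4.0, "lab_b": 90.0, "survey_sparse": None},
    {"exam_a": 4.5, "lab_b": 95.0, "survey_sparse": None},
    {"exam_a": 5.0, "lab_b": None, "survey_sparse": None},
]

kept, matrix = clean_columns(rows)
labels = kmeans(matrix, k=2)
print(kept)
print(labels)
```

In this toy input, `survey_sparse` is missing in five of six records and is therefore dropped before clustering. A real analysis of NHANES would more plausibly use a data-frame library and would also need to account for variable types and survey design, which this sketch ignores.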