Labeling training datasets has become a key barrier to building medical machine learning models. One strategy is to generate training labels programmatically, for example by applying natural language processing pipelines to text reports associated with imaging studies. We propose cross-modal data programming, which generalizes this intuitive strategy in a theoretically grounded way that enables simpler, clinician-driven input, reduces required labeling time, and improves with additional unlabeled data. In this approach, clinicians generate training labels for models defined over a target modality (e.g., images or time series) by writing rules over an auxiliary modality (e.g., text reports). The resulting technical challenge consists of estimating the accuracies and correlations of these rules; we extend a recent unsupervised generative modeling technique to handle this cross-modal setting in a provably consistent way. Across four applications in radiography, computed tomography, and electroencephalography, and using only several hours of clinician time, our approach matches or exceeds the efficacy of physician-months of hand-labeling with statistical significance, demonstrating a fundamentally faster and more flexible way of building machine learning models in medicine.

Modern machine learning approaches have achieved impressive empirical successes on diverse clinical tasks that include predicting cancer prognosis from digital pathology,1,2 classifying skin lesions from dermatoscopy,3 characterizing retinopathy from fundus photographs,4 detecting intracranial hemorrhage on computed tomography,5,6 and performing automated interpretation of chest radiographs.7,8 Remarkably, these applications typically build on standardized reference neural network architectures9 supported in professionally maintained open-source frameworks,10,11 suggesting that model design is no longer a major barrier to entry in medical machine learning. However, each of these successes was predicated on a not-so-hidden cost: massive hand-labeled training datasets, often produced through years of institutional investment and expert clinician labeling time, totaling hundreds of thousands of dollars per task or more.4,12 In addition to being extremely costly, these training sets are inflexible: given a new classification schema, imaging system, patient population, or other change in the data distribution or modeling task, the training set generally needs to be relabeled from scratch. These factors suggest