Machine learning (ML) workflows enable unbiased and robust evaluation of complex datasets and are increasingly sought for analyzing large transcriptome-based datasets. Here, we analyzed over 490,000,000 data points to compare 10 different ML algorithms in a large (N=11,652) training dataset of single-cell RNA-sequencing (scRNA-seq) of human pancreatic cells to identify features (genes) associated with the presence or absence of insulin gene transcript(s). The prediction accuracy and sensitivity of the models were tested in a separate validation dataset (N=2,913 single-cell transcriptomes), and the efficacy of each ML workflow in accurately identifying insulin-producing cells was assessed. Overall, Ensemble ML workflows, and in particular the Random Forest algorithm, delivered high predictive power in a receiver operating characteristic (ROC) curve analysis (AUC=0.83) at the highest sensitivity (0.98) relative to the other nine algorithms. The top 10 features common to the three Ensemble ML workflows (including IAPP, ADCYAP1, LDHA and SST) were localized to human islet β-cells as well as non-β cells and were significantly dysregulated in scRNA-seq datasets from Ire-1αβ-/- mice, which demonstrate de-differentiation of pancreatic β-cells, as well as in pancreatic single cells from individuals with Type 2 Diabetes. Our findings provide a direct comparison of ML workflows in big data analyses, identify key determinants of insulin transcription and provide workflows for other regulatory analyses to identify/validate novel genes/features of endocrine pancreatic gene transcription.
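For readers less familiar with the two performance measures reported above, the following minimal sketch illustrates how sensitivity and the area under the ROC curve (AUC) can be computed for a binary classifier such as one predicting the presence or absence of insulin transcripts. It is not the pipeline used in this study: the function names, the toy labels and scores, and the 0.5 decision threshold are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; not the authors' workflow.
# Labels: 1 = insulin transcript present, 0 = absent (toy data, hypothetical).


def sensitivity(y_true, y_pred):
    """True-positive rate: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("no positive samples in y_true")
    return tp / (tp + fn)


def roc_auc(y_true, scores):
    """AUC via its rank interpretation: the probability that a randomly
    chosen positive receives a higher score than a randomly chosen
    negative, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical validation labels and classifier scores for eight cells.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.7, 0.2, 0.5]

# Assumed decision threshold of 0.5 to turn scores into class calls.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("sensitivity:", sensitivity(y_true, y_pred))
print("AUC:", roc_auc(y_true, scores))
```

Sensitivity depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why a model can combine a moderate AUC with a very high sensitivity.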