Related articles have outlined problems with the development of machine-learned solutions for health care and suggested a framework for their optimal development.1,2 The spectrum of clinical settings in which machine learning approaches have been examined for use in health care has broadened markedly in recent years. Many studies have detailed the data science and statistical bases of machine-learned tools.2 However, comparatively few have focused on their evaluation and implementation.3 We discuss how to evaluate machine-learned solutions throughout their life cycle to optimize their use and functionality in clinical practice. Internal validation (that is, ascertaining the discriminative and calibration performance of an algorithm) should be followed by evaluation of both performance and outcomes of interest in the clinical setting, as well as evaluation of the tool's implementation into existing workflows (as outlined in Figure 1).
What is the process of model or algorithm development and internal validation?

Initially, evaluation of the predictive performance of machine-learned algorithms involves assessing their discriminatory and calibration accuracy. The former quantifies the ability of the algorithm to separate individuals according to the presence or absence of a given outcome, and the latter measures how close the predicted probabilities are to actual probabilities.4 Such experiments comprise the internal validation stage of machine-learned algorithm development and represent the majority of published reports describing machine learning in medicine.3 Typically, studies determining the predictive performance and accuracy of different algorithms are retrospective in nature. Large, historically labelled data sets are used to train and test algorithms.3,5 Machine learning methods employed at this stage range from relatively familiar approaches such as linear or logistic regression to more complex neural networks and natural language processing models.5,6 In all cases, algorithms are first "trained" on the largest portion of the data reserved for this purpose, and then evaluated on the remaining data, referred to as the test data.3-5 When the outcome of interest is binary (e.g., disease present or absent), discrimination is typically reported using the area under the receiver operating characteristic curve.

Key points
• Evaluation of machine-learned systems is a multifaceted process that encompasses internal validation, clinical validation, clinical outcomes evaluation, implementation research and postimplementation evaluation.
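The two internal-validation quantities described above can be computed directly on held-out test data. The sketch below is illustrative only: the function names (`auroc`, `brier_score`) and the toy labels and predicted probabilities are assumptions for demonstration, not part of any tool discussed in the article. `auroc` measures discrimination (the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case), and `brier_score` is one simple summary of how far predicted probabilities sit from observed outcomes.

```python
def auroc(y_true, y_score):
    """Discrimination: fraction of positive-negative pairs in which the
    positive case is ranked above the negative case (ties count half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, y_score):
    """Mean squared gap between predicted probabilities and observed
    binary outcomes (lower is better); a simple calibration summary."""
    return sum((s - y) ** 2 for y, s in zip(y_true, y_score)) / len(y_true)

# Hypothetical held-out test set: true outcomes and model probabilities.
y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.7, 0.4, 0.2, 0.3, 0.5]

print(f"AUROC: {auroc(y_true, y_score):.3f}")        # ranking quality
print(f"Brier score: {brier_score(y_true, y_score):.3f}")
```

In practice, established libraries provide these metrics, and calibration is usually examined more fully with calibration curves rather than a single score; the point here is only what the two accuracy concepts measure.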
CMAJ · Early release, published at www.cmaj.ca on August 30, 2021. Subject to revision.