In this work we examine the problems associated with developing machine learning models that achieve robust generalization in common-task, multiple-database scenarios. Focusing on what we refer to as the "database variability problem", we use a specific medical domain (sleep staging in Sleep Medicine) to show that translating a model's estimated local generalization capabilities to independent external databases is far from trivial. We analyze some of the scalability problems that arise when data from multiple databases are used to train a single learning model. We then introduce a novel approach based on an ensemble of local models and show its advantages in terms of inter-database generalization performance and data scalability. Finally, we analyze different model configurations and data preprocessing techniques to evaluate their effects on overall generalization performance. For this purpose we carry out experiments involving several sleep databases, evaluating different machine learning models based on Convolutional Neural Networks.
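To make the ensemble-of-local-models idea concrete, the following is a minimal sketch, not the authors' implementation: one model is trained per source database and their predictions are combined by majority vote on unseen data. All names (train_local_model, ensemble_predict, the toy databases) are illustrative assumptions, and a scikit-learn classifier stands in for the Convolutional Neural Networks described in the abstract.

```python
# Hypothetical sketch of an ensemble of local models; LogisticRegression
# is a lightweight stand-in for the per-database CNNs in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_local_model(X, y):
    """Fit one model on a single source database."""
    return LogisticRegression(max_iter=1000).fit(X, y)

def ensemble_predict(models, X):
    """Combine per-database models by majority vote over their predictions."""
    votes = np.stack([m.predict(X) for m in models])      # (n_models, n_samples)
    return np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), 0, votes)  # per-sample mode

# Toy data standing in for three sleep databases (features X, stage labels y).
rng = np.random.default_rng(0)
databases = [(rng.normal(size=(200, 16)), rng.integers(0, 5, 200))
             for _ in range(3)]

models = [train_local_model(X, y) for X, y in databases]  # one model per database
X_external = rng.normal(size=(50, 16))                    # unseen external database
print(ensemble_predict(models, X_external)[:10])
```

The design intent is that no single model ever sees pooled multi-database data, which sidesteps the scalability issues of training one monolithic model while still aggregating evidence from all sources at prediction time.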
Objective: To assess the validity of an automatic EEG arousal detection algorithm using large patient samples and different heterogeneous databases. Methods: Automatic scorings were compared against the results of human expert scorers on a total of 2768 full-night PSG recordings obtained from two different databases. Of these, 472 recordings were obtained during clinical routine at our sleep center and were subdivided into two subgroups of 220 (HMC-S) and 252 (HMC-M) recordings, according to the procedure followed by the clinical expert during the visual review (semi-automatic or purely manual, respectively). In addition, 2296 recordings from the public SHHS-2 database were evaluated against the respective manual expert scorings. Results: Event-by-event, epoch-based validation resulted in an overall Cohen's kappa agreement of κ = 0.600 (HMC-S), 0.559 (HMC-M), and 0.573 (SHHS-2). Estimated inter-scorer variability on these datasets was, respectively, κ = 0.594, 0.561, and 0.543. Analyses of the corresponding Arousal Index scores showed associated automatic-human repeatability indices ranging from 0.693 to 0.771 (HMC-S), 0.646 to 0.791 (HMC-M), and 0.759 to 0.791 (SHHS-2). Conclusions: Large-scale validation of our automatic EEG arousal detector on different databases has shown robust performance and good generalization, comparable to the expected levels of human agreement. Special emphasis has been placed on the reproducibility of the results, and an implementation of our method has been made available online as open-source code.
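The agreement metric used above is Cohen's kappa, which corrects raw epoch-by-epoch agreement for the agreement expected by chance. The following is a generic worked example of that computation, not the authors' evaluation code; the label sequences are fabricated toy data, and the computation relies on scikit-learn's cohen_kappa_score.

```python
# Toy illustration of epoch-based Cohen's kappa agreement between an
# automatic scorer and a human expert (fabricated labels, not study data).
from sklearn.metrics import cohen_kappa_score

# Per-epoch binary labels: 1 = arousal scored in that epoch, 0 = no arousal.
expert    = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
automatic = [0, 1, 1, 1, 0, 0, 0, 0, 1, 0]

# kappa = (p_observed - p_chance) / (1 - p_chance); 1 is perfect agreement,
# 0 is chance-level agreement.
print(f"kappa = {cohen_kappa_score(expert, automatic):.3f}")
```

Comparing the automatic-vs-human kappa to the estimated inter-scorer kappa, as the abstract does, is what supports the claim that the detector performs at the level of agreement expected between human experts.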