This paper presents a benchmark of supervised Automated Machine Learning (AutoML) tools. First, we analyze the characteristics of eight recent open-source AutoML tools (Auto-Keras, Auto-PyTorch, Auto-Sklearn, AutoGluon, H2O AutoML, rminer, TPOT and TransmogrifAI) and describe the twelve popular OpenML datasets used in the benchmark (divided into regression, binary and multi-class classification tasks). Then, we perform a comparison study with hundreds of computational experiments based on three scenarios: General Machine Learning (GML), Deep Learning (DL) and XGBoost (XGB). To select the best tool, we used a lexicographic approach, considering first the average prediction score for each task and then the computational effort. The best predictive results were achieved in the GML scenario, and these were further compared with the best OpenML public results. Overall, the best GML AutoML tools obtained competitive results, outperforming the best OpenML models on five datasets. These results confirm the potential of general-purpose AutoML tools to fully automate Machine Learning (ML) algorithm selection and tuning.
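The lexicographic selection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `ToolResult` record, the `select_best` function, and the `tolerance` parameter are hypothetical names introduced here, and the sketch assumes that a higher average score is better and that computational effort is measured in seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolResult:
    """Hypothetical record of one tool's aggregated benchmark outcome."""
    tool: str
    avg_score: float        # assumed: higher is better
    effort_seconds: float   # assumed: lower is better


def select_best(results, tolerance=0.0):
    """Pick a tool lexicographically: best average score first, then lowest effort.

    `tolerance` is an assumption of this sketch: scores within `tolerance`
    of the best score are treated as tied before effort is compared.
    """
    if not results:
        raise ValueError("results must not be empty")
    best_score = max(r.avg_score for r in results)
    # First criterion: keep only the tools tied on predictive score.
    tied = [r for r in results if best_score - r.avg_score <= tolerance]
    # Second criterion: among those, prefer the lowest computational effort.
    return min(tied, key=lambda r: r.effort_seconds)


# Placeholder values for illustration only; these are not benchmark results.
example = [
    ToolResult("tool_a", 0.90, 120.0),
    ToolResult("tool_b", 0.90, 60.0),
    ToolResult("tool_c", 0.85, 10.0),
]
winner = select_best(example)
```

With exact ties only (`tolerance=0.0`), the second criterion matters only when two tools reach the same average score, so a non-zero tolerance may be a more practical choice for floating-point metrics.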
Automation and scalability are currently two of the main challenges of Machine Learning (ML). This paper proposes an automated and distributed ML framework that automatically trains a supervised learning model and produces predictions independently of the dataset and with minimum human input. The framework was designed for the domain of telecommunications risk management, which often requires supervised learning models that need to be quickly updated by non-ML-experts and trained on vast amounts of data. Thus, the architecture assumes a distributed environment, in order to deal with big data, and Automated Machine Learning (AutoML), to select and tune the ML models. The framework includes several modules: task detection (to detect whether the task is classification or regression), data preprocessing, feature selection, model training and deployment. In this paper, we detail the model training module. In order to select the computational technologies to be used in this module, we first analyzed the capabilities of an initial set of five modern AutoML tools: Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, and TransmogrifAI. Then, we benchmarked the only two tools that support distributed ML (H2O AutoML and TransmogrifAI). Several comparison experiments were conducted using three real-world datasets from the telecommunications domain (churn, event forecasting, and fraud detection), which allowed us to measure the computational effort and predictive capability of the AutoML tools.
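The abstract does not say how the task detection module decides between classification and regression. The following Python sketch shows one possible heuristic, offered purely as an assumption: the `detect_task` function and its `max_classes` threshold are hypothetical and are not taken from the framework.

```python
from numbers import Real


def detect_task(target, max_classes=20):
    """Guess the supervised task type from the target column values.

    Assumed heuristic (not from the paper): non-numeric or boolean targets
    are classification; numeric targets with fractional values or more than
    `max_classes` distinct values are regression; otherwise classification.
    """
    values = [v for v in target if v is not None]
    if not values:
        raise ValueError("target column has no usable values")

    # Booleans are numeric in Python, but are treated as class labels here.
    numeric = all(isinstance(v, Real) and not isinstance(v, bool) for v in values)
    if not numeric:
        return "classification"

    has_fraction = any(not float(v).is_integer() for v in values)
    if has_fraction or len(set(values)) > max_classes:
        return "regression"
    return "classification"
```

A threshold on the number of distinct values is a common but imperfect rule of thumb: an integer-coded target with many classes would be misread as regression, so a production system would likely let the user override the guess.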