ShinyLearner: A containerized benchmarking tool for machine-learning classification of tabular data

Piccolo, Stephen R.; Lee, Terry J.; Suh, Erica; Hill, Kimball T

doi:10.1101/675181

Cited by 2 publications

(2 citation statements)

References 58 publications

“…We used 50 classification algorithms that were implemented in the ShinyLearner tool, which enables researchers to benchmark algorithms that are included in open-source machine-learning libraries; these libraries are redistributed as software containers(77,78). Via ShinyLearner, we used algorithm implementations from the mlr R package (version 2; R version 3.5)(79), sklearn Python module (versions 0.18-0.22)(80), and Weka Java application (version 3.6)(81).…”

Section: Methods (mentioning)

confidence: 99%

Benchmarking 50 classification algorithms on 50 gene-expression datasets

Piccolo

2021

Preprint

Self Cite

By classifying patients into subgroups, clinicians can provide more effective care than using a uniform approach for all patients. Such subgroups might include patients with a particular disease subtype, patients with a good (or poor) prognosis, or patients most (or least) likely to respond to a particular therapy. Diverse types of biomarkers have been proposed for assigning patients to subgroups. For example, DNA variants in tumors show promise as biomarkers; however, tumors exhibit considerable genomic heterogeneity. As an alternative, transcriptomic measurements reflect the downstream effects of genomic and epigenomic variations. However, high-throughput technologies generate thousands of measurements per patient, and complex dependencies exist among genes, so it may be infeasible to classify patients using traditional statistical models. Machine-learning classification algorithms can help with this problem. However, hundreds of classification algorithms exist, and most support diverse hyperparameters, so it is difficult for researchers to know which are optimal for gene-expression biomarkers. We performed a benchmark comparison, applying 50 classification algorithms to 50 gene-expression datasets (143 class variables). We evaluated algorithms that represent diverse machine-learning methodologies and have been implemented in general-purpose, open-source, machine-learning libraries. When available, we combined clinical predictors with gene-expression data. Additionally, we evaluated the effects of performing hyperparameter optimization and feature selection in nested cross-validation folds. Kernel- and ensemble-based algorithms consistently outperformed other types of classification algorithms; however, even the top-performing algorithms performed poorly in some cases. Hyperparameter optimization and feature selection typically improved predictive performance, and univariate feature-selection algorithms outperformed more sophisticated methods. 
Together, our findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies.

“…For feature selection, we used 14 algorithms that had been implemented in ShinyLearner (78). Table 1 lists each of the algorithms, along with a description and high-level category for each algorithm.…”

Section: Algorithms Used (mentioning)

confidence: 99%

Benchmarking 50 classification algorithms on 50 gene-expression datasets

Piccolo

2021

Preprint

Self Cite

ShinyLearner: A containerized benchmarking tool for machine-learning classification of tabular data

et al. 2020

Background: Classification algorithms assign observations to groups based on patterns in data. The machine-learning community have developed myriad classification algorithms, which are used in diverse life science research domains. Algorithm choice can affect classification accuracy dramatically, so it is crucial that researchers optimize the choice of which algorithm(s) to apply in a given research domain on the basis of empirical evidence. In benchmark studies, multiple algorithms are applied to multiple datasets, and the researcher examines overall trends. In addition, the researcher may evaluate multiple hyperparameter combinations for each algorithm and use feature selection to reduce data dimensionality. Although software implementations of classification algorithms are widely available, robust benchmark comparisons are difficult to perform when researchers wish to compare algorithms that span multiple software packages. Programming interfaces, data formats, and evaluation procedures differ across software packages; and dependency conflicts may arise during installation.

Findings: To address these challenges, we created ShinyLearner, an open-source project for integrating machine-learning packages into software containers. ShinyLearner provides a uniform interface for performing classification, irrespective of the library that implements each algorithm, thus facilitating benchmark comparisons. In addition, ShinyLearner enables researchers to optimize hyperparameters and select features via nested cross-validation; it tracks all nested operations and generates output files that make these steps transparent. ShinyLearner includes a Web interface to help users more easily construct the commands necessary to perform benchmark comparisons. ShinyLearner is freely available at https://github.com/srp33/ShinyLearner.

Conclusions: This software is a resource to researchers who wish to benchmark multiple classification or feature-selection algorithms on a given dataset. We hope it will serve as an example of combining the benefits of software containerization with a user-friendly approach.
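
The abstract mentions hyperparameter optimization via nested cross-validation without showing the mechanics. The following is a minimal, standard-library sketch of that general technique; it is not ShinyLearner's code or interface. The k-nearest-neighbors classifier, the function names (`k_folds`, `nested_cv`, `toy_data`), the candidate values, and the synthetic data are all assumptions chosen purely for illustration. The key point is that the inner loop tunes the hyperparameter using only the outer-training samples, so each outer test fold stays unseen until the final evaluation, and every choice is logged per fold.

```python
import random


def k_folds(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train_X, train_y, x, k):
    """Majority label among the k training points nearest to x (illustrative classifier)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = [train_y[i] for i in order[:k]]
    return max(sorted(set(votes)), key=votes.count)


def accuracy(X, y, fit_idx, eval_idx, k):
    """Fit on fit_idx, then return the fraction of eval_idx predicted correctly."""
    fit_X = [X[i] for i in fit_idx]
    fit_y = [y[i] for i in fit_idx]
    hits = sum(knn_predict(fit_X, fit_y, X[i], k) == y[i] for i in eval_idx)
    return hits / len(eval_idx)


def nested_cv(X, y, candidates=(1, 3, 5), outer_k=5, inner_k=3):
    """Return one log entry per outer fold: the chosen k and its test accuracy."""
    log = []
    for fold_no, test_idx in enumerate(k_folds(len(X), outer_k)):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        # Inner folds are drawn only from the outer-training samples.
        inner_folds = [[train_idx[p] for p in fold]
                       for fold in k_folds(len(train_idx), inner_k, seed=fold_no + 1)]

        def inner_score(k):
            scores = []
            for val_idx in inner_folds:
                val = set(val_idx)
                fit_idx = [i for i in train_idx if i not in val]
                scores.append(accuracy(X, y, fit_idx, val_idx, k))
            return sum(scores) / len(scores)

        # Pick the candidate with the best mean inner-validation accuracy.
        best_k = max(candidates, key=inner_score)
        log.append({"fold": fold_no, "best_k": best_k,
                    "test_accuracy": accuracy(X, y, train_idx, test_idx, best_k)})
    return log


def toy_data(n_per_class=15, seed=42):
    """Two well-separated 2-D clusters (synthetic, for demonstration only)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((0, 0.0), (1, 10.0)):
        for _ in range(n_per_class):
            X.append([center + rng.uniform(-1, 1), center + rng.uniform(-1, 1)])
            y.append(label)
    return X, y
```

Calling `nested_cv(*toy_data())` returns a list of per-fold dictionaries, which loosely mirrors the idea of recording each nested operation so the tuning steps remain transparent.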
