2021
DOI: 10.21105/joss.03073

mikropml: User-Friendly R Package for Supervised Machine Learning Pipelines

Abstract: Machine learning (ML) for classification and prediction based on a set of features is used to make decisions in healthcare, economics, criminal justice and more. However, implementing an ML pipeline including preprocessing, model selection, and evaluation can be time-consuming, confusing, and difficult. Here, we present mikropml (pronounced "meek-ROPE em el"), an easy-to-use R package that implements ML pipelines using regression, support vector machines, decision trees, random forest, or gradient-boosted trees…

Cited by 42 publications (29 citation statements)
References 14 publications
“…We used the mikropml package to train and evaluate models to predict C. difficile colonization status at 10 days postchallenge where mice were categorized as either cleared or colonized (77, 78). We removed the C. difficile genus relative abundance data prior to training the model.…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…To accommodate the small number of samples in our data set, we used 50% training and 50% testing splits with repeated 2-fold cross-validation of the training data for hyperparameter tuning. Permutation importance was performed as described previously (79) using mikropml (77, 78) with the random forest model because it had the highest AUROC value.…”
Section: Methods
Citation type: mentioning
confidence: 99%
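The partitioning scheme described in the statement above (a 50/50 train/test split, then repeated 2-fold cross-validation on the training half) can be sketched in a few lines. This is a language-neutral illustration in plain Python, not mikropml's implementation or API: the function names `split_half` and `repeated_kfold`, the sample count of 20, and the choice of 3 repeats are assumptions made for the example, and the sketch does not stratify by outcome.

```python
import random

def split_half(n_samples, seed):
    """Shuffle sample indices and split them 50/50 into train and test."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    half = n_samples // 2
    return idx[:half], idx[half:]

def repeated_kfold(train_idx, k=2, repeats=3, seed=0):
    """Yield (fit, validate) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        # Reshuffle before every repeat so each repeat uses different folds.
        shuffled = train_idx[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i in range(k):
            validate = folds[i]
            fit = [s for j, fold in enumerate(folds) if j != i for s in fold]
            yield fit, validate

# Hypothetical data set of 20 samples; repeats=3 is an assumed value.
train, test = split_half(20, seed=1)
splits = list(repeated_kfold(train, k=2, repeats=3, seed=1))
```

Hyperparameters would be tuned on the `(fit, validate)` pairs only; the `test` indices stay untouched until the final evaluation.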
“…Utilizing publicly available 16S rRNA sequence data from the stools of patients with SRNs and healthy controls, we generated taxonomic abundance tables with mothur (7) annotated to phylum, class, order, family, genus, OTU, and ASV levels. Using the taxonomic abundance data and the mikropml R package (8), we quantified how reliably samples could be classified as SRN or normal using five machine learning methods, including random forest, L2-regularized logistic regression, decision tree, gradient boosted trees (XGBoost), and support vector machine with radial basis kernel (SVM radial). Across the five machine learning methods tested, model performance increased with increasing taxonomic level usually peaking around genus/OTU level before dropping off slightly with ASVs (see Fig.…”
Section: Observation
Citation type: mentioning
confidence: 99%
“…Machine learning models were run with the R package mikropml (v0.0.2) (8) to predict the diagnosis category (normal versus SRN) of each sample. Data were preprocessed to normalize values (scale/center), remove values with zero or near-zero variance, and collapse colinear features using default parameters.…”
Section: Observation
Citation type: mentioning
confidence: 99%
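The three preprocessing steps named in the statement above (center/scale, drop features without variance, collapse collinear features) can be illustrated with a small plain-Python sketch. It is a simplification under stated assumptions, not mikropml's code: the function name `preprocess`, the toy feature table, and the rule of keeping the first feature from each perfectly correlated group are all hypothetical, and the near-zero-variance filter is reduced to an exact zero-variance check.

```python
from statistics import mean, stdev

def preprocess(columns, corr_thresh=1.0, tol=1e-9):
    """columns maps a feature name to its list of numeric values."""
    # Step 1: drop features with zero variance (near-zero variance is not modeled).
    kept = {name: vals for name, vals in columns.items() if stdev(vals) > 0}

    # Step 2: center each feature to mean 0 and scale it to sample SD 1.
    scaled = {}
    for name, vals in kept.items():
        m, s = mean(vals), stdev(vals)
        scaled[name] = [(v - m) / s for v in vals]

    # Step 3: collapse collinear features, keeping the first one seen (assumed rule).
    # For standardized columns, Pearson r equals sum(a * b) / (n - 1).
    result = {}
    for name, vals in scaled.items():
        n = len(vals)
        collinear = any(
            abs(sum(a * b for a, b in zip(vals, other)) / (n - 1)) >= corr_thresh - tol
            for other in result.values()
        )
        if not collinear:
            result[name] = vals
    return result

# Hypothetical toy table: "b" is a multiple of "a", and "c" is constant.
features = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [5, 5, 5, 5],
    "d": [1, 0, 1, 0],
}
out = preprocess(features)
```

In this toy table, `c` is dropped for having no variance and `b` is collapsed into `a`, leaving `a` and `d` as standardized features.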
“…We also oversampled the data so that the number of attacks against healthcare were approximately the same as the number of non-healthcare attacks through generation of ‘synthetic positive instances using ADASYN algorithm. The number of majority neighbors of each minority instance determines the number of synthetic instances generated from the minority instance’ 71 and fit the algorithm a second time using the mikropml R package 72 to produce 15 additional performance metrics for comparison with the original model. For both fitting processes, the categorical variables year, governorate, perpetrator and weapon were one hot encoded to indicator variables; the five infrastructure type variables were already represented by 1s and 0s and represented categorically to indicate if a strike was recorded as present or absent, respectively.…”
Section: Methods
Citation type: mentioning
confidence: 99%