2022
DOI: 10.48550/arxiv.2201.12682
Preprint

Geometry- and Accuracy-Preserving Random Forest Proximities

Abstract: Random forests are considered one of the best out-of-the-box classification and regression algorithms due to their high level of predictive performance with relatively little tuning. Pairwise proximities, which measure the similarity between data points relative to the supervised task, can be computed from a trained random forest. Random forest proximities have been used in many applications, including the identification of variable importance, data imputation, outlier detection, and data visualization. However, …
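As a point of reference for the abstract, the classical (Breiman) proximity between two samples is the fraction of trees in which they land in the same leaf. The sketch below illustrates that baseline definition only; it is not the geometry- and accuracy-preserving variant this preprint proposes, and the helper name `rf_proximity` is illustrative, not from the paper.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def rf_proximity(forest, X):
    """Classical RF proximity: fraction of trees where two samples
    share a terminal leaf (illustrative helper, not the paper's RF-GAP)."""
    leaves = forest.apply(X)  # shape (n_samples, n_trees): leaf index per tree
    n = X.shape[0]
    prox = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        # Pairwise "same leaf in tree t" indicator via broadcasting
        prox += leaves[:, t][:, None] == leaves[:, t][None, :]
    return prox / leaves.shape[1]

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
P = rf_proximity(rf, X)
# P is symmetric with ones on the diagonal; 1 - P acts as a dissimilarity,
# which is what enables the imputation/outlier/visualization uses listed above.
```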

Cited by 2 publications (4 citation statements)
References 44 publications
“…In previous work, we have shown that ensembles built using such splits do a better job of modeling the underlying patterns in data ( 25 ). This is also well supported by other work in the field ( 15 , 17 , 24 , 41 ). By using LANDMark, TreeOrdination also minimizes the impact of noisy features through randomization (bootstrapping of training data at each node, random selection of features, and models) and regularization (most models selected for splitting are L1 or L2 regularized) ( 21 , 25 ).…”
Section: Discussion (supporting)
confidence: 86%
“…Unlike statistical models, machine learning models tend not to assume anything about the underlying distribution of each feature ( 4 , 5 ). Furthermore, some machine learning models, such as random forest (RF) and related classifiers, are capable of identifying dependencies between features without the need for the user to explicitly include these dependencies in the model ( 11 , 14 17 ). One ability, arguably underused, inherent to this class of models is that they can be used in an “unsupervised” manner to learn a dissimilarity function ( 15 , 18 , 19 ).…”
Section: Introduction (mentioning)
confidence: 99%
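The "unsupervised" use of random forests mentioned in the citation above is commonly realized with Breiman's synthetic-contrast trick: train a forest to separate the real data from a column-permuted copy, then read off proximities as a learned dissimilarity. A minimal sketch under those assumptions (data and names here are illustrative, not from the cited work):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Synthetic "noise" class: permute each column independently, which
# preserves marginal distributions but destroys feature dependencies.
X_synth = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

X_all = np.vstack([X, X_synth])
y_all = np.concatenate([np.zeros(len(X)), np.ones(len(X_synth))])

# The forest learns structure by discriminating real from synthetic rows.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_all, y_all)

# Proximity on the real data: fraction of trees sharing a leaf, per pair.
leaves = rf.apply(X)  # (100, n_trees)
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
dissim = 1.0 - prox  # a learned dissimilarity, usable for clustering/ordination
```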