2020
DOI: 10.1021/acs.jcim.9b01184

Boosting Tree-Assisted Multitask Deep Learning for Small Scientific Datasets

Abstract: Machine learning approaches have had tremendous success in various disciplines. However, such success highly depends on the size and quality of datasets. Scientific datasets are often small and difficult to collect. Currently, improving machine learning performance for small scientific

citations
Cited by 87 publications
(67 citation statements)
references
References 73 publications
“…The overfitting issue poses a challenge to traditional machine learning methods if a large number of descriptors is used. MT-DNN is a method to extract information from data sets that share certain statistical distributions, which can effectively improve the predictive ability of models on small data sets [2,13]. Based on the AGBT framework, we fuse AG-FPs and BT_s-FPs, i.e., BT-FPs with a supervised fine-tuning procedure for task-specific data.…”
Section: Results (mentioning)
confidence: 99%
“…This gets to a deeper challenge in machine learning, going beyond the scope of this paper: statistical power analysis (see discussion in Slater & Baker, 2018). The trend in machine learning over the last few decades has largely been to consider ever-larger data sets rather than minimum data set sizes needed (Jiang et al., 2020). While not discounting the "unreasonable effectiveness of big data" (Halevy et al., 2009), we note that it is still necessary to determine how many learners of a specific group need to be in a training set (or a separate model's training set) before the model can generally be expected to be reliable for that group.…”
Section: Summary and Discussion (mentioning)
confidence: 99%
“…On the other hand, well-established neural network techniques have emerged in several fields including the one for cardiovascular outcome predictions, often providing promising results, with respect to other more classical machine learning techniques, when large datasets are involved [8, 9, 10]. Generally, decision trees are less data demanding and GBDTs techniques are typically optimal for small datasets, whereas neural networks usually perform better on large datasets [11]. In other words, decision trees can allow the model to reach optimal convergence without requiring those large datasets which are necessary for neural networks.…”
Section: Methods (mentioning)
confidence: 99%