2021
DOI: 10.1021/acs.iecr.1c02142

Development of Solubility Prediction Models with Ensemble Learning

Abstract: The solubility parameter is widely used to select suitable solvents for polymers in the polymer-processing industry. In this study, we established a Hildebrand solubility parameter prediction model using ensemble-learning methods. The database used in the study is from the 2019 edition of the DIPPR 801 database, which includes solubility parameters for 1889 chemicals after removing invalid entries and outliers. Three machine-learning techniques including random forest, gradient boosting, and extreme gradient (…)

Cited by 19 publications (9 citation statements)
References 27 publications
“…The number of MOF descriptors in the data set was large compared to the data set size. Therefore, a reduction in the number of descriptors is needed, since the accuracy of a machine-learning model depends on the quality of the descriptor data. As depicted in Figure , descriptor screening and statistical significance determination are crucial to ensure that the model includes only descriptors highly correlated with the target property. A 10–15% ratio of descriptors to the total number of molecules in the data set is standard.…”
Section: Methods
confidence: 99%
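The correlation-based descriptor screening described in that statement can be sketched as follows. This is a minimal illustration, not the cited paper's actual pipeline: the descriptor names, the toy data, and the 12% keep-ratio (a point inside the quoted 10–15% rule of thumb) are all assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def screen_descriptors(descriptors, target, ratio=0.12):
    """Keep the descriptors most correlated with the target property,
    capping their count at ratio * number of molecules (the 10-15%
    rule of thumb quoted above)."""
    n_keep = max(1, int(ratio * len(target)))
    ranked = sorted(descriptors.items(),
                    key=lambda kv: abs(pearson(kv[1], target)),
                    reverse=True)
    return [name for name, _ in ranked[:n_keep]]

# Toy example: 10 "molecules", 3 hypothetical descriptors.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
descriptors = {
    "molar_volume": [2.1, 4.0, 6.2, 7.9, 10.1, 12.0, 14.2, 15.9, 18.1, 20.0],
    "ring_count":   [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0],
    "noise":        [0.5, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.3, 0.0, -0.4],
}
print(screen_descriptors(descriptors, target))  # → ['molar_volume']
```

With 10 molecules and a 12% ratio, only the single descriptor most correlated with the target survives the screen.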
“…The child decision nodes will keep repeating the splitting process until stopping criteria are met (e.g., the DT exceeding a maximum depth). A single DT is highly interpretable but prone to overfitting. Therefore, ensemble tree methods like random forest (RF) and gradient boosting (GB), which use a technique called “sampling with replacement,” were developed to prevent overfitting.…”
Section: Methods
confidence: 99%
“…A single DT is highly interpretable but prone to overfitting. 38 Therefore, ensemble tree methods like random forest (RF) 39 and gradient boosting (GB) 40 using a technique called “sampling with replacement” were developed to prevent overfitting. Instead of establishing a single DT, RF constructs a number of DTs (the “forest”) with randomly selected features to minimize feature correlation.…”
Section: Machine Learning Algorithms
confidence: 99%
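The "sampling with replacement" idea quoted in these statements is bootstrap aggregation: each tree is fit on a resample of the training set drawn with replacement, and the forest averages their predictions to damp single-tree overfitting. A minimal sketch, with one-feature decision stumps standing in for full decision trees (the stump and the toy data are illustrative assumptions, not the cited paper's models):

```python
import random
import statistics

def bootstrap_sample(xs, ys, rng):
    """Draw len(xs) training points with replacement (the bootstrap)."""
    idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
    return [xs[i] for i in idx], [ys[i] for i in idx]

def fit_stump(xs, ys):
    """A depth-1 'tree': split at the median x, predict the mean y on each side."""
    split = statistics.median(xs)
    left = [y for x, y in zip(xs, ys) if x <= split] or ys
    right = [y for x, y in zip(xs, ys) if x > split] or ys
    return split, statistics.mean(left), statistics.mean(right)

def predict_stump(stump, x):
    split, left_mean, right_mean = stump
    return left_mean if x <= split else right_mean

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Fit each stump on its own bootstrap resample of the data."""
    rng = random.Random(seed)
    return [fit_stump(*bootstrap_sample(xs, ys, rng)) for _ in range(n_trees)]

def predict_forest(forest, x):
    """Average the trees' predictions, which damps single-tree overfitting."""
    return statistics.mean(predict_stump(s, x) for s in forest)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 0.9, 1.0, 1.2, 3.9, 4.1, 4.0, 3.8]  # step function plus noise
forest = fit_forest(xs, ys)
print(predict_forest(forest, 2), predict_forest(forest, 7))
```

The averaged predictions land near 1 on the left of the step and near 4 on the right; a real RF adds random feature selection per split on top of this, as the last quoted sentence notes.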
“…This model has two components: the back-propagation neural network cross (BPC) and support vector regression (SVR). Ensemble learning 15 builds and combines multiple learners to accomplish a learning task; the combination usually achieves significantly better generalization performance than a single learner and improves the prediction ability and stability of the model. The bagging method of ensemble learning divides the data set and recombines it into multiple training sets to improve the learning effect, 16 and the BPC model is based on this idea.…”
Section: Introduction
confidence: 99%
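Combining two different base learners, as with the BPC and SVR components described above, can be sketched as a weighted average of their predictions. The two "learners" here are a least-squares line and a constant mean predictor, stand-ins chosen for brevity (not actual neural networks or SVR), and the 0.7/0.3 weights are an illustrative assumption:

```python
def fit_linear(xs, ys):
    """Least-squares line y = a*x + b; stand-in for one base learner."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

def fit_mean(xs, ys):
    """A deliberately crude second learner: always predict the mean y."""
    m = sum(ys) / len(ys)
    return lambda x: m

def combine(models, weights):
    """Ensemble prediction: weighted average of the base learners' outputs."""
    total = sum(weights)
    return lambda x: sum(w * m(x) for m, w in zip(models, weights)) / total

xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]   # roughly y = 2x
linear = fit_linear(xs, ys)
mean = fit_mean(xs, ys)
both = combine([linear, mean], weights=[0.7, 0.3])
print(round(linear(6), 2), round(mean(6), 2), round(both(6), 2))
```

In practice the weights would be tuned (or the combiner itself learned, as in stacking) so that the ensemble's generalization beats either component alone, which is the point the quoted introduction makes.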