2023
DOI: 10.3390/pr11123325
On the Development of Descriptor-Based Machine Learning Models for Thermodynamic Properties: Part 1—From Data Collection to Model Construction: Understanding of the Methods and Their Effects

Cindy Trinh,
Youssef Tbatou,
Silvia Lasala
et al.

Abstract: In the present work, a multi-angle approach is adopted to develop two ML-QSPR models for the prediction of the enthalpy of formation and the entropy of molecules in their ideal gas state. The molecules were represented by high-dimensional vectors of structural and physico-chemical characteristics (i.e., descriptors). In this sense, an overview is provided of the possible methods that can be employed at each step of the ML-QSPR procedure (i.e., data preprocessing, dimensionality reduction and model construction)…



Cited by 4 publications (8 citation statements) · References 101 publications
“…This observation, in conjunction with the distribution of enthalpy values in the dataset (cf. the first article of the series [15]), corroborates the hypothesis of poor representation of these molecules in the dataset (e.g., very large molecules). In this case, the elimination of these points is questionable.…”
Section: Outlier Detection (supporting)
confidence: 79%
“…Then, the resulting preprocessed data without outliers were split into training and test sets, the ratio between them being fixed at 80:20, and scaled via a standard scaling method. This ratio for data splitting and this scaling method were indeed identified as well performing in the first article of the series [15]. To better integrate the effect of data splitting on the performance of the models, five different training/test splits were considered.…”
Section: AD Definition as a Data Preprocessing Method (Substudy 1) (mentioning)
confidence: 90%
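The split-and-scale procedure described in the citation above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual code: the descriptor matrix `X` and property vector `y` are placeholder random data, and scikit-learn's `train_test_split` and `StandardScaler` are assumed stand-ins for whatever tooling the study used. Note that the scaler is fitted on the training portion only, so no test-set statistics leak into preprocessing.

```python
# Sketch of the preprocessing described in the cited passage: an 80:20
# train/test split with standard scaling, repeated over five different
# random splits. Placeholder data; not the authors' implementation.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # hypothetical descriptor matrix
y = rng.normal(size=100)        # hypothetical property values (e.g., enthalpy)

splits = []
for seed in range(5):  # five different training/test splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.20, random_state=seed)
    scaler = StandardScaler().fit(X_tr)  # fit scaling on training data only
    splits.append((scaler.transform(X_tr), scaler.transform(X_te), y_tr, y_te))

print(len(splits), splits[0][0].shape, splits[0][1].shape)
# → 5 (80, 5) (20, 5)
```

Repeating the split with several random seeds, as the cited study does, exposes how sensitive model performance is to the particular partition of a small chemical dataset.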