Molecular dynamics (MD) simulations are widely used to model complex nanoscale interactions of atoms and molecules. These simulations can provide detailed insight into how molecules behave under given environmental conditions. This work explores a machine learning (ML) approach to predicting long-term properties of the SARS-CoV-2 spike glycoprotein (S-protein) by analyzing its nanosecond backbone RMSD (root-mean-square deviation) MD simulation data at varying temperatures. The simulation data were denoised with fast Fourier transforms. Model performance was measured by the mean squared error (MSE) of recurrent forecasts for long-term predictions. The models evaluated include k-nearest neighbors (kNN) regression models, GRU (gated recurrent unit) neural networks, and LSTM (long short-term memory) autoencoder models. The kNN model achieved the greatest forecast accuracy, with MSE scores more than about 0.01 nm lower than those of the GRU model and the LSTM autoencoder. Furthermore, the results showed that the accuracy of the kNN model increases with data size, yet the model can still forecast relatively well when trained on small amounts of data, achieving MSE scores of around 0.02 nm when trained on 10,000 ns of simulation data. This study provides valuable information on the feasibility of accelerating the MD simulation process by training supervised ML models and forecasting with them, which is particularly applicable in time-sensitive studies.
Graphic abstract: SARS-CoV-2 spike glycoprotein molecular dynamics simulation; extraction and denoising of backbone RMSD data; evaluation of k-nearest neighbors regression, GRU neural network, and LSTM autoencoder models in recurrent forecasting for long-term property predictions.
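The pipeline this abstract describes (FFT denoising, then recurrent kNN forecasting scored by MSE) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the synthetic sine-plus-noise series standing in for RMSD data, the low-pass cutoff `keep_fraction`, the window length of 50, `k=5`, and the helper names. The kNN regressor is written directly in NumPy instead of using a library model.

```python
import numpy as np

def fft_denoise(signal, keep_fraction=0.05):
    """Low-pass filter: keep only the lowest-frequency Fourier coefficients."""
    coeffs = np.fft.rfft(signal)
    cutoff = max(1, int(len(coeffs) * keep_fraction))  # assumed cutoff rule
    coeffs[cutoff:] = 0.0
    return np.fft.irfft(coeffs, n=len(signal))

def make_windows(series, window):
    """Turn a 1-D series into (past window -> next value) training pairs."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

def knn_predict(X_train, y_train, query, k=5):
    """Average the targets of the k training windows closest to the query."""
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

def recurrent_forecast(X_train, y_train, seed_window, steps, k=5):
    """Feed each prediction back into the input window to forecast many steps."""
    window = np.array(seed_window, dtype=float)
    preds = []
    for _ in range(steps):
        nxt = knn_predict(X_train, y_train, window, k)
        preds.append(nxt)
        window = np.append(window[1:], nxt)
    return np.array(preds)

# Synthetic stand-in for a noisy RMSD trace (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
t = np.arange(2000)
clean = 0.3 + 0.05 * np.sin(2 * np.pi * t / 200)
noisy = clean + rng.normal(0.0, 0.02, t.size)

# For brevity the whole series is denoised before splitting; a stricter
# evaluation would denoise the training portion only.
denoised = fft_denoise(noisy)
train, test = denoised[:1500], denoised[1500:]

X_train, y_train = make_windows(train, window=50)
forecast = recurrent_forecast(X_train, y_train, train[-50:], steps=len(test))
mse = float(np.mean((forecast - test) ** 2))
print(f"recurrent-forecast MSE: {mse:.6f}")
```

Because each forecast step consumes earlier predictions, errors can compound over the horizon, which is why the abstract scores the models on recurrent forecasts and not on single-step predictions.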
Coarse-grained (CG) modeling is a well-established approach to accessing space and time scales that are inaccessible to computationally expensive all-atomic (AA) molecular dynamics (MD) simulations. Popular CG methods follow a bottom-up architecture to match properties of fine-grained or experimental data; developing them is a daunting challenge because it requires deriving a new set of parameters for the potential calculation. We proposed a novel physics-informed machine learning (PIML) framework for a CG model and applied it, as a verification, to modeling the SARS-CoV-2 spike glycoprotein. The PIML in the proposed framework employs a force-matching scheme to determine the force-field parameters. Our PIML framework defines its trainable parameters as the CG force-field parameters and predicts the instantaneous forces on each CG bead, learning the force-field parameters that best match the predicted forces to the reference forces. Using the learned interaction parameters, CGMD validation simulations stably reach the microsecond time scale at a simulation speed 40,000 times faster than conventional AAMD. Compared with the traditional iterative approach, our framework matches the AA reference structure with better accuracy. The improved efficiency enhances the timeliness of research and development in producing long-term simulations of SARS-CoV-2 and opens avenues to help illuminate protein mechanisms and predict its environmental changes.
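The force-matching idea in this abstract, where the trainable parameters are the force-field parameters themselves, can be sketched on the smallest possible case: a single harmonic bond fitted by plain gradient descent. This is an illustrative sketch only. The potential form U(r) = 0.5 k (r - r0)^2, the synthetic reference forces, the arbitrary units, the learning rate, and the function names are all assumptions, not the paper's framework.

```python
import numpy as np

def harmonic_force(r, k, r0):
    """Force along the bond for the assumed potential U(r) = 0.5*k*(r - r0)**2."""
    return -k * (r - r0)

def fit_force_matching(r, f_ref, k=1.0, r0=0.5, lr=0.01, steps=20000):
    """Gradient descent on the mean squared force error with respect to (k, r0)."""
    for _ in range(steps):
        residual = harmonic_force(r, k, r0) - f_ref
        grad_k = np.mean(2.0 * residual * -(r - r0))   # dF/dk  = -(r - r0)
        grad_r0 = np.mean(2.0 * residual * k)          # dF/dr0 = k
        k -= lr * grad_k
        r0 -= lr * grad_r0
    return k, r0

# Synthetic "reference" forces from known parameters plus noise
# (hypothetical values in arbitrary units, standing in for AA data).
rng = np.random.default_rng(0)
K_TRUE, R0_TRUE = 5.0, 1.0
r = rng.uniform(0.5, 1.5, size=500)
f_ref = harmonic_force(r, K_TRUE, R0_TRUE) + rng.normal(0.0, 0.05, r.size)

k_fit, r0_fit = fit_force_matching(r, f_ref)
print(f"k = {k_fit:.3f}, r0 = {r0_fit:.3f}")
```

The loss compares predicted and reference forces directly, so no CG simulation has to be run inside the training loop; a simulation is only needed afterwards to validate the learned parameters.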
This paper presents a physics-informed machine learning approach to deriving a bottom-up coarse-grained model of the SARS-CoV-2 spike glycoprotein from all-atomic molecular dynamics simulations. The machine learning procedure employs a force-matching scheme to optimize the interaction parameters, which are initialized by the traditional iterative Boltzmann inversion method. The force-matched machine learning procedure is built on two physics-informed layers: the Harmonic layer, consisting of bond, angle, and dihedral terms as bonded potentials, and the Lennard-Jones layer, consisting of the non-bonded Lennard-Jones potential. Coarse-grained validation simulations are performed with the learned parameters to test the derived bottom-up coarse-grained model. The simulations stably reach the microsecond time scale. The physics-informed learning approach yields simulation speeds nearly 40,000 times faster than conventional all-atomic simulations while maintaining comparable simulation accuracy. Additionally, examination of the non-bonded Lennard-Jones parameters and radial distribution function analysis show that the learning approach matches the pairwise distances of the ground-truth data with greater accuracy than the conventional iterative approach.
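As a rough illustration of what a non-bonded Lennard-Jones term contributes to force matching, the sketch below fits epsilon and sigma of the standard 12-6 potential to reference pair forces. It is not the paper's layer: it uses a closed-form least-squares fit in place of neural-network training, relying on the fact that the 12-6 force is linear in A = 4*eps*sigma^12 and B = 4*eps*sigma^6. The synthetic data, parameter values, and function names are assumptions.

```python
import numpy as np

def lj_force(r, epsilon, sigma):
    """Radial force -dU/dr for U(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6**2 - sr6) / r

def fit_lj(r, f_ref):
    """Least-squares force matching using F(r) = 12*A/r**13 - 6*B/r**7."""
    design = np.column_stack([12.0 / r**13, -6.0 / r**7])
    (a, b), *_ = np.linalg.lstsq(design, f_ref, rcond=None)
    sigma = (a / b) ** (1.0 / 6.0)   # A / B = sigma**6
    epsilon = b**2 / (4.0 * a)       # B**2 / (4*A) = eps
    return epsilon, sigma

# Synthetic reference pair forces (hypothetical parameters, arbitrary units).
rng = np.random.default_rng(1)
EPS_TRUE, SIGMA_TRUE = 1.0, 0.5
r = rng.uniform(0.45, 1.25, size=400)
f_ref = lj_force(r, EPS_TRUE, SIGMA_TRUE) + rng.normal(0.0, 0.01, r.size)

eps_fit, sigma_fit = fit_lj(r, f_ref)
print(f"epsilon = {eps_fit:.4f}, sigma = {sigma_fit:.4f}")
```

In a full model the pair distances would come from many bead pairs per frame and the bonded and non-bonded terms would be fitted together, which is where gradient-based training becomes more practical than a closed-form solve.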
Recent advances in spectral computed tomography (SpCT), using either multi-energy spectral data acquisition with an energy-integration detector or single-energy spectral data acquisition with a photon-counting detector, have enabled the reconstruction of virtual monochromatic images (VMIs) at any energy value within and outside the energy spectral range of current CT X-ray tubes. This makes it possible not only to visualize the tissue contrast variation characteristics along the X-ray energy dimension, but also to quantify those characteristics by machine learning (ML) for prediction of lesion malignancy or computer-aided diagnosis (CADx). This study explored the energy spectral information of SpCT, i.e., the contrast variation characteristics along the X-ray energy dimension, for ML-CADx of the lesion type of colorectal polyps. In particular, the tissue contrast variation patterns along the X-ray energy dimension in the VMIs, called energy spectral features, are investigated. A figure of merit (FOM) for the task of ML-CADx is proposed, which ranks the series of VMIs along the X-ray energy dimension by inputting each VMI into a single-channel deep learning (DL) pipeline and generating a corresponding AUC (area under the receiver operating characteristic curve) score. The FOM then selects increasing numbers of the most highly ranked VMIs as inputs to a multi-channel DL pipeline and generates the corresponding AUC scores until all VMIs are selected. It is hypothesized that the AUC scores from the multi-channel DL pipeline will rise to a highest score and then drop along the ranking order, because all VMIs share the same anatomic structure and therefore carry strong data redundancy. The FOM reaches the highest AUC score by minimizing the redundancy.
We tested the hypothesis by comparing the proposed FOM-rank ML-CADx with the widely used Karhunen-Loève (KL) transform-based ranking method where the principal components are ordered automatically by the KL transform. The lesion data include the CT images of colorectal polyps and the pathological reports after they were resected. The proposed FOM-rank method outperformed the KL-based ranking method with an optimal gain of 4.7%, showing its effectiveness in prediction of lesion malignancy.
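The FOM ranking procedure described in this abstract (score each VMI alone, rank by AUC, then grow the input set in rank order and track the multi-channel AUC) can be sketched as follows. This is a loose illustration under strong assumptions: each VMI is reduced to one scalar feature per lesion, the DL pipelines are replaced by a trivial stand-in scorer (the mean of z-scored selected channels), and the data are synthetic. The function names are hypothetical, and the stand-in models noisy channels, not the anatomic redundancy the authors discuss.

```python
import numpy as np

def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) formula; assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def multi_channel_score(features, channels):
    """Stand-in for a multi-channel pipeline: mean of z-scored channels."""
    z = (features - features.mean(axis=0)) / features.std(axis=0)
    return z[:, channels].mean(axis=1)

def fom_rank(features, labels):
    """Rank channels by single-channel AUC, then score the top-k sets in order."""
    n_channels = features.shape[1]
    single = np.array([auc(features[:, c], labels) for c in range(n_channels)])
    ranking = np.argsort(-single)  # best single-channel AUC first
    curve = np.array([
        auc(multi_channel_score(features, ranking[:k]), labels)
        for k in range(1, n_channels + 1)
    ])
    best_k = int(np.argmax(curve)) + 1
    return ranking, curve, best_k

# Synthetic data: 200 lesions, 8 "VMI" channels, the last four pure noise
# (hypothetical values, not from the study).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
strengths = np.array([1.0, 0.8, 0.6, 0.4, 0.0, 0.0, 0.0, 0.0])
features = labels[:, None] * strengths + rng.normal(0.0, 1.0, (200, 8))

ranking, curve, best_k = fom_rank(features, labels)
print("ranking:", ranking)
print("AUC by number of channels:", np.round(curve, 3))
print("best number of channels:", best_k)
```

A real evaluation would train the pipelines and compute AUC on held-out lesions; the stand-in scorer here has no trained parameters, so the sketch skips that split.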