2022
DOI: 10.1101/2022.06.29.498185
Preprint

Decisive Roles of Sequence Distributions in the Generalizability of de novo Deep Learning Models for RNA Secondary Structure Prediction

Abstract: The availability of sizeable RNA structure databases and powerful deep learning (DL) frameworks has prompted recent developments of DL models for RNA secondary structure prediction. Taking RNA sequences as the only inputs, the class of de novo DL models has demonstrated far superior performances than traditional algorithms. However, key questions remain over the statistical underpinning of such DL models which make no use of co-evolutionary information or physical laws of RNA folding. Here we present a quantit…

Cited by 5 publications (5 citation statements)
References 66 publications (88 reference statements)
“…Similar results were observed on the TORNADO dataset (see Supplementary Table 9 for details). The underlying reasons may be the susceptibility to overfitting of deep learning models, and the low data coverage and density over diverse structures, as discussed in earlier studies [66, 67]. To address this challenge, potential solutions include: 1) Actively exploring the integration of inductive bias terms, such as statistical energy terms, into the KnotFold model; 2) Considering the implementation of more ensemble strategies to introduce greater diversity to the models.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…This raises concerns about the generalizability of deep learning models trained on such limited data. Previous studies by Szikszai et al. [39] and Qiu [40] highlighted the challenges of deep-learning models when applied to unseen families not present in the training and validation sets. To evaluate the adaptability of SPOT-RNA and SPOT-RNA2 beyond their training and validation data, we conducted a test by removing all test set structures with the structural similarity score (TM-score) ≥ 0.3 compared to those in the training and validation sets.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…However, this also increases the risk of overfitting and poor prediction accuracy for structurally dissimilar sequences. The problem of overfitting is a prevalent issue not only in deep learning but also in other machine learning techniques with rich parametrization; it is particularly acute in deep learning as models can easily be scaled to an enormous number of parameters [75, 78].…”
Section: RNA Secondary Structure Prediction (citation type: mentioning)
confidence: 99%