2022
DOI: 10.3389/fbinf.2022.835422
Caveats to Deep Learning Approaches to RNA Secondary Structure Prediction

Abstract: Machine learning (ML) and in particular deep learning techniques have gained popularity for predicting structures from biopolymer sequences. An interesting case is the prediction of RNA secondary structures, where well established biophysics based methods exist. The accuracy of these classical methods is limited due to lack of experimental parameters and certain simplifying assumptions and has seen little improvement over the last decade. This makes RNA folding an attractive target for machine learning and con…

Cited by 25 publications (27 citation statements)
References 22 publications
“…Furthermore, such analysis is inevitably influenced by the size, distribution, and consistency of the datasets used, exemplified by the noticeable gaps from the physics-based models aforementioned. Given the scarce and unbalanced nature of existing datasets, the approach of Flamm et al [ 33 ] provides a very useful avenue to interrogate model behaviors in a comprehensive and consistent manner, in which training and test datasets of arbitrary sequences are generated randomly and their structures are predicted with a thermodynamic model, allowing precise diagnosis of model characteristics. Nonetheless, unless the training data maps all sequences into a single secondary structure, DL models are expected to learn features of both RNA sequences and structures, whose relative weights would depend on the dataset, model architecture, and training method.…”
Section: Results
confidence: 99%
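The dataset-generation strategy described in the quoted passage — sample arbitrary random sequences, then fold them with a deterministic algorithm to obtain ground-truth structures — can be sketched as below. This is only an illustrative stand-in: the studies cited use a full thermodynamic model (e.g. the ViennaRNA package), whereas here a simple Nussinov-style base-pair maximization serves as the folding oracle, and all function names are hypothetical.

```python
import random

# Allowed Watson-Crick and wobble pairs
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def random_rna(n, seed=None):
    """Sample a uniformly random RNA sequence of length n."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGU") for _ in range(n))

def nussinov(seq, min_loop=3):
    """Maximum base-pair structure in dot-bracket notation
    (a simplified stand-in for a thermodynamic folding model)."""
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # j unpaired
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    structure = ["."] * n

    def trace(i, j):
        if j - i <= min_loop:
            return
        if dp[i][j] == dp[i][j - 1]:
            trace(i, j - 1)
            return
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        trace(i, k - 1)
                    trace(k + 1, j - 1)
                    return

    trace(0, n - 1)
    return "".join(structure)

# Toy (sequence, structure) training set of arbitrary random sequences
dataset = [(s, nussinov(s)) for s in (random_rna(30, seed=i) for i in range(100))]
```

Because every label comes from the same deterministic folding rule, any gap between a model's predictions and these labels reflects the model itself rather than dataset noise — the diagnostic property the quoted passage attributes to the Flamm et al. setup.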
“…A number of highly successful de novo DL models have been reported, such as 2dRNA [21], ATTfold [22], DMfold [23], E2Efold [24], MXfold2 [25], SPOT-RNA [26], and Ufold [27], among others [28][29][30][31]. These DL models markedly outperform traditional algorithms, with even close-to-perfect predictions in some cases, though questions on the training vs. test similarity have been raised [32,33] and discussed below. It is worth noting that DL models have also been developed for base-level prediction tasks such as the pairing probability of each base [34].…”
Section: Introduction
confidence: 99%
“…The opposite approaches use various supervised learning methodologies such as covariance models (2) or deep neural networks (14). However, recent work from (3) revealed that neural network-based methods suffer from generalization issues, making these supervised methods less suitable for discovering new motifs or functions.…”
Section: Introduction
confidence: 99%