Caveats to Deep Learning Approaches to RNA Secondary Structure Prediction

Flamm, Christoph; Wielach, Julia; Wolfinger, Michael T.; Badelt, Stefan; Lorenz, Ronny; Hofacker, Ivo L.

doi:10.3389/fbinf.2022.835422

Cited by 25 publications

(27 citation statements)

References 22 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Furthermore, such analysis is inevitably influenced by the size, distribution, and consistency of the datasets used, exemplified by the noticeable gaps from the physics-based models aforementioned. Given the scarce and unbalanced nature of existing datasets, the approach of Flamm et al [ 33 ] provides a very useful avenue to interrogate model behaviors in a comprehensive and consistent manner, in which training and test datasets of arbitrary sequences are generated randomly and their structures are predicted with a thermodynamic model, allowing precise diagnosis of model characteristics. Nonetheless, unless the training data maps all sequences into a single secondary structure, DL models are expected to learn features of both RNA sequences and structures, whose relative weights would depend on the dataset, model architecture, and training method.…”

Section: Resultsmentioning

confidence: 99%

“…A number of highly successful de novo DL models have been reported, such as 2dRNA [21], ATTfold [22], DMfold [23], E2Efold [24], MXfold2 [25], SPOT-RNA [26], and Ufold [27], among others [28][29][30][31]. These DL models markedly outperform traditional algorithms, with even close-to-perfect predictions in some cases, though questions on the training vs. test similarity have been raised [32,33] and discussed below. It is worth noting that DL models have also been developed for base-level prediction tasks such as the pairing probability of each base [34].…”

Section: Introductionmentioning

confidence: 99%

“…However, markedly worse performances were observed for test sequences in different families as the training set, and retraining several current DL models led to similar observations. In the other study [33], Flamm et al explored the learning and generalization capacities of various neural networks with synthetic RNA sequences folded into secondary structures by a consistent thermodynamic model (RNAfold [36]), as well as inversefolded sequences from true structures in the bpRNA dataset. Their approach circumvents the unbalanced distributions and potential errors of available datasets and further allows engineered biases in RNA sequence or structure.…”

Section: Introductionmentioning

confidence: 99%

“…To this end, we investigate the performance and generalizability of a series of de novo DL models of different sizes under varied sequence distributions. Compared with the two recent studies [ 32 , 33 ], we focus on the development and analysis of end-to-end DL models that can match or outperform the state-of-the-art de novo DL models when trained on the same public datasets at well-defined similarity levels, and we take advantage of RNA alignment tools to elicit model characteristics quantitatively. Specifically, we chose a minimal two-module architecture without post-processing so as to probe intrinsic model characteristics.…”

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Sequence similarity governs generalizability of de novo deep learning models for RNA secondary structure prediction

Qiu

2023

PLoS Comput Biol

View full text Add to dashboard Cite

Making no use of physical laws or co-evolutionary information, de novo deep learning (DL) models for RNA secondary structure prediction have achieved far superior performances than traditional algorithms. However, their statistical underpinning raises the crucial question of generalizability. We present a quantitative study of the performance and generalizability of a series of de novo DL models, with a minimal two-module architecture and no post-processing, under varied similarities between seen and unseen sequences. Our models demonstrate excellent expressive capacities and outperform existing methods on common benchmark datasets. However, model generalizability, i.e., the performance gap between the seen and unseen sets, degrades rapidly as the sequence similarity decreases. The same trends are observed from several recent DL and machine learning models. And an inverse correlation between performance and generalizability is revealed collectively across all learning-based models with wide-ranging architectures and sizes. We further quantitate how generalizability depends on sequence and structure identity scores via pairwise alignment, providing unique quantitative insights into the limitations of statistical learning. Generalizability thus poses a major hurdle for deploying de novo DL models in practice and various pathways for future advances are discussed.

show abstract

Section: Resultsmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Sequence similarity governs generalizability of de novo deep learning models for RNA secondary structure prediction

Qiu

2023

PLoS Comput Biol

View full text Add to dashboard Cite

show abstract

“…The opposite approaches use various supervised learning methodologies such as covariance models (2) or deep neural networks (14). However, recent work from (3) revealed that neural network-based methods suffer from generalization issues, making these supervised methods less suitable for discovering new motifs or functions.…”

Section: Introductionmentioning

confidence: 99%

Investigating graph neural network for RNA structural embedding

Opuu

Bret

2022

Preprint

View full text Add to dashboard Cite

The biological function of natural non-coding RNAs (ncRNA) is tightly bound to their molecular structure. Sequence analyses such as multiple sequence alignments (MSA) are the bread and butter of bio-molecules functional analysis; however, analyzing sequence and structure simultaneously is a difficult task. In this work, we propose CARNAGE (Clustering/Alignment of RNA with Graph-network Embedding), which leverages a graph neural network encoder to imprint structural information into a sequence-like embedding; therefore, downstream sequence analyses now account implicitly for structural constraints. In contrast to the traditional "supervised" alignment approaches, we trained our network on a masking problem, independent from the alignment or clustering problem. Our method is very versatile and has shown good performances in 1) designing RNAs sequences, 2) clustering sequences, and 3) aligning multiple sequences only using the simplest Needleman and Wunsch's algorithm. Not only can this approach be readily extended to RNA tridimensional structures, but it can also be applied to proteins.

show abstract