S2Snet: deep learning for low molecular weight RNA identification with nanopore

Guan, Xiaoyu; Shao, Wei; Li, Zhongnian; Huang, Shuo; Zhang, Daoqiang

doi:10.1093/bib/bbac098

Cited by 3 publications

(14 citation statements)

References 24 publications


“…The input long sequence S is truncated to n sub-sequence […], and the input RNA type T is truncated to n sub-targets […]. It considers that the previous paper used the RF algorithm as the classification model to distinguish the RNA types (Guan et al., 2022; Wang et al., 2021). Therefore, the input of RF algorithm is the feature vector (v_i) of the sub-sequence s_i extracted by feature extract methods, which contains the length, mean, standard deviation and other statistical information.…”

Section: Methods (mentioning)

confidence: 99%
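
The feature extraction described in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the equal-length split and the helper names `split_into_subsequences` and `feature_vector` are assumptions, and only the three statistics named in the quote (length, mean, standard deviation) are computed.

```python
from statistics import mean, pstdev


def split_into_subsequences(signal, n):
    """Split `signal` into n contiguous chunks of near-equal length.

    Assumption: equal-length splitting is used purely for illustration;
    the quoted paper does not say how sub-sequence boundaries are chosen.
    """
    if n <= 0 or n > len(signal):
        raise ValueError("n must be between 1 and len(signal)")
    base, extra = divmod(len(signal), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks take one additional sample each.
        end = start + base + (1 if i < extra else 0)
        chunks.append(signal[start:end])
        start = end
    return chunks


def feature_vector(sub):
    """Return [length, mean, population standard deviation] for one sub-sequence."""
    return [len(sub), mean(sub), pstdev(sub)]


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
features = [feature_vector(s) for s in split_into_subsequences(signal, 3)]
```

Each row of `features` would then be one input vector for a classifier such as a random forest.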

“…In some cases, the machine learning model C is specially set as the RF algorithm in the RNA types prediction experiment, as shown in Figure 1b. Correspondingly, we set the C as the CNN model in the ONT barcode classification experiment and the RNA type classification experiment by S2Snet (Guan et al., 2022). Notably, the query function Q contains six common strategies: query-by-committee (QBC) is based on the QS (Freund et al., 1997), Random is the random sampling, QUerying Informative and Representative Examples (QUIRE) is the pool-based active learning strategy (Huang et al., 2010), Density is the density-based sampling AL strategy (Nguyen and Smeulders, 2004), EER is Expected Error Reduction (Roy and McCallum, 2001), LAL is Learning Active Learning (Konyushkova et al., 2017), SPAL is Self-Paced Active Learning (Tang and Huang, 2019) and UNCertainty sampling (UNC) is based on the Margin Sampling (Lewis and Gale, 1994) in our experimental configuration, as shown in Figure 1c.…”

Section: Methods (mentioning)

confidence: 99%
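
Of the query strategies listed in the statement above, margin sampling is the simplest to illustrate. The sketch below shows the general technique only and is not taken from the cited work: the function names `margin` and `select_by_margin` and the example probabilities are hypothetical. It ranks unlabeled samples by the gap between their two highest predicted class probabilities and picks the smallest gaps for labeling.

```python
def margin(probs):
    """Gap between the two largest class probabilities of one sample."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def select_by_margin(pool_probs, k):
    """Return indices of the k pool samples with the smallest margin.

    A small margin means the classifier is torn between two classes,
    so labeling that sample is expected to be most informative.
    """
    ranked = sorted(range(len(pool_probs)), key=lambda i: margin(pool_probs[i]))
    return ranked[:k]


# Hypothetical class probabilities for three unlabeled samples.
pool = [
    [0.90, 0.05, 0.05],  # confident prediction, large margin
    [0.40, 0.35, 0.25],  # ambiguous prediction, smallest margin
    [0.50, 0.30, 0.20],
]
to_label = select_by_margin(pool, 2)
```

Here `to_label` holds the indices of the two most ambiguous samples, which an expert would label before the model is retrained.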

“…For RNA-CD, we use the RNA sequencing data from the previous publication (Guan et al., 2022; Wang et al., 2021). The data type is generally the time series shown in Figure 1.…”

Section: Methods (mentioning)

confidence: 99%

“…The sequence contains not only the effective molecular sequencing signals but also the noise signals. For example, in the previous work (Guan et al., 2022; Wang et al., 2021), the obtained sequence signal contained six RNA molecule sequencing signals and one noise signal. The experimental results show that the shapes of the sequencing signals of the three RNA molecules are similar and the noise signals have all kinds of strange shapes.…”

Section: Introduction (mentioning)

confidence: 99%

“…To overcome the dilemma of labeling nanopore dataset, we apply the AL-based strategy to verify their effectiveness in the nanopore field. We apply the AL-based techniques to the RNA molecule classification dataset (RNA-CD) from previous work (Guan et al., 2022; Wang et al., 2021) and the open resource ONT barcode dataset (ONT-BD) (Bell and Keyser, 2016; Misiunas et al., 2018). The main contributions of our work are listed below:…”

Section: Introduction (mentioning)

confidence: 99%


Active learning for efficient analysis of high-throughput nanopore data

Guan, Zhou, et al. 2022

Bioinformatics

Self Cite


As the third-generation sequencing technology, nanopore sequencing has been used for high-throughput sequencing of DNA, RNA, and even proteins. Recently, many studies have begun to use machine learning technology to analyze the enormous data generated by nanopores. Unfortunately, the success of this technology is due to the extensive labeled data, which often suffer from enormous labor costs. Therefore, there is an urgent need for a novel technology that can not only rapidly analyze nanopore data with high-throughput, but also significantly reduce the cost of labeling. To achieve the above goals, we introduce active learning to alleviate the enormous labor costs by selecting the samples that need to be labeled. This work applies several advanced active learning technologies to the nanopore data, including the RNA classification dataset (RNA-CD) and the Oxford Nanopore Technologies barcode dataset (ONT-BD). Due to the complexity of the nanopore data (with noise sequence), the bias constraint is introduced to improve the sample selection strategy in active learning. The experimental results show that for the same performance metric, 50% labeling amount can achieve the best baseline performance for ONT-BD, while only 15% labeling amount can achieve the best baseline performance for RNA-CD. Crucially, the experiments show that active learning technology can assist experts in labeling samples, and significantly reduce the labeling cost. Active learning can greatly reduce the dilemma of difficult labeling of high-capacity nanopore data. We hope active learning can be applied to other problems in nanopore sequence analysis.

Availability: The main program is available at https://github.com/guanxiaoyu11/AL-for-nanopore.

Supplementary information: Supplementary data are available at Bioinformatics online.



T-S2Inet: Transformer-based sequence-to-image network for accurate nanopore sequence recognition

Guan, Shao, Zhang 2024

Bioinformatics

Self Cite


Motivation: Nanopore sequencing is a new macromolecular recognition and perception technology that enables high-throughput sequencing of DNA, RNA, even protein molecules. The sequences generated by nanopore sequencing span a large time frame, and the labor and time costs incurred by traditional analysis methods are substantial. Recently, research on nanopore data analysis using machine learning algorithms has gained unceasing momentum, but there is often a significant gap between traditional and deep learning methods in terms of classification results. To analyze nanopore data using deep learning technologies, measures such as sequence completion and sequence transformation can be employed. However, these technologies do not preserve the local features of the sequences. To address this issue, we propose a sequence-to-image (S2I) module that transforms sequences of unequal length into images. Additionally, we propose the Transformer-based T-S2Inet model to capture the important information and improve the classification accuracy.

Results: Quantitative and qualitative analysis shows that the experimental results have an improvement of around 2% in accuracy compared to previous methods. The proposed method is adaptable to other nanopore platforms, such as the Oxford nanopore. It is worth noting that the proposed method not only aims to achieve the most advanced performance, but also provides a general idea for the analysis of nanopore sequences of unequal length.

Availability: The main program is available at https://github.com/guanxiaoyu11/S2Inet.

Supplementary information: Supplementary data are available at Bioinformatics online.


Transformer Architecture and Attention Mechanisms in Genome Data Analysis: A Comprehensive Review

Choi, Lee 2023

Biology


The emergence and rapid development of deep learning, specifically transformer-based architectures and attention mechanisms, have had transformative implications across several domains, including bioinformatics and genome data analysis. The analogous nature of genome sequences to language texts has enabled the application of techniques that have exhibited success in fields ranging from natural language processing to genomic data. This review provides a comprehensive analysis of the most recent advancements in the application of transformer architectures and attention mechanisms to genome and transcriptome data. The focus of this review is on the critical evaluation of these techniques, discussing their advantages and limitations in the context of genome data analysis. With the swift pace of development in deep learning methodologies, it becomes vital to continually assess and reflect on the current standing and future direction of the research. Therefore, this review aims to serve as a timely resource for both seasoned researchers and newcomers, offering a panoramic view of the recent advancements and elucidating the state-of-the-art applications in the field. Furthermore, this review paper serves to highlight potential areas of future investigation by critically evaluating studies from 2019 to 2023, thereby acting as a stepping-stone for further research endeavors.

