Background
With the development of sequencing technology, third-generation sequencing platforms produce ultra-long reads cost-effectively, making them attractive for many application scenarios. However, compared with the shorter but more accurate reads obtained by next-generation sequencing, the high error rates of these long reads reduce the accuracy of downstream analyses.
Results
In this work, a novel hybrid error correction algorithm (NMTHC) based on the sequence-to-sequence framework is proposed, which recasts long-read error correction as a machine translation problem from natural language processing. First, the high-accuracy short reads are aligned to the long reads to generate alignment information; the long reads and alignment information are then tokenized to produce "source sentences" and "target sentences". The generated sequences are padded to the same length and divided into batches for training. Both the encoding and decoding layers of the model use a bidirectional Long Short-Term Memory (Bi-LSTM) network to encode and decode hidden-state information, and the token generated at the current time step, together with the updated state, is used to predict the token of the next time step.
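The tokenize-pad-batch preprocessing described above can be illustrated with a minimal sketch. This is not the authors' implementation; the per-base token scheme, padding symbol, and function names are assumptions chosen for illustration only.

```python
# Illustrative sketch (not NMTHC's actual code): turn nucleotide
# sequences into token "sentences", pad them to a common length,
# and split them into fixed-size batches for training.

PAD = "<pad>"  # assumed padding symbol

def tokenize(seq):
    """Split a nucleotide sequence into per-base tokens."""
    return list(seq)

def pad_and_batch(sequences, batch_size):
    """Pad token lists to the longest sequence, then batch them."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [PAD] * (max_len - len(s)) for s in sequences]
    return [padded[i:i + batch_size]
            for i in range(0, len(padded), batch_size)]

reads = [tokenize(r) for r in ["ACGT", "ACGGTA", "TTG"]]
batches = pad_and_batch(reads, batch_size=2)
```

In practice a vocabulary would map each token to an integer index before the batches are fed to the Bi-LSTM encoder, but the padding step above is what allows variable-length reads to share one tensor shape.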
Conclusions
We have tested our algorithm on subsets of long reads, and the results show that NMTHC not only increases the number of bases aligned to the reference genome but also improves alignment identity without any loss of read length. In addition, NMTHC can be distributed across multiple GPUs to accelerate correction. In summary, NMTHC corrects long reads more accurately while preserving read length and continuity. It can be applied to sequencing data from both mainstream platforms and provides a new perspective for hybrid error correction algorithms based on deep learning.