The widespread use of high-throughput genome sequencing technologies has significantly increased the number of available sequences, creating new challenges in error detection and quality control for genome annotation and the prediction of protein-coding genes. Multiple Sequence Alignments (MSAs) of the predicted protein sequences provide important contextual information that can be used, through either human expertise or statistical analysis of the sequence data, to distinguish errors (caused by artifacts in the raw genome data, badly predicted gene sequences, or the alignment methods themselves) from true biological events. Here, we propose a new approach that uses visual representations of MSAs from an in-house dataset, in which errors are carefully identified, as inputs to Convolutional Neural Networks (CNNs) that classify MSAs into erroneous and non-erroneous categories. Our model, called De-MISTED (Deep learning for MultIple Sequence alignmenTs Error Detection), shows high accuracy (87%) and sensitivity (92%) in identifying MSAs containing erroneous sequences. Visual explanation techniques show that our model correctly locates multiple errors of different types (insertions, deletions and mismatches). Close examination of the data showed that our model can also correctly identify errors that were not annotated in the data. The De-MISTED method thus contributes to a more robust exploitation of genome data.
Multiple Sequence Alignments (MSAs) form the basis of many biological sequence analysis methods. However, they are susceptible to irregularities that result either from errors in the predicted sequences or from natural biological events. In this paper, we propose MERLIN (Msa ERror Localization and IdentificatioN), an object detector that identifies such irregularities using visual representations of MSAs. Our model is built on a state-of-the-art deep learning object detector, YOLOv4, and trained on a set of MSA images from a dataset built in-house with automatically annotated errors. Our object detector achieves a mean Average Precision of 71.18% in predicting different types of errors within MSAs. A thorough examination of the results showed that our method correctly identifies certain inconsistencies that were missed by the automatic annotation algorithm.
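Mean Average Precision for an object detector is conventionally built on the Intersection over Union (IoU) between a predicted bounding box and an annotated one: a prediction counts as correct when its IoU with an annotation of the same class exceeds a chosen threshold. The sketch below shows only that overlap computation. The abstract does not state which threshold or box format was used for MERLIN, so the `(x_min, y_min, x_max, y_max)` convention and the example boxes are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes: an annotated error region and a slightly shifted prediction.
annotated = (0, 0, 2, 2)
predicted = (1, 1, 3, 3)
print(iou(annotated, predicted))
```

In an MSA image, such a box would delimit the columns and rows affected by an insertion, deletion or mismatch, so a higher IoU means the detector localized the error more precisely.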