debar, a sequence-by-sequence denoiser for COI-5P DNA barcode data

Nugent, Cameron M.; Elliott, Tyler A.; Ratnasingham, Sujeevan; Hebert, Paul D. N.; Adamowicz, Sarah J.

doi:10.1101/2021.01.04.425285

Cited by 2 publications

(3 citation statements)

References 36 publications


“…Our pseudogene removal approach was most effective on datasets of the full length COI barcode sequence region but is less effective for shorter sequences (~300 bp). Now that newer sequencing technologies such as LoopSeq, compatible with Illumina sequencing platforms but currently only available for RNA genes, or HiFi circular consensus sequencing (PacBio), it may one day be possible for COI metabarcoding to target the full length of the barcoding region to facilitate more efficient nuMT detection [39, 66–68]. It would also be helpful if DNA barcode studies reported and deposited full length verified pseudogenes into public databases when possible.…”

Section: Discussion (mentioning)

confidence: 99%

“…For example, COI marker analysis need not be limited to operational taxonomic units (OTUs), but may also include the use of exact sequence variant (ESV) analysis for improved taxonomic resolution and permit intraspecific phylogeographic analyses [34–37]. Bioinformatic tools to remove sequence artefacts and noise specifically from COI datasets have also become available [38–40]. COI nuMTs have been discussed in the literature largely with regards to COI barcoding efforts [18,19,41] and only recently have tools appropriate for screening nuMTs from large batches of COI sequences become available [42].…”

Section: Introduction (mentioning)

confidence: 99%


Profile hidden Markov model sequence analysis can help remove putative pseudogenes from DNA barcoding and metabarcoding datasets

Porter

Hajibabaei

2021

BMC Bioinformatics


Background: Pseudogenes are non-functional copies of protein-coding genes that typically follow a different molecular evolutionary path from functional genes. The inclusion of pseudogene sequences in DNA barcoding and metabarcoding analysis can lead to misleading results. None of the most widely used bioinformatic pipelines for processing marker gene (metabarcode) high-throughput sequencing data specifically accounts for the presence of pseudogenes in protein-coding marker genes. The purpose of this study is to develop a method to screen for nuclear mitochondrial DNA segments (nuMTs) in large COI datasets. We do this by (1) describing gene and nuMT characteristics from an artificial COI barcode dataset, (2) showing the impact of two different pseudogene removal methods on perturbed community datasets with simulated nuMTs, and (3) incorporating a pseudogene filtering step in a bioinformatic pipeline that can be used to process Illumina paired-end COI metabarcode sequences. Open reading frame length and sequence bit scores from hidden Markov model (HMM) profile analysis were used to detect pseudogenes.

Results: Our simulations showed that it was more difficult to identify nuMTs from shorter amplicon sequences, such as those typically used in metabarcoding, than from the full length DNA barcodes used in the construction of barcode libraries. It was also more difficult to identify nuMTs in datasets with a high percentage of nuMTs. Existing bioinformatic pipelines used to process metabarcode sequences already remove some nuMTs, especially in the rare sequence removal step, but the addition of a pseudogene filtering step can remove up to 5% of sequences even when other filtering steps are in place.

Conclusions: Open reading frame length filtering, alone or combined with hidden Markov model profile analysis, can be used to effectively screen out apparent pseudogenes from large datasets. There is more to learn from COI nuMTs, such as their frequency in DNA barcoding and metabarcoding studies, their taxonomic distribution, and their evolution. Thus, we encourage the submission of verified COI nuMTs to public databases to facilitate future studies.
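The open reading frame length criterion mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the 0.9 coverage threshold, the restriction to the three forward frames, and the choice of TAA and TAG as stop codons (the invertebrate mitochondrial genetic code) are all assumptions made for illustration, and the HMM bit score step is not shown.

```python
# Assumption: stop codons of the invertebrate mitochondrial genetic code.
# Other taxa use other codes, so this set would need to change accordingly.
STOP_CODONS = frozenset({"TAA", "TAG"})


def longest_orf_length(seq, stop_codons=STOP_CODONS):
    """Return the length (in nt) of the longest stop-free run of codons
    found in any of the three forward reading frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        # Walk the frame codon by codon, ignoring a trailing partial codon.
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in stop_codons:
                best = max(best, run)
                run = 0
            else:
                run += 3
        best = max(best, run)
    return best


def flag_putative_pseudogenes(records, min_fraction=0.9):
    """Return IDs of sequences whose longest ORF covers less than
    min_fraction of the sequence length.

    records: dict mapping sequence ID -> nucleotide string.
    min_fraction: hypothetical cutoff chosen only for this example.
    """
    return [
        seq_id
        for seq_id, seq in records.items()
        if longest_orf_length(seq) < min_fraction * len(seq)
    ]


if __name__ == "__main__":
    toy = {
        "intact": "ATGGCAGGAGGC",   # no stop codon in frame 0
        "broken": "TAAATAAATAAA",   # stop codons interrupt every frame
    }
    print(flag_putative_pseudogenes(toy))
```

A fixed coverage fraction is the simplest possible rule. A real screen would more plausibly compare each ORF length against the distribution observed in the dataset and would also consider reverse-complement frames.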


“…Our pseudogenes removal approach was most effective on datasets of the full length COI barcode sequence region but is less effective for shorter sequences (∼ 300 bp). This is especially relevant now that newer sequencing technologies such as LoopSeq (compatible with Illumina sequencing platforms, but currently only available for RNA genes) or HiFi circular consensus sequencing (PacBio) could one day be used for COI metabarcoding targeting the full length of the barcoding region facilitating pseudogene detection [12, 77–79]. It would also be helpful if COI barcode studies reported and deposited full length verified pseudogenes into public databases when possible.…”

Section: Discussion (mentioning)

confidence: 99%

Profile hidden Markov model sequence analysis can help remove putative pseudogenes from DNA barcoding and metabarcoding datasets

Porter

Hajibabaei

2021

Preprint


Background: Pseudogenes are non-functional copies of protein-coding genes that typically follow a different molecular evolutionary path from functional genes. The inclusion of pseudogene sequences in DNA barcoding and metabarcoding analysis can lead to misleading results. None of the most widely used bioinformatic pipelines for processing marker gene (metabarcode) high-throughput sequencing data specifically accounts for the presence of pseudogenes in protein-coding marker genes. The purpose of this study is to develop a method to screen for obvious pseudogenes in large COI metabarcode datasets. We do this by (1) describing gene and pseudogene characteristics from a simulated DNA barcode dataset, (2) showing the impact of two different pseudogene removal methods on mock metabarcode datasets with simulated pseudogenes, and (3) incorporating a pseudogene filtering step in a bioinformatic pipeline that can be used to process Illumina paired-end COI metabarcode sequences. Open reading frame length and sequence bit scores from hidden Markov model (HMM) profile analysis were used to detect pseudogenes.

Results: Our simulations showed that it was more difficult to identify pseudogenes from shorter amplicon sequences, such as those typically used in metabarcoding (~300 bp), than from the full length DNA barcodes used in the construction of barcode libraries (~650 bp). It was also more difficult to identify pseudogenes in datasets with a high percentage of pseudogene sequences. We show that existing bioinformatic pipelines used to process metabarcode sequences already remove some apparent pseudogenes, especially in the rare sequence removal step, but the addition of a pseudogene filtering step can remove more.

Conclusions: The combination of open reading frame length and hidden Markov model profile analysis can be used to effectively screen out obvious pseudogenes from large datasets. There is more to learn from COI pseudogenes, such as their frequency in DNA barcode and metabarcoding studies, their taxonomic distribution, and their evolution. Thus, we encourage the submission of verified COI pseudogenes to public databases to facilitate future studies.
