Ultrafast mapping of short reads to transcriptomic and metagenomic references via lightweight mapping techniques such as pseudoalignment has substantially accelerated several types of analyses without much loss in accuracy compared to alignment-based approaches. Applying pseudoalignment to large reference sequences such as the genome is, however, not trivial, owing to the large size of the references or "targets" (i.e., chromosomes) and the presence of repetitive sequences within an individual reference sequence. Repeats can produce multiple matching locations for a k-mer within a single reference, which in turn can lead to false positive mappings and incorrect reference assignments for a read when the colors across the k-mer matches for a read are aggregated. Even when the read is determined to map to the appropriate reference, the increased occurrence of k-mer multi-matches within a reference can prevent the determination of the correct approximate position of the read, which is often critical in applications that map short reads to the genome. We propose a modified pseudoalignment scheme that partitions each reference into "virtual colors". These are overlapping bins of fixed maximal extent on the reference sequences that are treated as distinct "colors" from the perspective of the pseudoalignment algorithm. A mapped k-mer is assigned a virtual color id that encodes the combination of the reference and the within-reference bin in which the k-mer occurs. When the k-mers across a read are aggregated, the intersection is performed on virtual colors instead of the original colors (references) to determine the compatible set of targets (bins). The virtual colors can then be mapped back to the original references to provide the final mappings.
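The virtual color idea can be illustrated with a minimal Python sketch. This is not the alevin-fry-atac implementation: the function names, the `bin_size` and `overlap` parameters, the id layout (a running count of bins across references), and the choice to let each bin extend `overlap` bases into the next one are all assumptions made for illustration. The sketch takes k-mer hits as `(ref_id, pos)` pairs, as an index lookup might return them, and intersects virtual colors across the k-mers of a read.

```python
from bisect import bisect_right
from itertools import accumulate


def bin_offsets(ref_lengths, bin_size):
    """Number of bins preceding each reference (hypothetical id layout)."""
    n_bins = [max(1, -(-length // bin_size)) for length in ref_lengths]
    return [0] + list(accumulate(n_bins))


def virtual_colors(ref_id, pos, offsets, bin_size, overlap):
    """Virtual colors for one k-mer hit at `pos` on reference `ref_id`."""
    b = pos // bin_size
    colors = {offsets[ref_id] + b}
    # Assumed overlap rule: bin b-1 extends `overlap` bases into bin b.
    if b > 0 and pos - b * bin_size < overlap:
        colors.add(offsets[ref_id] + b - 1)
    return colors


def decode(vcolor, offsets, bin_size):
    """Map a virtual color back to (ref_id, bin start position)."""
    ref_id = bisect_right(offsets, vcolor) - 1
    return ref_id, (vcolor - offsets[ref_id]) * bin_size


def map_read(kmer_hits, offsets, bin_size, overlap):
    """Intersect virtual colors over the k-mers of a read.

    `kmer_hits` holds one list of (ref_id, pos) hits per k-mer.
    """
    compatible = None
    for hits in kmer_hits:
        colors = set()
        for ref_id, pos in hits:
            colors |= virtual_colors(ref_id, pos, offsets, bin_size, overlap)
        compatible = colors if compatible is None else compatible & colors
    return sorted(decode(c, offsets, bin_size) for c in (compatible or ()))


# Two toy references; both k-mers also hit reference 1, but in distant bins.
offsets = bin_offsets([10_000, 5_000], bin_size=1_000)
read = [[(0, 2450), (1, 400)], [(0, 2480), (1, 3900)]]
print(map_read(read, offsets, bin_size=1_000, overlap=100))
```

Intersecting on the original colors would keep both references for this read, since each k-mer occurs on both. Intersecting on virtual colors keeps only the bin on reference 0 where the two hits co-locate. Because the binning is computed from hit positions at query time, `bin_size` and `overlap` can be changed without touching the index in this sketch.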
The projection of the original reference sequences into virtual color space, and the corresponding modifications to the pseudoalignment procedure, can be applied dynamically at program invocation and without any modification of the underlying index itself. This makes the setting and modification of instance-appropriate parameters efficient and straightforward and the approach widely applicable. We apply this modified pseudoalignment procedure to process and map single-cell ATAC-seq data in our new tool alevin-fry-atac. We compare alevin-fry-atac to both Chromap and Cell Ranger ATAC. Alevin-fry-atac is highly scalable and, when using 32 threads, is approximately 1.78 times faster than Chromap (the second fastest approach) while using approximately 3 times less memory and mapping slightly more reads. The resulting peaks and clusters generated from alevin-fry-atac show high concordance with those obtained from both Chromap and the Cell Ranger ATAC pipeline, demonstrating that virtual color-enhanced pseudoalignment directly to the genome provides a fast, memory-frugal, and accurate alternative to existing approaches for single-cell ATAC-seq processing. The development of alevin-fry-atac brings single-cell ATAC-seq processing into a unified ecosystem with single-cell RNA-seq processing (via alevin-fry) to work toward providing a truly open alternative to many of the varied capabilities of Cell Ranger. Furthermore, our modified pseudoalignment approach should be easily applicable and extendable to other genome-centric mapping-based tasks and modalities such as standard DNA-seq, DNase-seq, ChIP-seq and Hi-C.