The mapping of many short sequences of DNA, called reads, to a long reference genome is an common task in molecular biology. The task amounts to a simple string search, allowing for a few mismatches due to mutations and inexact read quality. While existing solutions attempt to align a high percentage of the reads using small memory footprints, Shepard is concerned with only exact matches and speed. Using the human genome, Shepard is on the order of hundreds of thousands of times faster than current software implementations such as SOAP2 or Bowtie, and about 60 times faster than GPU implementations such as SOAP3.Shepard contains two components: a software program to preprocess a reference genome into a hash table, and a hardware pipeline for performing fast lookups. The hash table has one entry for each unique 100 base pair sequence that occurs in the reference genome, and contains the index of last occurrence and the number of occurrences. To reduce the hash table size, a minimal perfect hash table is used. The hardware pipeline was designed to perform hash table lookups very quickly, on the order of 600 million lookups per second, and was implemented on a Convey HC-1 high performance reconfigurable computing system. Shepard streams all of the short reads through a custom hardware pipeline and writes the alignment data (index of last occurrence and number of occurrences) to a binary results array.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.