Ortholog inference is a key step in understanding the evolution and function of a gene or other genomic feature. Yet often no similar sequence can be identified, or the true ortholog is hidden among false positives. A solution is to consider the sequence's genomic context. We present the generic program, synder, for tracing features of interest between genomes based on a synteny map. This approach narrows genomic search-space independently of the sequence of the feature of interest. We illustrate the utility of synder by finding orthologs for the Arabidopsis thaliana 13-member gene family of Nuclear Factor YC transcription factor across the Brassicaceae clade.
1 Introduction
A powerful first step in understanding the evolution and function of a genomic feature is resolving its genomic context, that is, comparing the feature to orthologous features in other species. Comparing multiple orthologous features across species allows evolutionary patterns to be uncovered. These patterns may include evidence of purifying selection, which implies the feature is important to the survival of the species; positive selection, implying the feature is rapidly evolving along one lineage; and functional dependencies between sites (for example, amino acids in an enzyme reaction site) [1]. These evolutionary trends have direct application in fields such as rational protein design [2]. Distinguishing between orthologs (homologous features arising through speciation) and paralogs (homologous features arising through gene duplication) is foundational to understanding the history of a feature. Genomic context is also critical for discerning the origins of the often large numbers of species-specific "orphan" genes that are found in most genome projects [3][4][5][6].
Identifying orthologs is not easy. A simple sequence similarity search of a query feature (e.g., a gene, transposon, miRNA, or any sequence interval) against a genome or proteome of a target species may obtain thousands of hits in a swooping continuum; these could include: the true ortholog, related family members (paralogs), and non-specific hits.
Therefore, methods for winnowing the search results have been developed to identify the true orthologs. A straightforward approach to identify orthologs of protein-coding genes is reciprocal best hits [7]. In this technique, a protein encoded by a gene from the focal species is searched (e.g., with BLAST) against the target proteome. The highest scoring gene is then searched back against the proteome of the focal species. If the top scoring hit of the second search is the original query gene, then the two genes are accepted as orthologs.
There are also methods that build on reciprocal best hits, such as the reciprocal smallest distance method that considers evolutionary distance in addition to similarity score [8].
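The reciprocal best hits procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of any published tool: the function names (best_hit, reciprocal_best_hits) and the input layout are hypothetical, and the sketch assumes that similarity searches have already been run in both directions and summarized as nested dictionaries of scores (for example, BLAST bit scores).

```python
def best_hit(scores):
    """Return the subject with the highest score, or None if there are no hits.

    `scores` maps subject gene -> similarity score (hypothetical layout).
    Ties are broken by dictionary order, which a real analysis should handle
    more carefully.
    """
    if not scores:
        return None
    return max(scores, key=scores.get)


def reciprocal_best_hits(forward, reverse):
    """Pair genes that are each other's top-scoring hit.

    forward: {focal_gene: {target_gene: score}}  (focal -> target search)
    reverse: {target_gene: {focal_gene: score}}  (target -> focal search)
    """
    pairs = []
    for query, hits in forward.items():
        target = best_hit(hits)
        if target is None:
            continue  # the query has no hit in the target proteome
        # Accept the pair only if the back-search returns the original query.
        if best_hit(reverse.get(target, {})) == query:
            pairs.append((query, target))
    return pairs


# Toy scores (invented for illustration only).
forward = {
    "A1": {"B1": 250.0, "B2": 90.0},
    "A2": {"B1": 120.0, "B2": 80.0},
}
reverse = {
    "B1": {"A1": 245.0, "A2": 118.0},
    "B2": {"A1": 85.0, "A2": 82.0},
}
print(reciprocal_best_hits(forward, reverse))
```

In this toy example, A2 also has B1 as its best hit, but the back-search from B1 returns A1, so only the pair (A1, B1) is accepted. This shows how a paralog can be excluded, and also how a query can be left with no ortholog call at all.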
Little or no significant sequence similarity is expected across species for some classes of features. A lack of significan...