The advances of high-throughput sequencing offer an unprecedented opportunity to study genetic variation. This is challenged by the difficulty of resolving variant calls in repetitive DNA regions. We present a Bayesian method to estimate repeat-length variation from paired-end sequence read data. The method makes variant calls based on deviations in sequence fragment sizes, allowing the analysis of repeats at lengths of relevance to a range of phenotypes. We demonstrate the method’s ability to detect and quantify changes in repeat lengths from short read genomic sequence data across genotypes. We use the method to estimate repeat variation among 12 strains of Arabidopsis thaliana and demonstrate experimentally that our method compares favourably against existing methods. Using this method, we have identified all repeats across the genome, which are likely to be polymorphic. In addition, our predicted polymorphic repeats also included the only known repeat expansion in A. thaliana, suggesting an ability to discover potential unstable repeats.
The 3′ UTRs of eukaryotic genes participate in a variety of post-transcriptional (and some transcriptional) regulatory interactions. Some of these interactions are well characterised, but an undetermined number remain to be discovered. While some regulatory sequences in 3′ UTRs may be conserved over long evolutionary time scales, others may have only ephemeral functional significance as regulatory profiles respond to changing selective pressures. Here we propose a sensitive segmentation methodology for investigating patterns of composition and conservation in 3′ UTRs based on comparison of closely related species. We describe encodings of pairwise and three-way alignments integrating information about conservation, GC content and transition/transversion ratios and apply the method to three closely related Drosophila species: D. melanogaster, D. simulans and D. yakuba. Incorporating multiple data types greatly increased the number of segment classes identified compared to similar methods based on conservation or GC content alone. We propose that the number of segments and number of types of segment identified by the method can be used as proxies for functional complexity. Our main finding is that the number of segments and segment classes identified in 3′ UTRs is greater than in the same length of protein-coding sequence, suggesting greater functional complexity in 3′ UTRs. There is thus a need for sustained and extensive efforts by bioinformaticians to delineate functional elements in this important genomic fraction. C code, data and results are available upon request.
BackgroundComputational identification of non-coding RNAs (ncRNAs) is a challenging problem. We describe a genome-wide analysis using Bayesian segmentation to identify intronic elements highly conserved between three evolutionarily distant vertebrate species: human, mouse and zebrafish. We investigate the extent to which these elements include ncRNAs (or conserved domains of ncRNAs) and regulatory sequences.ResultsWe identified 655 deeply conserved intronic sequences in a genome-wide analysis. We also performed a pathway-focussed analysis on genes involved in muscle development, detecting 27 intronic elements, of which 22 were not detected in the genome-wide analysis. At least 87% of the genome-wide and 70% of the pathway-focussed elements have existing annotations indicative of conserved RNA secondary structure. The expression of 26 of the pathway-focused elements was examined using RT-PCR, providing confirmation that they include expressed ncRNAs. Consistent with previous studies, these elements are significantly over-represented in the introns of transcription factors.ConclusionsThis study demonstrates a novel, highly effective, Bayesian approach to identifying conserved non-coding sequences. Our results complement previous findings that these sequences are enriched in transcription factors. However, in contrast to previous studies which suggest the majority of conserved sequences are regulatory factor binding sites, the majority of conserved sequences identified using our approach contain evidence of conserved RNA secondary structures, and our laboratory results suggest most are expressed.Functional roles at DNA and RNA levels are not mutually exclusive, and many of our elements possess evidence of both. Moreover, ncRNAs play roles in transcriptional and post-transcriptional regulation, and this may contribute to the over-representation of these elements in introns of transcription factors. We attribute the higher sensitivity of the pathway-focussed analysis compared to the genome-wide analysis to improved alignment quality, suggesting that enhanced genomic alignments may reveal many more conserved intronic sequences.Electronic supplementary materialThe online version of this article (doi:10.1186/s12864-017-3645-2) contains supplementary material, which is available to authorized users.
Many biological sequences have a segmental structure that can provide valuable clues to their content, structure, and function. The program changept is a tool for investigating the segmental structure of a sequence, and can also be applied to multiple sequences in parallel to identify a common segmental structure, thus providing a method for integrating multiple data types to identify functional elements in genomes. In the previous edition of this book, a command line interface for changept is described. Here we present a graphical user interface for this package, called changeptGUI. This interface also includes tools for pre- and post-processing of data and results to facilitate investigation of the number and characteristics of segment classes.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.