Background
Only 1.5% of the human genome encodes proteins, while most of the remaining encodes noncoding RNAs (ncRNA). Many ncRNAs form structures and perform many important functions. Accurately identifying structured ncRNAs in the human genome and discovering their biological functions remain a major challenge.
Results
Here, we have established a pipeline (CM-line) with the following features for analyzing the large genomes of humans and other animals. First, we selected species with larger genetic distances to facilitate the discovery of covariations and compatible mutations. Second, we used CMfinder, which can generate useful alignments even with low sequence conservation. Third, we removed repetitive sequences and known structured ncRNAs to reduce the workload of CMfinder. Fourth, we used Infernal to find more representatives and refine the structure. We reported 11 classes of structured ncRNA candidates with significant covariations in humans. Functional analysis showed that these ncRNAs have variable functions. Some may regulate circadian clock genes through poly (A) signals (PAS); some may regulate the elongation factor (EEF1A) and the T-cell receptor signaling pathway by cooperating with RNA binding proteins.
Conclusions
By searching for important features of RNA structure from large genomes, the CM-line has revealed the existence of a variety of novel structured ncRNAs. Functional analysis provides evidence for the potential biological functions of some newly found ncRNA motifs. The pipeline we have established for the discovery of structured ncRNAs and the identification of their functions can also be applied to analyze other large genomes.