In contrast to the fairly reliable and complete annotation of the protein-coding genes in the human genome, comparable information is lacking for non-coding RNAs. We present a comparative screen of vertebrate genomes for structural non-coding RNAs, which evaluates sequence conservation, secondary structure conservation, and thermodynamic stability of putative RNA structures. We predict more than 30 000 structured RNA elements in the human genome, almost 1000 of which are conserved across all vertebrates. Roughly a third are found in introns of known genes, a sixth are potential regulatory elements in untranslated regions, and about half are located far from any known gene. Only a small fraction of these sequences has been described previously. EST data demonstrate, however, that the majority of them are at least transcribed. The widespread conservation of secondary structure points to a large number of functional ncRNAs in the human genome, which we estimate to be comparable to the number of protein-coding genes.
The recent finishing of the human genome sequence emphasizes the "need for reliable experimental and computational methods for comprehensive identification of non-coding RNAs" 1 . A variety of experimental techniques have been used to uncover the human and mouse transcriptomes, in particular tiling arrays 2-4 , cDNA sequencing 5,6 , and unbiased mapping of transcription factor binding sites 7 . All these studies agree that a substantial fraction of the genome is transcribed and that a large fraction of the transcriptome consists of non-coding RNAs. It is unclear, however, what fraction consists of functional non-coding RNAs (ncRNAs) and what fraction constitutes "transcriptional noise" 8 .
Genome-wide computational surveys of ncRNAs, on the other hand, have been impossible until recently, because ncRNAs do not share common signals that could be detected at the sequence level.
A large class of ncRNAs, however, has characteristic structures that are functional and hence are well conserved over evolutionary timescales: most of the "classical" ncRNAs, including rRNAs, tRNAs, snRNAs, snoRNAs, as well as the RNA components of RNase P and the signal recognition particle, are of this type. The stabilizing selection acting on the secondary structure causes characteristic substitution patterns in the underlying sequences: consistent and compensatory mutations replace one type of base pair with another in the paired regions (helices) of the molecule. In addition, loop regions are more variable than helices. These patterns can be exploited in comparative computational approaches 9-12 to discriminate functional RNAs from other types of conserved sequence. Recently, high levels of sequence conservation of non-coding DNA regions have been reported 13,14 . Here we screen the complete collection of conserved non-coding DNA sequences from mammalian genomes and provide a first annotation of the complement of structurally conserved RNAs in the human genome.
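To make the substitution patterns described above concrete, the following minimal Python sketch counts, for each base pair of a given consensus structure, whether the aligned sequences keep the pair unchanged, change one side (a consistent mutation, e.g. GC to GU), change both sides (a compensatory mutation, e.g. GC to AU), or break the pair. It is an illustration only and not the method used in this screen: the function names `parse_pairs` and `classify_substitutions` are hypothetical, the alignment is assumed to be gap-free, the consensus structure is assumed to be supplied in dot-bracket notation, and the first sequence is arbitrarily taken as the reference.

```python
# Illustrative sketch only; names and conventions are assumptions, not the paper's method.

# Watson-Crick pairs plus the G-U wobble pair.
VALID_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def parse_pairs(structure):
    """Return sorted (i, j) index pairs from a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unexpected ')'")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: unclosed '('")
    return sorted(pairs)


def classify_substitutions(alignment, structure):
    """Count substitution types per paired column, relative to alignment[0].

    Assumes a gap-free alignment whose rows all have the length of `structure`.
    """
    ref = alignment[0]
    labels = ("conserved", "consistent", "compensatory")
    counts = {"conserved": 0, "consistent": 0, "compensatory": 0, "incompatible": 0}
    for i, j in parse_pairs(structure):
        for seq in alignment[1:]:
            if (seq[i], seq[j]) not in VALID_PAIRS:
                counts["incompatible"] += 1  # the pair cannot form in this sequence
                continue
            # Number of pair positions that differ from the reference: 0, 1 or 2.
            changed = (seq[i] != ref[i]) + (seq[j] != ref[j])
            counts[labels[changed]] += 1
    return counts


if __name__ == "__main__":
    structure = "((((...))))"
    alignment = [
        "GGCAGAAUGCC",  # reference
        "GACAGAAUGUU",  # G-C -> G-U (consistent), G-C -> A-U (compensatory)
        "GGCAGAAAGCC",  # A-U -> A-A (incompatible)
    ]
    print(classify_substitutions(alignment, structure))
```

A helix supported by many consistent and compensatory changes and few incompatible ones is the kind of signal that comparative approaches use to separate structured RNAs from sequence that is merely conserved.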
ResultsSelection of conserved sequences and screening for structural RNAs We s...