Concepts and methods

Proc. Natl. Acad.

(i) Multiple Sequence Count Distribution of Common Word Occurrences. We define $f_k(v_1, v_2, \ldots, v_s) = f_k(\mathbf{v}) = n_{\mathbf{v}}$ as the number of k-words that are represented across the s sequences according to the vector $\mathbf{v}$, that is, k-words occurring $v_i$ times in the i-th sequence. We present the bivariate count distributions, along with permutation ranges, of common 8-words of the α- and β-globin DNA sequences of human and mouse (Table 1). Similar results hold for the human and mouse β-globin and mouse α- and β-globin comparisons (data not shown). The frequency of shared words is considerably higher than that obtained from corresponding permuted sequences. The number of words shared once between the human α- and β-globin sequences is also significantly high but closer to the permutation range. The fact that the human and mouse globin sequences decisively share k-words for all $k \ge 6$ undoubtedly reflects the common function and evolution of these genes. The human α- and β-globin sequences also show significant coincidence in k-word counts, but to a reduced degree, consistent with the precept of a more remote divergence time of the α- and β-globin genes than the split times of the human and mouse lineages. The significant but reduced α- and β-globin common word counts ($k \ge 6$) perhaps also suggest partially differentiated structure or function between these globin genes.

(ii) Multiple Sequence High Frequency (hf) Words. A criterion based on the cumulative number of occurrences across the s sequences prescribes a high frequency word of length k as a k-word satisfying $\sum_{i=1}^{s} v_i \ge v^*(k)$ or alternatively $v_i \ge v_i^*$ for $i = 1, \ldots, s$. Table 2 lists the hf shared k-words (k = 6-9) among Ig-κ sequences. All the hf shared words of ≥7 base pairs (bp) relate to the "consensus" nonamer (GGTTTTTGT) postulated to play a control role in variable region (V)/joining region (J) rearrangement during B-cell differentiation (2-4). Among the three sequences, the mouse sequence tends to contain the most hf words of length ≥8 bp.
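The count distribution of (i) and the hf criteria of (ii) can be illustrated with a minimal Python sketch. It is not the authors' program: the function names (`kword_counts`, `count_distribution`, `hf_words`), the choice to count overlapping occurrences, and the toy sequences and thresholds are all assumptions made for illustration, and the thresholds merely stand in for v*(k) and the per-sequence bounds.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count occurrences of every k-word in seq (overlapping windows: an assumption)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def count_distribution(seqs, k):
    """Return (f_k, vectors).

    vectors maps each k-word to its occurrence vector v = (v_1, ..., v_s);
    f_k maps each vector v to the number of distinct k-words having that vector.
    Only words present in at least one sequence are tabulated.
    """
    per_seq = [kword_counts(s, k) for s in seqs]
    words = set().union(*per_seq)
    vectors = {w: tuple(c[w] for c in per_seq) for w in words}
    return Counter(vectors.values()), vectors


def hf_words(vectors, total_min=None, per_seq_min=None):
    """Select high frequency words.

    total_min:   hypothetical stand-in for v*(k), applied to sum_i v_i.
    per_seq_min: hypothetical per-sequence bounds, applied as v_i >= bound_i.
    A word is kept if it meets whichever criterion was supplied.
    """
    hits = []
    for word, v in vectors.items():
        if total_min is not None and sum(v) >= total_min:
            hits.append(word)
        elif per_seq_min is not None and all(
            x >= m for x, m in zip(v, per_seq_min)
        ):
            hits.append(word)
    return sorted(hits)


# Toy input, not globin or Ig data.
toy_seqs = ["ACGTACGT", "ACGTTT"]
f3, vecs = count_distribution(toy_seqs, 3)
shared_hf = hf_words(vecs, total_min=3)
```

In this toy case two 3-words (ACG and CGT) occur twice in the first sequence and once in the second, so `f3[(2, 1)]` is 2 and both words pass a cumulative threshold of 3. Comparing such counts with those from shuffled copies of the sequences would give the permutation ranges mentioned in (i); that step is omitted here.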
Note that all hf 6-words of Table 2 can be formed by a single point mutation in a string of one-letter iterations with subsequent tandem duplications (cf. ref. 1). Interestingly, the common form of the polyadenylylation sequences (AATAAA) occurs nine times in each of the human and mouse sequences and six times in the rabbit sequence. This suggests that this oligonucleotide alone cannot be a decisive signal for transcription termination.

(iii) Long Common Words. For s sequences we defined $K_{r,s}$ as the length of the longest word occurring in at least r of the s sequences. With the aid of Theorem I in ref. 1, the statistical significance of a long r-of-s shared word can be assessed. We present in Table 3 a number of examples of the several distinct longest r-of-s common words (block identities).

(a) Virtually all the significant common oligonucleotides in the papova-, papilloma-, and hepatitis viruses relate to similar coding regions and are in alignment [i.e., spacings between successive significant block identities are about the same length for each sequence...