Abstract: A method is proposed for decomposing a symbolic sequence into a set of consecutive, distinct, non-overlapping strings (words) of various lengths. Representing the sequence as a set of words allows one to use notions from set theory. The main result is a new definition of the similarity between any two sequences over a given alphabet; no prior sequence alignment is necessary. Two applications of the set of words are described in the present paper. In the first, the similarity measure is used to prepare centroids for the K-means algorithm, which yields a high-performance grouping method for long DNA sequences. The second application concerns the statistical analysis of word attributes. It is shown that the similarity, complexity, and correlation function of word attributes, computed across the digit sequences of the fractional parts of some irrational numbers, support the suggestion that these sequences are instances of a random sequence of decimal digits.
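To make the idea of a set-based, alignment-free similarity concrete, the following is a minimal sketch under two assumptions that the abstract does not confirm: the decomposition is taken to be an incremental parsing in which each new word is the shortest prefix of the remaining sequence not yet in the set, and the similarity is taken to be the Jaccard index of the two word sets. The names `decompose` and `jaccard_similarity` are hypothetical and chosen only for illustration; the paper's own definitions may differ.

```python
def decompose(seq):
    """Split seq into consecutive, distinct, non-overlapping words.

    Assumption (not from the abstract): each word is the shortest
    prefix of the remaining sequence that has not been seen before.
    """
    words = []
    seen = set()
    start = 0
    for end in range(1, len(seq) + 1):
        word = seq[start:end]
        if word not in seen:
            seen.add(word)
            words.append(word)
            start = end
    # A trailing fragment that repeats an earlier word is dropped,
    # so every returned word is distinct.
    return words


def jaccard_similarity(seq_a, seq_b):
    """Similarity of two sequences as the Jaccard index of their word sets.

    Assumption: the Jaccard index stands in for the paper's similarity
    definition, which the abstract does not spell out.
    """
    set_a = set(decompose(seq_a))
    set_b = set(decompose(seq_b))
    union = set_a | set_b
    if not union:
        return 1.0  # two empty sequences are treated as identical
    return len(set_a & set_b) / len(union)


if __name__ == "__main__":
    print(decompose("AACGTAACGT"))
    print(jaccard_similarity("AACGTAACGT", "AACGTTTGCA"))
```

Because the comparison works on word sets rather than on positions, the two sequences need not have the same length and no alignment step is involved.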