A New Method for Symbolic Sequences Analysis. An Application to Long Sequences

Kozarzewski, B.

doi:10.12921/cmst.2014.20.03.93-100

Cited by 2 publications

(5 citation statements)

References 12 publications


“…has been introduced in [7]. The value of similarity varies between 0 when the spectra are disjoint sets and 1 when sequences C_1 and C_2 are mutual copies.…”

Section: Graphical Representations

mentioning

confidence: 99%
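
The quoted statement fixes only the two end points of the similarity measure (0 for disjoint spectra, 1 for mutual copies); the definition from [7] is not reproduced on this page. As a minimal sketch, assuming a Jaccard index over the two word sets (one measure with those end points, not necessarily the one used in [7]), with the hypothetical function name `spectrum_similarity`:

```python
def spectrum_similarity(spectrum_a, spectrum_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two word spectra.

    Assumption: the source only states the end points (0 and 1);
    the Jaccard index is used here as an illustrative stand-in.
    """
    a, b = set(spectrum_a), set(spectrum_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty spectra count as copies.
        return 1.0
    return len(a & b) / len(a | b)


disjoint = spectrum_similarity(["ab", "c"], ["d", "ef"])  # no shared words
copies = spectrum_similarity(["ab", "c"], ["ab", "c"])    # identical spectra
partial = spectrum_similarity(["a", "b"], ["b", "c"])     # one of three words shared
```

Any set-overlap measure normalised to the interval [0, 1] would match the quoted description equally well.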


Numerical Representation of Symbolic Data

Kozarzewski

2015

CMST


A method of direct numerical representation of symbolic data is proposed. The method starts by parsing a sequence into an ordered set (spectrum) of distinct, non-overlapping short strings of symbols (words). Next, the word spectrum is mapped onto a vector of binary components in a high-dimensional linear space. The numerical representation allows for some arithmetic operations on symbolic data. Among them is a meaningful average spectrum of two sequences. As a test, the new numerical representation is used to build centroid vectors for the k-means clustering algorithm, which significantly improved the clustering quality. The advantage over the conventional approach is a high rate of correct clustering for several real character sequences, such as novels, DNA and proteins.
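
The pipeline described in the abstract (parse, map to a binary vector, average) can be sketched as follows. The abstract does not state the parsing rule, so this sketch assumes an LZ78-style incremental parse, in which each word is the shortest string not seen before. The function names `parse_spectrum`, `to_binary_vector` and `average_vector` are hypothetical, and the vocabulary is simply the sorted union of the two spectra.

```python
def parse_spectrum(sequence):
    """Split a sequence into distinct, non-overlapping words (assumed LZ78-style rule)."""
    seen = set()
    spectrum = []
    word = ""
    for symbol in sequence:
        word += symbol
        if word not in seen:
            # Shortest string not yet in the spectrum: record it and start a new word.
            seen.add(word)
            spectrum.append(word)
            word = ""
    # A trailing word that is already in the spectrum is dropped to keep words distinct.
    return spectrum


def to_binary_vector(spectrum, vocabulary):
    """Component i is 1 if vocabulary[i] occurs in the spectrum, else 0."""
    present = set(spectrum)
    return [1 if w in present else 0 for w in vocabulary]


def average_vector(u, v):
    """Component-wise mean of two vectors of equal length."""
    return [(a + b) / 2 for a, b in zip(u, v)]


s1 = parse_spectrum("abababab")
s2 = parse_spectrum("aabbaabb")
vocabulary = sorted(set(s1) | set(s2))
v1 = to_binary_vector(s1, vocabulary)
v2 = to_binary_vector(s2, vocabulary)
mean = average_vector(v1, v2)
```

In the averaged vector, a component of 1.0 marks a word present in both spectra, 0.5 a word present in only one, which is one way to read the "average spectrum of two sequences" mentioned in the abstract.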



“…In Ref. [7] the sets of the most similar pairs, fours or eights of sequences as the initialising set were discussed. The average vector of each set was considered as the starting centroid location.…”

Section: Graphical Representations

mentioning

confidence: 99%
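
The initialisation quoted above (take the most similar group of sequences and use its average vector as a starting centroid) can be illustrated for the simplest case, a pair. This is a sketch under stated assumptions: the sequences are already given as binary vectors, similarity is a Jaccard index over their non-zero components (not confirmed by the source), and `jaccard` and `initial_centroid` are hypothetical names.

```python
from itertools import combinations


def jaccard(u, v):
    """Shared non-zero components divided by components non-zero in either vector."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 1.0


def initial_centroid(vectors):
    """Average vector of the most similar pair (assumed initialisation rule)."""
    i, j = max(
        combinations(range(len(vectors)), 2),
        key=lambda pair: jaccard(vectors[pair[0]], vectors[pair[1]]),
    )
    return [(a + b) / 2 for a, b in zip(vectors[i], vectors[j])]


vectors = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
centroid = initial_centroid(vectors)  # first two vectors are the most similar pair
```

Extending this to fours or eights would require a rule for scoring the similarity of a whole group, which the quoted passage does not specify.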


“…In the recent issue of CMST there appeared an interesting paper [1] by B. Kozarzewski on the new method for symbolic sequences analysis. This method was tested on several long sequences, in particular on the digits of the so-called "Champernowne number", which we will denote below as C_10.…”

mentioning

confidence: 99%
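
For reference, the decimal Champernowne number is formed by concatenating the positive integers after the decimal point: 0.123456789101112… A minimal sketch of a digit generator follows; the function name `champernowne_digits` is hypothetical and this is not the code used in [1].

```python
from itertools import count, islice


def champernowne_digits():
    """Yield the decimal digits of C_10 after the decimal point, one character at a time."""
    for n in count(1):
        # Each positive integer contributes its decimal digits in order.
        yield from str(n)


first_20 = "".join(islice(champernowne_digits(), 20))
```

Because the generator is lazy, `islice` can take a prefix of any length, which makes it easy to produce the kind of long test sequence the quoted passage refers to.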

“…E. Borel had proved in 1909 [3] that almost all real numbers are normal and the first explicit example of a normal number was C_10. On page 98 of [1], right column, it is written about C_10 "The number is assumed to be transcendental". In fact, the number is transcendental, as follows from the more general theorem proved by K. Mahler in [4].…”

mentioning

confidence: 99%