In this letter we present a very general method for extracting information from a generic string of characters, e.g. a text, a DNA sequence or a time series. The method is based on data-compression techniques; its key point is the computation of a suitable measure of the remoteness between two bodies of knowledge. We present an implementation of the method for linguistically motivated problems, featuring highly accurate results for language recognition, authorship attribution and language classification. (PACS: 89.70.+c, 05.)

Many systems and phenomena in nature are represented in terms of sequences or strings of characters. In experimental investigations of physical processes, for instance, one typically has access to the system only through a measuring device which produces a time record of a certain observable, i.e. a sequence of data. Other systems are intrinsically described by strings of characters, e.g. DNA and protein sequences, or written language.

When analyzing a string of characters, the main question is how to extract the information it carries. For a DNA sequence this would correspond to identifying the sub-sequences codifying the genes and their specific functions. For a written text, on the other hand, one is interested in understanding it, i.e. recognizing the language in which the text is written, its author, the subject treated and eventually the historical background.

With the problem cast in this way, one is tempted to approach it from a very interesting point of view: that of information theory [1,2]. In this context the word information acquires a very precise meaning, namely the entropy of the string, a measure of the surprise the source emitting the sequence can reserve for us.

Evidently, the word information is used with different meanings in different contexts. Suppose now, for a moment, that we are able to measure the entropy of a given sequence (e.g. a text). Is it possible to obtain from this measure the information (in the semantic sense) we were trying to extract from the sequence? This is the question we address in this paper.

In particular, we define in a very general way a concept of remoteness (or similarity) between pairs of sequences based on their relative information content. We devise, without loss of generality with respect to the nature of the strings of characters, a method to measure this distance based on data-compression techniques. The specific question we address is whether this information-based distance between pairs of sequences is representative of the real semantic difference between them. It turns out that the answer is yes, at least within the framework of the examples on which we have implemented the method.
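To fix ideas, the sketch below shows one possible way a compression-based remoteness between two sequences could be estimated in practice. It is only an illustrative assumption, not necessarily the estimator adopted in this letter: it uses a standard zlib (LZ77-type) compressor and asks how many extra compressed bytes are needed to encode a sequence B once the compressor has already processed a sequence A; the smaller this excess, the closer B is taken to be to A. The function names (compressed_size, cross_distance) and the reference files in the usage example are hypothetical.

    import zlib

    def compressed_size(data: bytes) -> int:
        # Length in bytes of the zlib-compressed representation of data.
        return len(zlib.compress(data, 9))

    def cross_distance(a: bytes, b: bytes) -> float:
        # Illustrative compression-based remoteness of b from a.
        # Compare the cost of compressing b appended to a with the cost of
        # compressing a alone; the excess, normalized by the cost of b on
        # its own, is small when b is well described by the regularities
        # already present in a. This is one possible instantiation of a
        # compression-based distance, not necessarily the one used here.
        la = compressed_size(a)
        lb = compressed_size(b)
        lab = compressed_size(a + b)
        return (lab - la) / lb

    # Toy usage (hypothetical files): a fragment in the same language as the
    # reference should yield a smaller distance than a foreign-language one.
    if __name__ == "__main__":
        english = open("english_reference.txt", "rb").read()
        italian = open("italian_reference.txt", "rb").read()
        unknown = open("unknown_fragment.txt", "rb").read()
        print("d(English, fragment):", cross_distance(english, unknown))
        print("d(Italian, fragment):", cross_distance(italian, unknown))

Under this reading, the compressor plays the role of an entropy estimator: the compressed length per character of a long sequence approximates the entropy of its source, and the excess cost of encoding B "through" A probes how much of B's structure is already captured by A.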