2005
DOI: 10.1088/1742-5468/2005/04/p04002

Artificial sequences and complexity measures

Abstract: In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools for extracting, in an automatic and agnostic way, information from a generic string of characters. In particular, we introduce a class of methods that make crucial use of data compression techniques to define a measure of remoteness and distance between pairs of sequences of characters (e.g. texts) based on their relative information content. We also discuss in det…
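The class of methods the abstract describes can be illustrated with a short sketch. The function below is a hypothetical, minimal example of a zip-based remoteness estimate in the spirit of compression-based sequence comparison (cf. Benedetto, Caglioti, & Loreto, 2002), not the paper's exact construction: it counts the extra compressed bytes needed to encode text `b` once the compressor has adapted to text `a`.

```python
import zlib

def zip_remoteness(a: str, b: str) -> float:
    """Illustrative zip-based remoteness of text b from text a.

    The extra compressed length of a+b over a alone, per character
    of b, approximates the relative information content of b given a:
    the closer b is to a, the fewer extra bytes are needed.
    (Sketch only; real uses must account for the compressor's finite
    adaptation window, e.g. zlib's 32 KB history.)
    """
    z_a = len(zlib.compress(a.encode("utf-8")))
    z_ab = len(zlib.compress((a + b).encode("utf-8")))
    return (z_ab - z_a) / max(len(b), 1)
```

Given a fragment and two candidate reference texts, the reference yielding the smaller value is the closer one; repeating this over all pairs of texts yields the kind of distance matrix used in compression-based classification.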


Cited by 11 publications (11 citation statements)
References: 70 publications
“…Compression algorithms are especially efficient when examining natural language because it contains so many redundancies (Brillouin, 2004). Scholars have demonstrated how zipping can be used in measuring language similarity between two or more texts (Baronchelli, Caglioti, & Loreto, 2005) and how compression algorithms and entropy-based approaches are useful to measure online texts (Gordon, Cao, & Swanson, 2007; Huffaker, Jorgensen, Iacobelli, Tepper, & Cassell, 2006; Nigam, Lafferty, & McCallum, 1999; Schneider, 1996). In our study, entropy represents the number of similar linguistic choices at both word and phrase levels.…”
Section: Independent Measures (mentioning)
confidence: 99%
“…More recently, researchers have experimented with data compression algorithms as a measure of document complexity and similarity. This technique uses compression ratios as an approximation of a document's information entropy (Baronchelli, Caglioti, & Loreto, 2005; Benedetto, Caglioti, & Loreto, 2002). Standard zipping algorithms have demonstrated effectiveness in a variety of document comparison and classification tasks.…”
Section: Computational Measures of Language Similarity (mentioning)
confidence: 99%
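As a concrete illustration of "compression ratios as an approximation of a document's information entropy," here is a minimal sketch assuming a standard DEFLATE compressor (the function name is hypothetical):

```python
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size over raw size: a rough entropy proxy.

    Highly redundant (low-entropy) documents compress well and give
    a small ratio; near-random (high-entropy) documents give a ratio
    close to, or even above, 1.0 due to compressor overhead.
    """
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)
```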
“…For a given compressor, and two objects (e.g. bit-strings) of equal length, the object with the shorter compressed representation can be considered less complex, since it contains more identifiable regularity [1].…”
Section: Measuring Complexity (mentioning)
confidence: 99%
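A minimal sketch of that comparison, assuming a general-purpose compressor stands in for the ideal one (the function name is hypothetical):

```python
import zlib

def simpler_of(x: bytes, y: bytes) -> bytes:
    """Return whichever of two equal-length objects compresses shorter,
    i.e. the one with more regularity the compressor can identify."""
    if len(x) != len(y):
        raise ValueError("the comparison assumes equal-length objects")
    return x if len(zlib.compress(x)) <= len(zlib.compress(y)) else y
```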