Comment on “Language Trees and Zipping”

Khmelev, Dmitry V.; Teahan, William J.

doi:10.1103/physrevlett.90.089803

Cited by 20 publications

(15 citation statements)

References 7 publications

“…This method was strongly criticized by several researchers (Goodman, 2002; Khmelev & Teahan, 2003b) indicating many weaknesses. First, it is too slow since it has to call the compression algorithm so many times (as many as the training texts).…”

Section: Similarity-based Models (mentioning)

confidence: 99%

A survey of modern authorship attribution methods

Stamatatos

2008

J. Am. Soc. Inf. Sci.

Authorship attribution supported by statistical or computational methods has a long history starting from 19th century and marked by the seminal study of Mosteller and Wallace (1964) on the authorship of the disputed Federalist Papers. During the last decade, this scientific field has been developed substantially taking advantage of research advances in areas such as machine learning, information retrieval, and natural language processing. The plethora of available electronic texts (e.g., e-mail messages, online forum messages, blogs, source code, etc.) indicates a wide variety of applications of this technology provided it is able to handle short and noisy text from multiple candidate authors. In this paper, a survey of recent advances of the automated approaches to attributing authorship is presented examining their characteristics for both text representation and text classification. The focus of this survey is on computational requirements and settings rather than linguistic or literary issues. We also discuss evaluation methodologies and criteria for authorship attribution studies and list open questions that will attract future work in this area.

“…Looking for better approximations, however, we must not lose general applicability to arbitrary domains. In [6] it is argued that nth order Markov chain models perform better than gzip for text recognition. But optimization to particular domains might endanger generality.…”

Section: A Complex Task: Image Retrieval (mentioning)

confidence: 99%

“…This suggests that compression algorithms might be useful to offer a principled way to measure pattern similarity. For the domain of text, Benedetto et al have shown that ordinary file compression tools based on the Lempel-Ziv algorithm (LZ77) [1] can perform language recognition and authorship attribution, although these programs were never designed for this kind of tasks [2], [3] (for comments see [4], [5], [6]). Given n versions T 1 .…”

Section: A Recognition By Compression (mentioning)

confidence: 99%

Compression for visual pattern recognition

Heidemann

Ritter

2008

2008 3rd International Symposium on Communications, Control and Signal Processing

To date, computer science solves pattern recognition problems by highly task specific algorithms. Searching for a generic, unifying principle of pattern recognition, Benedetto et al. showed that compression is a good candidate: For the domain of text, approximating the mutual information of patterns by the achievable compression factors allows to detect similarity with surprising accuracy. Here we show that this principle is much more general than expected, since the common compressor gzip is able to solve a major computer vision problem, the classification of image categories. This result is remarkable for three reasons: Firstly, the applied compression programs were never designed for pattern recognition tasks of any kind; secondly, hardly any other method is able to deal with patterns as different as text and pixel images alike without any modification; thirdly, compression solves the entire task in a single step, without any data preprocessing, feature extraction or the need for parametrization. We will discuss the theoretical background of this finding and point out the role of compression in a yet to develop general theory of pattern recognition.
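The compression-based similarity idea described in the citation statements above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure of Benedetto et al. or of any paper listed here: it uses Python's standard `zlib` module (DEFLATE, an LZ77-based scheme) in place of the gzip tool, and the helper names `compressed_size`, `append_cost`, and `attribute` are hypothetical.

```python
import zlib
from typing import Dict, Tuple


def compressed_size(data: bytes) -> int:
    """Return the length in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def append_cost(reference: bytes, unknown: bytes) -> int:
    """Extra compressed bytes needed when `unknown` is appended to `reference`.

    A small value suggests the compressor could reuse patterns it had
    already seen in `reference` to encode `unknown`.
    """
    return compressed_size(reference + unknown) - compressed_size(reference)


def attribute(unknown: str, references: Dict[str, str]) -> Tuple[str, Dict[str, int]]:
    """Pick the reference label whose text makes `unknown` cheapest to encode.

    Assumption: one reference text per candidate label (language or author).
    Returns the best label and the per-label append costs.
    """
    unknown_bytes = unknown.encode("utf-8")
    scores = {
        label: append_cost(text.encode("utf-8"), unknown_bytes)
        for label, text in references.items()
    }
    best = min(scores, key=scores.get)
    return best, scores
```

Note that `attribute` compresses every reference text twice per query, which is the cost the Stamatatos survey criticizes ("as many as the training texts"). The DEFLATE window is also limited to 32 KiB, so only the tail of a long reference text can influence the encoding of the appended fragment.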

“…Perhaps the most controversial work is that of Benedetto et al [46] who addressed the authorship authentication problem using gzip coupled with BCN (discussed in Section III-B1). To get more information about what happened, the interested reader is referred to following sources [47], [48], [49], [50], [51].…”

Section: B Previous Work On CTC (mentioning)

confidence: 99%

Compression-based arabic text classification

Taamneh

Keshek

Issa

et al.

2014

2014 IEEE/ACS 11th International Conference on Computer Systems and Applications (AICCSA)

Text classification (TC) is one of the fundamental problems in text mining. Plenty of works exist on TC with interesting approaches and excellent results; however, most of these works follow a word-based approach for feature extraction. In this work, we are interested in an alternative (byte-based or character-based) approach known as compression-based TC (CTC). CTC has been used for some languages such as English and Portuguese and it is shown to have certain advantages/disadvantages compared with word-based approaches. This work applies CTC on the Arabic language with the purpose of investigating whether these advantages/disadvantages exists for the Arabic language as well. The results are encouraging as they show the viability of using CTC for Arabic TC.
