In this letter we present a very general method for extracting information from a generic string of characters, e.g. a text, a DNA sequence or a time series. The method is based on data-compression techniques; its key point is the computation of a suitable measure of the remoteness between two bodies of knowledge. We present an implementation of the method for linguistically motivated problems, featuring highly accurate results for language recognition, authorship attribution and language classification. (PACS: 89.70.+c, 05.)

Many systems and phenomena in nature are represented in terms of sequences or strings of characters. In experimental investigations of physical processes, for instance, one typically has access to the system only through a measuring device which produces a time record of a certain observable, i.e. a sequence of data. Other systems are intrinsically described by strings of characters, e.g. DNA and protein sequences, or written language.

When analyzing a string of characters, the main question is how to extract the information it carries. For a DNA sequence this would correspond to identifying the sub-sequences codifying the genes and their specific functions. For a written text, on the other hand, one is interested in understanding it, i.e. recognizing the language in which the text is written, its author, the subject treated and eventually the historical background.

With the problem cast in this way, one is tempted to approach it from a very interesting point of view: that of information theory [1,2]. In this context the word information acquires a very precise meaning, namely the entropy of the string, a measure of the surprise the source emitting the sequence can reserve for us.

Evidently, the word information is used with different meanings in different contexts. Suppose now, for a moment, that we are able to measure the entropy of a given sequence (e.g. a text). Is it possible to obtain from this measure the information (in the semantic sense) we were trying to extract from the sequence? This is the question we address in this paper.

In particular, we define in a very general way a concept of remoteness (or similarity) between pairs of sequences based on their relative information content. We devise, without loss of generality with respect to the nature of the strings of characters, a method to measure this distance based on data-compression techniques. The specific question we address is whether this information-based distance between pairs of sequences is representative of the real semantic difference between them. It turns out that the answer is yes, at least within the framework of the examples on which we have implemented the method.
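To fix ideas, the sketch below shows one possible way a compression-based remoteness between two sequences could be estimated in practice. It is only an illustrative assumption, not necessarily the estimator adopted in this letter: it uses a standard zlib (LZ77-type) compressor and asks how many extra compressed bytes are needed to encode a sequence B once the compressor has already processed a sequence A; the smaller this excess, the closer B is taken to be to A. The function names (compressed_size, cross_distance) and the reference files in the usage example are hypothetical.

    import zlib

    def compressed_size(data: bytes) -> int:
        # Length in bytes of the zlib-compressed representation of data.
        return len(zlib.compress(data, 9))

    def cross_distance(a: bytes, b: bytes) -> float:
        # Illustrative compression-based remoteness of b from a.
        # Compare the cost of compressing b appended to a with the cost of
        # compressing a alone; the excess, normalized by the cost of b on
        # its own, is small when b is well described by the regularities
        # already present in a. This is one possible instantiation of a
        # compression-based distance, not necessarily the one used here.
        la = compressed_size(a)
        lb = compressed_size(b)
        lab = compressed_size(a + b)
        return (lab - la) / lb

    # Toy usage (hypothetical files): a fragment in the same language as the
    # reference should yield a smaller distance than a foreign-language one.
    if __name__ == "__main__":
        english = open("english_reference.txt", "rb").read()
        italian = open("italian_reference.txt", "rb").read()
        unknown = open("unknown_fragment.txt", "rb").read()
        print("d(English, fragment):", cross_distance(english, unknown))
        print("d(Italian, fragment):", cross_distance(italian, unknown))

Under this reading, the compressor plays the role of an entropy estimator: the compressed length per character of a long sequence approximates the entropy of its source, and the excess cost of encoding B "through" A probes how much of B's structure is already captured by A.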