The paper presents the history and present state of the GUHA method, its theoretical foundations, and its relation to and meaning for data mining.
A survey of development of the GUHA method

"GUHA" is the acronym for General Unary Hypotheses Automaton. The idea of the method is: given data, let the computer generate all (or as many as possible) interesting hypotheses of a given logical form that are supported by the data. This idea was elaborated by M. Chytil and P. Hájek in the mid-1960s, the first paper in English being [16]. The approach was as follows. The data to be processed form a rectangular matrix of zeros and ones, with rows corresponding to objects and columns to attributes (properties). Let $P_1, \dots, P_n$ be the names of the attributes. For each attribute $P_i$, $\neg P_i$ is the name of its negation. An elementary conjunction of length $k$ ($1 \le k \le n$) is a conjunction of $k$ literals in which each predicate occurs at most once, e.g. $\neg P_3$ or $P_1 \& \neg P_3 \& P_7$; an elementary disjunction is defined similarly (e.g. $P_1 \vee \neg P_3 \vee P_7$). An object satisfies an elementary conjunction if it satisfies all its members; it satisfies an elementary disjunction if it satisfies at least one of its members.

Let $0 < p \le 1$. A formula $A \Rightarrow_p S$, where $A$ is an elementary conjunction (the antecedent) and $S$ is an elementary disjunction (the succedent), is true in the data if at least $100p$ percent of the objects satisfying $A$ satisfy $S$, i.e. $a/r \ge p$, where $r$ is the number of objects satisfying $A$ and $a$ is the number of objects satisfying both $A$ and $S$. The antecedent $A$ is $t$-good (where $t$ is a natural number) if at least $t$ objects satisfy it. The version of GUHA described in [16] systematically generates the "strongest" true formulas $A \Rightarrow_p S$ with a $t$-good antecedent, notation: $A \Rightarrow_{p,t} S$. (Details omitted; "strongest" refers to a notion of a logical rule of immediate consequence among formulas of our form.)
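The truth condition just defined can be illustrated with a short Python sketch. It is not the implementation described in [16] and does not attempt the systematic generation of "strongest" formulas; it only checks a single formula against a 0/1 data matrix. All names (`holds`, `evaluate`, `DATA`, and so on) are hypothetical, the toy data are made up, attribute indices are 0-based (so $P_1$ is column 0), and treating an antecedent satisfied by no object as making the formula false is an assumption of this sketch.

```python
from typing import Dict, List, Tuple

Row = List[int]             # one object: 0/1 values of the attributes
Literal = Tuple[int, bool]  # (0-based attribute index, True for P_i, False for its negation)


def holds(row: Row, literal: Literal) -> bool:
    """Return True if the object satisfies the literal."""
    index, positive = literal
    return (row[index] == 1) == positive


def satisfies_conjunction(row: Row, literals: List[Literal]) -> bool:
    """An object satisfies an elementary conjunction if it satisfies all members."""
    return all(holds(row, lit) for lit in literals)


def satisfies_disjunction(row: Row, literals: List[Literal]) -> bool:
    """An object satisfies an elementary disjunction if it satisfies at least one member."""
    return any(holds(row, lit) for lit in literals)


def evaluate(data: List[Row], antecedent: List[Literal],
             succedent: List[Literal], p: float, t: int) -> Dict[str, object]:
    """Check A =>_{p,t} S: the antecedent is t-good (r >= t) and a/r >= p."""
    r = 0  # number of objects satisfying A
    a = 0  # number of objects satisfying both A and S
    for row in data:
        if satisfies_conjunction(row, antecedent):
            r += 1
            if satisfies_disjunction(row, succedent):
                a += 1
    t_good = r >= t
    # Assumption of this sketch: with r == 0 the formula is reported as not true.
    frequent_enough = r > 0 and a >= p * r  # same as a/r >= p, without dividing
    return {"r": r, "a": a, "t_good": t_good, "true": t_good and frequent_enough}


# Made-up data: six objects, three attributes P_1, P_2, P_3 (columns 0, 1, 2).
DATA: List[Row] = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
]
ANTECEDENT: List[Literal] = [(0, True), (1, False)]  # P_1 & not P_2
SUCCEDENT: List[Literal] = [(2, True)]               # P_3

print(evaluate(DATA, ANTECEDENT, SUCCEDENT, p=0.7, t=3))
```

In this toy matrix four objects satisfy the antecedent and three of them satisfy the succedent, so $a/r = 0.75$: the formula is true for $p = 0.7$ and $t = 3$, but not for $p = 0.8$, and the antecedent is not $t$-good for $t = 5$.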
See also [2].

The reader will easily recognize the similarity with the notion of an "associational rule with support and confidence" introduced by Agrawal [1] about 25 years later: his $A$ and $S$ are elementary conjunctions containing no negation, $p$ is the confidence, and the support is $t/m$, where $m$ is the number of all objects in the data. The formulas found by GUHA (i.e. by a computer program implementing it) have the form "almost all objects satisfying the antecedent satisfy the succedent (and the number of objects satisfying the antecedent is not too small)." It is stressed that the results found are formulas true in the data, and that they are hypotheses from the point of view of the universe from which the data are a sample. The slogan has been "GUHA offers everything interesting" (all hypotheses of the given form that are true in the data). The first implementation (by I. Havel) ran on a MINSK22 computer.

In 1968 Hájek (in a paper in Czech) suggested a different version based on the statistical Fisher test. Given $A$ and $S$ (now two elementary conjunctions with no predicates in common), let $a, b, c, d$ be the numbers of objects satisfying $A \& B$, $A \& \neg B$, $\neg A \& B$ and ¬...