In the development of a large-dictionary real-time speech recognition system, a commonly accepted approach is based on a multi-stage design (ref 1). In the first stages, starting from the acoustic data produced by uttering an item (syllable, word, sentence), a fast selection of a small subset of the vocabulary is performed. In the last stage, a detailed search for the most likely item is conducted over the previously identified subset. The selection, which must be as fast as possible, should always include the pronounced item; at the same time, it must have high resolving power, that is, it must keep the chosen subset small.

We approach the design of one of the stages by introducing equivalence classes among items, selected via the definition of an acoustical distance. Each item (a word in our case) is represented by a hidden Markov model (HMM), giving a statistical description of the relationship between words and acoustical data. We investigate two different definitions of distance between words: the first measures the capability of the model of one word to produce the acoustical data generated by uttering several instances of another word; the second is based on differences in the structure and parameters of the word models.

Starting from the resulting distance matrix, a classification method based on a minimal spanning tree is applied; it finds a classification that keeps low the number of words to be selected for the following detailed phase.
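The minimal-spanning-tree classification step described above can be sketched as follows. This is an illustrative assumption, not the paper's exact procedure: the toy distance matrix, the function names, and the criterion of cutting the k-1 longest MST edges to obtain k equivalence classes are all hypothetical choices standing in for whatever word-distance values the HMM-based definitions would produce.

```python
def mst_edges(dist):
    """Prim's algorithm: return the MST edges (i, j, d) of a full,
    symmetric distance matrix given as a list of lists."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge crossing from the tree to the rest.
        i, j = min(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: dist[e[0]][e[1]])
        edges.append((i, j, dist[i][j]))
        in_tree.add(j)
    return edges

def classes_from_mst(dist, k):
    """Cut the k-1 longest MST edges; the connected components that
    remain are the k equivalence classes of words."""
    n = len(dist)
    edges = sorted(mst_edges(dist), key=lambda e: e[2])
    kept = edges[: n - k]  # drop the k-1 longest edges
    # Union-find over the kept edges to recover the components.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j, _ in kept:
        parent[find(i)] = find(j)
    groups = {}
    for w in range(n):
        groups.setdefault(find(w), []).append(w)
    return list(groups.values())

# Toy symmetric distance matrix for 5 "words": words 0, 1, 2 are
# acoustically close to each other, as are words 3 and 4, while the
# two groups are far apart.
D = [[0, 1, 2, 9, 9],
     [1, 0, 1, 9, 9],
     [2, 1, 0, 9, 9],
     [9, 9, 9, 0, 1],
     [9, 9, 9, 1, 0]]

print(classes_from_mst(D, 2))  # → [[0, 1, 2], [3, 4]]
```

Cutting the longest MST edges is equivalent to single-linkage clustering, so the resulting classes group words whose models are mutually close under the chosen acoustical distance; at recognition time, only the words in the class matched by the fast stage need to be passed to the detailed search.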