Key words: Schistosoma mansoni - bioinformatics - expressed sequence tags - clustering analysis - metabolism

Schistosoma mansoni is a dioecious trematode and one of the etiologic agents of schistosomiasis, the second most significant tropical disease in terms of public health. Despite recent efforts to contain its progress, the disease is still endemic in several countries, with around 200 million people infected by the parasite (http://www.who.int/ctd/schisto/epidemio.htm). The study of S. mansoni is, therefore, very important in human parasitology. Knowledge of the genome of this parasite is essential for a better understanding of its metabolism and biology and will help to elucidate important aspects of the mechanisms of drug resistance and antigenic variation that allow it to escape from the host immune system (Franco et al. 2000).

The size of the S. mansoni genome is estimated at 270 Mb, with the number of expressed genes ranging from 15,000 to 20,000 (Simpson et al. 1982, Franco & Simpson 2001). Although some genomic sequences of S. mansoni have been produced, the Schistosoma Genome Network (SGN) has chosen as its first priority the sequencing of cDNA using the expressed sequence tag (EST) strategy, from which it is possible to obtain relevant information quickly.

Despite yielding important information quickly, ESTs available from public databases, such as dbEST, show some degree of redundancy and contain a large number of errors, because they are single-pass sequences (Miller et al. 1999). To overcome these problems and to increase the length of the sequences, which facilitates identification by homology searches, clustering procedures are performed (Oliveira & Johnston 2001). In this kind of procedure, sequences that share a region of similarity are joined into a cluster. Sequences that overlap and represent a single gene are therefore placed in the same cluster, which decreases redundancy.
The sequences of each cluster are then aligned to generate a consensus sequence. In this approach, the base present at each sequence position (and, if available, the quality value assigned by the base-calling program) is considered in the construction of a high-quality consensus (Huang & Madan 1999). The clustering procedure can, therefore, have two outcomes: consensus sequences, generated by aligning the sequences of a cluster, and singlets, which are sequences that were not grouped with any others. Theoretically, each sequence (either a consensus or a singlet) should represent an individual gene, and so these sequences are called uniques. Because each unique is expected to represent a single gene, comparing the number of uniques with the total number of predicted genes makes it possible to estimate, approximately, how many genes have not yet been discovered.
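The procedure described above can be illustrated with a minimal sketch. It is not the pipeline used by the SGN, nor the assembly algorithm of Huang & Madan (1999). It is a toy version resting on several assumptions: two reads are joined whenever they share one exact k-mer (single-linkage clustering), reverse complements and sequencing errors are ignored, the consensus is a plain majority vote over reads that are already aligned (gaps written as "-"), and quality values are not used. All function names and the value of k are hypothetical; the figure of 15,000 genes is the lower bound quoted in the text.

```python
from collections import Counter, defaultdict


def cluster_by_shared_kmers(seqs, k=12):
    """Single-linkage clustering: join reads that share an exact k-mer."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, seq in enumerate(seqs):
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            if kmer in first_seen:
                union(idx, first_seen[kmer])
            else:
                first_seen[kmer] = idx

    clusters = defaultdict(list)
    for idx in range(len(seqs)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


def majority_consensus(aligned_reads):
    """Majority vote per column over equal-length, pre-aligned reads."""
    consensus = []
    for column in zip(*aligned_reads):
        bases = [b for b in column if b != "-"]
        consensus.append(Counter(bases).most_common(1)[0][0] if bases else "N")
    return "".join(consensus)


def count_uniques(clusters):
    """Return (number of consensus sequences, number of singlets)."""
    n_consensus = sum(1 for c in clusters if len(c) > 1)
    n_singlets = sum(1 for c in clusters if len(c) == 1)
    return n_consensus, n_singlets


def estimate_undiscovered(n_uniques, predicted_genes):
    """Predicted genes not yet represented by any unique (never negative)."""
    return max(predicted_genes - n_uniques, 0)


# Toy reads: the first two overlap, the third stands alone.
reads = ["ACGTACGTTAGC", "CGTTAGCAATGG", "TTTTGGGGCCCC"]
clusters = cluster_by_shared_kmers(reads, k=6)   # [[0, 1], [2]]
n_consensus, n_singlets = count_uniques(clusters)
remaining = estimate_undiscovered(n_consensus + n_singlets, 15000)
```

In this toy input, the two overlapping reads form one cluster (one consensus) and the third read remains a singlet, giving two uniques. Because real ESTs contain errors and may come from either strand, production tools rely on alignment scores rather than exact k-mer matches, so this sketch should be read only as a picture of the bookkeeping.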