A neural network classification method is developed as an alternative approach to the large database search/organization problem. The system, termed Protein Classification Artificial Neural System (ProCANS), has been implemented on a Cray supercomputer for rapid superfamily classification of unknown proteins based on the information content of the neural interconnections. The system employs an n-gram hashing function that is similar to the k-tuple method for sequence encoding. A collection of modular back-propagation networks is used to store the large number of sequence patterns. The system has been trained and tested with the first 2,148 of the 8,309 entries of the annotated Protein Identification Resource protein sequence database (release 29). The entries included the electron transfer proteins and the six enzyme groups (oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases), with a total of 620 superfamilies. After a total training time of seven Cray central processing unit (CPU) hours, the system has reached a predictive accuracy of 90%. The classification is fast (i.e., 0.1 Cray CPU second per sequence), as it only involves a forward-feeding through the networks. The classification time on a full-scale system embedded with all known superfamilies is estimated to be within 1 CPU second. Although the training time will grow linearly with the number of entries, the classification time is expected to remain low even if there is a 10- to 100-fold increase in sequence entries. The neural database, which consists of a set of weight matrices of the networks, together with the ProCANS software, can be ported to other computers and made available to the genome community.
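
To illustrate the general idea of n-gram sequence encoding mentioned above, the following is a minimal sketch, not the ProCANS hashing function itself, whose details are not given here. It assumes a 20-letter amino acid alphabet, n = 2, and normalized counts as the network input; the names `AMINO_ACIDS`, `ngram_index`, and `encode` are hypothetical.

```python
# Assumption: the standard 20 amino acid one-letter codes form the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_index(gram, alphabet=AMINO_ACIDS):
    """Map an n-gram to a unique slot by reading it as a base-len(alphabet) number."""
    idx = 0
    for ch in gram:
        idx = idx * len(alphabet) + alphabet.index(ch)
    return idx


def encode(sequence, n=2, alphabet=AMINO_ACIDS):
    """Return a fixed-length vector of normalized n-gram counts.

    The vector has len(alphabet) ** n entries regardless of sequence length,
    which is what lets sequences of any length feed a fixed-size input layer.
    """
    vec = [0.0] * (len(alphabet) ** n)
    total = 0
    for i in range(len(sequence) - n + 1):
        gram = sequence[i:i + n]
        # Skip n-grams containing symbols outside the alphabet (e.g. X).
        if all(ch in alphabet for ch in gram):
            vec[ngram_index(gram, alphabet)] += 1
            total += 1
    if total:
        vec = [count / total for count in vec]
    return vec
```

Under these assumptions, every protein maps to a 400-element vector for n = 2, so a classifier only needs one forward pass over a fixed-size input, independent of sequence length.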
Rapid and accurate superfamily classification would be valuable for the organization of protein sequence databases and for gene recognition in large sequencing projects.

Keywords: database search; neural networks; protein classification; sequence analysis; superfamily

The continuing rapid growth of molecular sequencing data has generated a pressing need for advanced computational tools to analyze and manage the data. An ideal computer tool should allow the interpretation of genomic information from the sequences and permit easy organization of the information into a database to facilitate information retrieval. Currently, a database search for sequence similarities represents the most direct computational approach to the analysis of genomic information (Doolittle, Quicksearch method (Devereux, 1988) provides an even faster but less sensitive search against the database that is represented with a sparse hash table. A BLAST approach (Altschul et al., 1990), which directly approximates alignments that optimize a measure of local similarity, also permits fast sequence comparisons. In contrast to the above methods, which are designed for pairwise comparisons, a profile analysis method (Gribskov et al., 1987) provides a search against information from protein families instead of individual proteins using dynamic programming ...