Proteins are generally classified into the following 12 subcellular locations: 1) chloroplast, 2) cytoplasm, 3) cytoskeleton, 4) endoplasmic reticulum, 5) extracellular, 6) Golgi apparatus, 7) lysosome, 8) mitochondria, 9) nucleus, 10) peroxisome, 11) plasma membrane, and 12) vacuole. The function of a protein is closely correlated with its subcellular location. With the rapid increase in new protein sequences entering databanks, it is therefore vitally important for both basic research and the pharmaceutical industry to establish a high-throughput tool for predicting protein subcellular location. In this paper, a new concept, the so-called "functional domain composition", is introduced. Based on this concept, a protein can be represented as a vector in a high-dimensional space, where each of the clustered functional domains derived from the protein universe serves as a vector base. With such a representation for a protein, the support vector machine ...

According to their localization or compartment in a cell, proteins are generally classified into the following 12 categories: 1) chloroplast, 2) cytoplasm, 3) cytoskeleton, 4) endoplasmic reticulum, 5) extracellular, 6) Golgi apparatus, 7) lysosome, 8) mitochondria, 9) nucleus, 10) peroxisome, 11) plasma membrane, and 12) vacuole. Given the sequence of a protein, how can we predict which category or subcellular location it belongs to? This is an important problem because the subcellular location of a protein is closely correlated with its biological function. Although the subcellular location of a protein can be determined by conducting various experiments, doing so is both time-consuming and costly. The number of sequences entering databanks has been increasing rapidly: in 1986 the total number of sequence entries in SWISS-PROT (1) was only 3,939, whereas by 1999 it had increased to 80,000. The problem has therefore become an urgent challenge.
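To make the vector representation described in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical four-entry list of domain identifiers (`DOMAIN_BASIS`) and a binary encoding, in which a component is 1 if the protein contains the corresponding domain and 0 otherwise. Neither the identifiers nor the encoding is specified in this section.

```python
from typing import Iterable, List, Sequence

# Hypothetical domain identifiers used only for illustration.
# A real domain database would supply a far larger basis.
DOMAIN_BASIS: List[str] = ["DOM_A", "DOM_B", "DOM_C", "DOM_D"]


def domain_composition(
    hits: Iterable[str], basis: Sequence[str] = DOMAIN_BASIS
) -> List[int]:
    """Map the domains found in one protein to a fixed-length vector.

    Assumption: binary encoding (1 = domain present, 0 = absent).
    Domains that are not part of the basis are ignored.
    """
    hit_set = set(hits)
    return [1 if domain in hit_set else 0 for domain in basis]


if __name__ == "__main__":
    # A protein in which the hypothetical domains DOM_B and DOM_D were found.
    print(domain_composition(["DOM_B", "DOM_D"]))
```

Every protein is mapped to a vector of the same length regardless of its sequence length, which is what allows such vectors to be passed to a standard classifier.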
In particular, it is anticipated that many more new protein sequences will be derived soon because of the recent success of the human genome project, which has provided an enormous amount of genomic information in the form of 3 billion base pairs assembled into tens of thousands of genes. The challenge will therefore become even more urgent and critical. Many efforts have been made to develop computational methods for quickly predicting the subcellular locations of proteins (2-13). It is instructive to point out that most of these algorithms are based on the amino acid composition alone, without including any sequence-order effects, and that some (9, 12, 13) are based on the pseudo amino acid composition, which incorporates partial sequence-order effects. To further improve the prediction quality, a logical and key step would be to find an effective way to incorporate the sequence-order effects. The present study was initiated in an attempt to explore a different approach to incorporating these kinds of effects. The core of the new approach is ...