OverviewThere are three factors involved in text classification: the classification model, the similarity measure, and the document representation. In this chapter, we will focus on document representation and demonstrate that the choice of document representation has a profound impact on the quality of the classification. We will also show that the text quality affects the choice of document representation. In our experiments we have used the centroid-based classification, which is a simple and robust text classification scheme. We will compare four different types of document representation: N-grams, single terms, phrases, and a logic-based document representation called RDR. The N-gram representation is a string-based representation with no linguistic processing. The single-term approach is based on words with minimum linguistic processing. The phrase approach is based on linguistically formed phrases and single words. The RDR is based on linguistic processing and representing documents as a set of logical predicates. Our experiments on many text collections yielded similar results. Here, we base our arguments on experiments conducted on Reuters-21578 and contest (ASRS) collection (see Appendix). We show that RDR, the more complex representation, produces more effective classification on Reuters-21578, followed by the phrase approach. However, on the ASRS collection, which contains many syntactic errors (noise), the 5-gram approach outperforms all other methods by 13%. That is because the 5-gram approach is a robust method in presence of noise. The more complex models produce better classification results, but since they are dependent on natural language processing (NLP) techniques, they are vulnerable to noise.