Overview

There are three factors involved in text classification: the classification model, the similarity measure, and the document representation. In this chapter, we focus on document representation and demonstrate that the choice of representation has a profound impact on classification quality. We also show that text quality affects the choice of document representation. Our experiments use centroid-based classification, a simple and robust text classification scheme. We compare four types of document representation: N-grams, single terms, phrases, and a logic-based document representation called RDR. The N-gram representation is string-based and involves no linguistic processing. The single-term approach is based on words with minimal linguistic processing. The phrase approach is based on linguistically formed phrases and single words. RDR relies on linguistic processing and represents documents as a set of logical predicates.

Our experiments on many text collections yielded similar results. Here, we base our arguments on experiments conducted on the Reuters-21578 collection and the contest (ASRS) collection (see Appendix). We show that RDR, the most complex representation, produces the most effective classification on Reuters-21578, followed by the phrase approach. However, on the ASRS collection, which contains many syntactic errors (noise), the 5-gram approach outperforms all other methods by 13%, because it is robust in the presence of noise. The more complex models produce better classification results on clean text, but since they depend on natural language processing (NLP) techniques, they are vulnerable to noise.
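To make the combination of a character 5-gram representation and centroid-based classification concrete, the following is a minimal sketch. It is not the chapter's implementation. It assumes raw 5-gram counts as features, cosine similarity as the similarity measure, and one centroid per class formed from length-normalized document vectors. The function names and the toy training data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=5):
    """Count overlapping character n-grams; no linguistic processing is applied."""
    s = " ".join(text.lower().split())  # normalize case and whitespace only
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def normalize(vec):
    """Scale a sparse vector to unit Euclidean length (empty if the norm is zero)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def train_centroids(labeled_docs, n=5):
    """Build one unit-length centroid per class from (text, label) pairs."""
    sums = defaultdict(Counter)
    for text, label in labeled_docs:
        for gram, weight in normalize(char_ngrams(text, n)).items():
            sums[label][gram] += weight
    return {label: normalize(vec) for label, vec in sums.items()}


def classify(text, centroids, n=5):
    """Assign the label whose centroid is most similar to the document."""
    doc = normalize(char_ngrams(text, n))
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))


# Hypothetical toy data, for illustration only.
train = [
    ("wheat harvest and grain exports rose", "grain"),
    ("corn and wheat crop prices", "grain"),
    ("crude oil prices and opec output", "oil"),
    ("oil barrels refinery crude", "oil"),
]
centroids = train_centroids(train)
print(classify("whaet crop", centroids))  # misspelled input still shares 5-grams
```

Because the features are overlapping character windows, a misspelled word such as "whaet" still shares some 5-grams with the training text through its neighbors, which is one intuition for why this representation tolerates noise that would disrupt an NLP-based pipeline.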
In the last few years, the World Wide Web has changed tremendously. Now accessible to millions of users from hundreds of countries, it has started to show new online behaviors. Following these new patterns, we now see many multilingual activities taking place on a large scale. In this paper, we analyze how these emerging usage patterns can affect the Machine Translation community. We identify the main motivations behind these activity patterns. Using examples, we compare traditional approaches to resource collection with new online approaches. We then present experimental results from an online community designed to collect parallel corpora.