We propose a novel index-based online text classification method, investigate two index models, and compare the performance of various index granularities on English and Chinese SMS messages. Based on the proposed method, six individual classifiers were implemented, each using a different text feature of Chinese messages, and these were then combined to form an ensemble classifier. The experimental results on the English corpus show that relevance features among words can increase classification confidence, and that the trigram co-occurrence of words is an appropriate relevance feature. The experimental results on a real Chinese corpus show that a classifier using the word-level index model performs better than one using the document-level index model. Trigram segmentation outperforms exact segmentation in indexing, so it is not necessary to segment Chinese text exactly when indexing with the proposed method. By applying parallel multi-threaded ensemble learning, the proposed method has constant time complexity, which is critical for large-scale data and online filtering.
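To make the idea of indexing without exact segmentation concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "trigram segment" means overlapping character trigrams, and the `TokenIndex` class, its label names (`"spam"`, `"ham"`), and its vote-counting score are hypothetical choices made only for illustration.

```python
from collections import defaultdict


def char_trigrams(text, n=3):
    """Split text into overlapping character n-grams, with no word segmentation."""
    text = "".join(text.split())  # drop whitespace so n-grams span word gaps
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class TokenIndex:
    """Hypothetical token-level index: token -> label -> count (an assumption)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def add(self, text, label):
        """Index one labeled message; can be called online, one message at a time."""
        for tok in char_trigrams(text):
            self.counts[tok][label] += 1

    def score(self, text):
        """Share of spam votes among indexed trigrams; 0.5 if none were seen."""
        spam = ham = 0
        for tok in char_trigrams(text):
            entry = self.counts.get(tok)
            if entry:
                spam += entry.get("spam", 0)
                ham += entry.get("ham", 0)
        total = spam + ham
        return spam / total if total else 0.5
```

In this sketch a Chinese message such as "免费领取奖品" is indexed as four overlapping trigrams, so no dictionary-based segmenter is required before indexing.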
Based on an investigation of email document structure, this paper proposes a multi-field learning (MFL) framework, which breaks the multi-field document text classification (TC) problem into several sub-document TC problems and makes the final category prediction by a weighted linear combination of the sub-document TC results. Many existing statistical TC algorithms can easily be rebuilt within the MFL framework by converting their binary result into a spamminess score, a real number that reflects the likelihood that the classified email is spam. The experimental results on the TREC spam track show that the performance of many TC algorithms can be improved within the MFL framework.
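The weighted linear combination of sub-document results can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the field names, the weights, the normalization by total weight, and the 0.5 decision threshold are all hypothetical.

```python
def combine_field_scores(field_scores, weights):
    """Weighted linear combination of per-field spamminess scores.

    field_scores: dict mapping field name -> spamminess score in [0, 1]
    weights:      dict mapping field name -> non-negative weight
    Dividing by the total weight (an assumption) keeps the result in [0, 1].
    """
    total_weight = sum(weights[field] for field in field_scores)
    if total_weight == 0:
        raise ValueError("at least one field must have a positive weight")
    weighted_sum = sum(weights[field] * score
                       for field, score in field_scores.items())
    return weighted_sum / total_weight


def is_spam(field_scores, weights, threshold=0.5):
    """Final category prediction from the combined score (threshold is assumed)."""
    return combine_field_scores(field_scores, weights) >= threshold


# Hypothetical per-field scores produced by separate sub-document classifiers.
scores = {"subject": 0.9, "body": 0.6}
weights = {"subject": 1.0, "body": 3.0}
combined = combine_field_scores(scores, weights)
```

Each field's classifier only needs to emit a real-valued score, which is why an algorithm with a binary output must first be adapted to produce a spamminess score before it fits this scheme.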