Manuel Montes y Gómez scite author profile

Villaseñor

Errecalde

2016

Text classification is a widely studied problem, and it can be considered solved for some domains and under certain circumstances. There are scenarios, however, that have received little or no attention, despite their relevance and applicability. One such scenario is early text classification, where one needs to know the category of a document by using partial information only. A document is processed as a sequence of terms, and the goal is to devise a method that can make predictions as fast as possible. The importance of this variant of the text classification problem is evident in domains like sexual predator detection, where one wants to identify an offender as early as possible. This paper analyzes the suitability of the standard naïve Bayes classifier for approaching this problem. Specifically, we assess its performance when classifying documents after seeing an increasing number of terms. A simple modification to the standard naïve Bayes implementation allows us to make predictions with partial information. To the best of our knowledge, naïve Bayes has not been used for this purpose before. Through an extensive experimental evaluation we show the effectiveness of the classifier for early text classification. What is more, we show that this simple solution is very competitive when compared with state-of-the-art methodologies that are more elaborate. We foresee that our work will pave the way for the development of more effective early text classification techniques based on the naïve Bayes formulation.
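As a rough illustration of predicting from partial information, the sketch below trains a multinomial naïve Bayes model and emits a prediction after each term of a document is read. It is not the authors' implementation: the function names `train_nb` and `prefix_predictions`, the Laplace smoothing, and the choice to skip out-of-vocabulary terms are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect the counts needed by a multinomial naive Bayes model."""
    class_counts = Counter(labels)          # documents per class
    term_counts = defaultdict(Counter)      # term frequencies per class
    for terms, label in zip(docs, labels):
        term_counts[label].update(terms)
    vocab = {t for terms in docs for t in terms}
    return class_counts, term_counts, vocab

def prefix_predictions(model, terms):
    """Yield the most probable class after each successive term is read."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    # Start from the log priors; every new term adds one log-likelihood.
    scores = {c: math.log(n / total_docs) for c, n in class_counts.items()}
    totals = {c: sum(term_counts[c].values()) for c in class_counts}
    for term in terms:
        if term in vocab:  # assumption: unseen terms are simply ignored
            for c in scores:
                # Laplace (add-one) smoothing, an assumption of this sketch.
                likelihood = (term_counts[c][term] + 1) / (totals[c] + len(vocab))
                scores[c] += math.log(likelihood)
        yield max(scores, key=scores.get)

# Toy usage with made-up documents and labels.
docs = [["meet", "secret", "alone"], ["homework", "school", "game"],
        ["secret", "photo", "alone"], ["game", "school", "friends"]]
labels = ["risk", "safe", "risk", "safe"]
model = train_nb(docs, labels)
early = list(prefix_predictions(model, ["secret", "alone", "game"]))
```

Because the class scores are running sums of per-term log-likelihoods, reading one more term only costs one update per class, which is what makes the classifier usable on document prefixes.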

A Passage Retrieval System for Multilingual Question Answering

Soriano

Arnal

et al. 2005

In this paper we present a new method to improve the coverage of Passage Retrieval (PR) systems when they are used for Question Answering (QA) tasks. The ranking of passages obtained by the PR system is rearranged to emphasize the passages that are more likely to contain the answer. The new ranking is based on finding the n-gram structures of the question that are present in the passage, and the weight of a passage increases when it contains longer n-gram structures of the question. The results we present show that applying this method notably improves the coverage of the classical PR system based on the Vector Space Model.

We would like to thank CONACyT for partially supporting this work under the grant 43990A-1 as well as R2D2 CICYT (TIC2003-07158-C04-03) and ICT EU-India (ALA/95/23/2003/077-054) research projects.

V. Matoušek et al. (Eds.): TSD 2005, LNAI 3658, pp. 443-450, 2005. © Springer-Verlag Berlin Heidelberg 2005
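The n-gram reranking idea can be sketched as follows. The abstract does not give the actual weighting formula, so the scoring used here (each question n-gram found in the passage contributes its length n) is an assumption, as are the function names and the whitespace tokenization.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_score(question, passage):
    """Score a passage by the question n-grams it contains."""
    q = question.lower().split()
    p = passage.lower().split()
    score = 0
    for n in range(1, len(q) + 1):
        shared = ngrams(q, n) & ngrams(p, n)
        # Assumed weighting: a shared n-gram of length n adds n to the score,
        # so longer matches weigh more than shorter ones.
        score += n * len(shared)
    return score

def rerank(question, passages):
    """Reorder passages so that higher-scoring ones come first."""
    return sorted(passages, key=lambda p: ngram_score(question, p), reverse=True)

# Toy usage with made-up passages.
question = "who wrote don quixote"
passages = ["quixote is a novel",
            "don juan wrote a letter",
            "cervantes wrote don quixote in 1605"]
ranked = rerank(question, passages)
```

In this toy case the passage containing the trigram "wrote don quixote" moves to the top, while passages that share only isolated words stay lower.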

A Genetic Programming Approach for Driving Score Calculation in the Context of Intelligent Transportation Systems

et al. 2018

Learning When to Classify for Early Text Classification

Loyola

Errecalde

2018

Abstract. The problem of classification in supervised learning is a widely studied one. Nonetheless, there are scenarios that have received little attention despite their applicability. One such scenario is early text classification, where one needs to know the category of a document as soon as possible. The importance of this variant of the classification problem is evident in tasks like sexual predator detection, where one wants to identify an offender as early as possible. This paper presents a framework for early text classification which highlights the two main pieces involved in this problem: classification with partial information and deciding the moment of classification. In this context, a novel approach that learns the second component (when to classify) and an adaptation of a temporal measurement for multi-class problems are introduced. Results on a classical text classification corpus, compared against a model that reads entire documents, confirm the feasibility of our approach.
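The two components named in the abstract can be separated in code as two callables: one that classifies a prefix and one that decides whether to stop reading. The sketch below is only illustrative. The paper learns the stopping component, whereas here a fixed confidence threshold stands in for it; the keyword-based classifier, the word lists, and all names are hypothetical.

```python
def early_classify(terms, classify_prefix, should_stop):
    """Read terms one at a time; return (label, number_of_terms_read)."""
    label = None
    for k in range(1, len(terms) + 1):
        label, confidence = classify_prefix(terms[:k])
        if should_stop(confidence, k):
            return label, k
    return label, len(terms)  # never confident enough: use the whole document

# Hypothetical keyword lists for a toy two-class problem.
SPORT = {"goal", "match", "team"}
TECH = {"cpu", "code", "software"}

def keyword_classifier(prefix):
    """Toy partial-information classifier based on keyword votes."""
    sport = sum(t in SPORT for t in prefix)
    tech = sum(t in TECH for t in prefix)
    label = "sport" if sport >= tech else "tech"
    confidence = (max(sport, tech) + 1) / (sport + tech + 2)  # smoothed vote share
    return label, confidence

def threshold_rule(confidence, k, threshold=0.7):
    """Stand-in stopping rule: stop once confidence reaches the threshold."""
    return confidence >= threshold

terms = ["the", "team", "scored", "a", "goal", "in", "the", "match"]
decision = early_classify(terms, keyword_classifier, threshold_rule)
```

Keeping the stopping rule as a separate argument means a learned decision model could replace `threshold_rule` without touching the prefix classifier.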

Instance Selection in Text Classification Using the Silhouette Coefficient Measure

Dey

Solorio

2011

Abstract. The paper proposes the use of the Silhouette Coefficient (SC) as a ranking measure to perform instance selection in text classification. Our selection criterion was to keep instances with mid-range SC values while removing the instances with high and low SC values. We evaluated our hypothesis across three well-known datasets and various machine learning algorithms. The results show that our method helps to achieve the best trade-off between classification accuracy and training time.
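A minimal sketch of the selection criterion, assuming class labels play the role of clusters, instances are numeric vectors, and distances are Euclidean. The bounds `low` and `high` that define the mid-range are hypothetical parameters, since the abstract does not state which cut-offs were used.

```python
import math

def silhouette_values(points, labels):
    """Silhouette coefficient of each instance, using labels as clusters."""
    values = []
    for i, (p, own) in enumerate(zip(points, labels)):
        dists = {}  # label -> distances from p to the other instances
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(lab, []).append(math.dist(p, q))
        if own not in dists or len(dists) < 2:
            values.append(0.0)  # singleton class or only one class present
            continue
        a = sum(dists[own]) / len(dists[own])  # mean intra-class distance
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != own)
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

def select_mid_range(points, labels, low, high):
    """Return indices of instances whose silhouette lies in [low, high]."""
    sc = silhouette_values(points, labels)
    return [i for i, s in enumerate(sc) if low <= s <= high]

# Toy usage with one-dimensional made-up data.
points = [(0,), (1,), (4,), (5,), (9,), (10,)]
labels = ["A", "A", "A", "B", "B", "B"]
kept = select_mid_range(points, labels, low=0.0, high=0.7)
```

With these bounds the instance with the highest coefficient and the one with a negative coefficient are dropped, and the remaining four are kept for training.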