Text representation models are the basis for information retrieval and text mining tasks. Although different text models have been proposed, they typically target specific task aspects in isolation, such as time efficiency, accuracy, or applicability to different scenarios. Here we present Bag of Textual Graphs (BoTG), a general text representation model that addresses these three requirements at the same time. The proposed representation is based on a graph-based scheme that encodes term proximity and term ordering, and maps text documents into an efficient vector space that addresses all these aspects and provides discriminative textual patterns. Extensive experiments are conducted in two experimental scenarios, classification and retrieval, considering multiple well-known text collections. We also compare our model against several methods from the literature. Experimental results demonstrate that our model is generic enough to handle different tasks and collections. It is also more efficient than widely used state-of-the-art methods in textual classification and retrieval tasks, with competitive effectiveness and, in some cases, gains by large margins.
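To make the idea of encoding term proximity and term ordering in a graph more concrete, the following is a minimal sketch and not BoTG's actual pipeline, which the abstract does not detail. It assumes whitespace tokenization, a hypothetical sliding window of two positions, and a plain edge-count vector in place of whatever graph vocabulary the model really uses; the names `term_graph` and `to_vector` are illustrative only.

```python
from collections import Counter


def term_graph(tokens, window=2):
    """Build directed, weighted edges (a, b): a precedes b within `window` positions.

    Edge direction captures term ordering; the window captures term proximity.
    Both the window size and the edge weighting are assumptions for illustration.
    """
    edges = Counter()
    for i, a in enumerate(tokens):
        for b in tokens[i + 1 : i + 1 + window]:
            if a != b:
                edges[(a, b)] += 1
    return edges


def to_vector(edges, vocabulary):
    """Project edge counts onto a fixed, ordered edge vocabulary."""
    return [edges.get(edge, 0) for edge in vocabulary]


# Two documents with identical bags of words but different term order.
doc_a = "graph models encode term order".split()
doc_b = "term order graph models encode".split()

ga, gb = term_graph(doc_a), term_graph(doc_b)
vocab = sorted(set(ga) | set(gb))
va, vb = to_vector(ga, vocab), to_vector(gb, vocab)
```

The two documents are indistinguishable to a bag-of-words model, yet `va` and `vb` differ because the edge sets differ, which is the kind of extra signal a graph-based representation is meant to preserve.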
The data deduplication task has attracted considerable attention from the research community, which seeks effective and efficient solutions. The information provided by the user to tune the deduplication process is usually represented by a set of manually labeled pairs. In very large datasets, producing this kind of labeled set is a daunting task, since it requires an expert to select and label a large number of informative pairs. In this article, we propose a two-stage sampling selection strategy (T3S) that selects a reduced set of pairs to tune the deduplication process in large datasets. T3S selects the most representative pairs in two stages. In the first stage, we propose a strategy to produce balanced subsets of candidate pairs for labeling. In the second stage, an active selection is incrementally invoked to remove the redundant pairs in the subsets created in the first stage, producing an even smaller and more informative training set. This training set is used both to identify where the most ambiguous pairs lie and to configure the classification approaches. Our evaluation shows that T3S substantially reduces the labeling effort while achieving competitive or superior matching quality compared with state-of-the-art deduplication methods in large datasets.
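The two-stage structure described above can be illustrated with a rough sketch. This is not the T3S algorithm itself: the abstract does not specify how subsets are balanced or how redundancy is detected, so the code assumes that each candidate pair carries a similarity score in [0, 1], that "balanced" means drawing equally from similarity bins, and that a simple minimum-gap filter can stand in for the active selection step. The names `balanced_sample` and `drop_redundant` and all parameter values are hypothetical.

```python
import random


def balanced_sample(pairs, n_levels=5, per_level=3, seed=0):
    """Stage 1 (assumed): bin pairs by similarity and draw equally from each bin."""
    rng = random.Random(seed)
    levels = [[] for _ in range(n_levels)]
    for pair_id, sim in pairs:
        idx = min(int(sim * n_levels), n_levels - 1)
        levels[idx].append((pair_id, sim))
    sample = []
    for level in levels:
        sample.extend(rng.sample(level, min(per_level, len(level))))
    return sample


def drop_redundant(sample, min_gap=0.05):
    """Stage 2 (stand-in): keep a pair only if it is at least `min_gap` above the last kept pair."""
    kept = []
    for pair_id, sim in sorted(sample, key=lambda p: p[1]):
        if not kept or sim - kept[-1][1] >= min_gap:
            kept.append((pair_id, sim))
    return kept


# Synthetic candidate pairs: (pair id, similarity score).
rng = random.Random(42)
candidates = [(i, rng.random()) for i in range(1000)]

stage1 = balanced_sample(candidates)   # small, balanced subset
to_label = drop_redundant(stage1)      # even smaller set sent to the expert
```

In this sketch only the pairs in `to_label` would be shown to a human expert, which mirrors the goal of shrinking the labeling effort while still covering the full range of similarity values.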