We present a model to perform authorship attribution of tweets using Convolutional Neural Networks (CNNs) over character n-grams. We also present a strategy that improves model interpretability by estimating the importance of input text fragments in the predicted classification. The experimental evaluation shows that text CNNs perform competitively and are able to outperform previous methods.
Character n-grams have been identified as the most successful feature in both single-domain and cross-domain Authorship Attribution (AA), but the reasons for their discriminative value were not fully understood. We identify subgroups of character n-grams that correspond to linguistic aspects commonly claimed to be covered by these features: morphosyntax, thematic content and style. We evaluate the predictiveness of each of these groups in two AA settings: a single-domain setting and a cross-domain setting where multiple topics are present. We demonstrate that character n-grams that capture information about affixes and punctuation account for almost all of the power of character n-grams as features. Our study contributes new insights into the use of n-grams for future AA work and other classification tasks.
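To make the idea of n-gram subgroups concrete, the following is a minimal sketch that sorts character trigrams into three coarse groups. The grouping rules, the function name `group_ngrams`, and the group labels are illustrative assumptions, not the definitions used in the paper: an n-gram containing punctuation is labeled `punct`, an n-gram that covers only the beginning or only the end of a longer word is labeled `affix`, and everything else is labeled `word`.

```python
import string

PUNCT = set(string.punctuation)


def group_ngrams(text, n=3):
    """Split the character n-grams of `text` into coarse groups.

    The rules below are simplified assumptions for illustration only.
    """
    groups = {"affix": [], "punct": [], "word": []}
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if any(ch in PUNCT for ch in gram):
            # Any n-gram that contains a punctuation mark.
            groups["punct"].append(gram)
            continue
        if gram.isalnum():
            # Does the n-gram touch a word boundary on either side?
            starts_word = i == 0 or not text[i - 1].isalnum()
            ends_word = i + n == len(text) or not text[i + n].isalnum()
            if starts_word != ends_word:
                # Covers only the start (prefix) or only the end (suffix)
                # of a word that is longer than n characters.
                groups["affix"].append(gram)
                continue
        # Whole words, mid-word n-grams, and n-grams spanning a space.
        groups["word"].append(gram)
    return groups


if __name__ == "__main__":
    for name, grams in group_ngrams("the actor, acting").items():
        print(name, grams)
```

Under these assumed rules, each group could be fed to a classifier separately to compare how much each contributes, which is the kind of comparison the abstract describes.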
With the increasing storage of images worldwide, automatic image annotation has become a very active and relevant research area. However, it still lacks a benchmark specifically designed for this task, and in particular for region-level annotation. In this report we introduce the segmented and annotated IAPR-TC12 benchmark, an extended resource for the evaluation of automatic image annotation (AIA) methods. We present a methodology for the manual segmentation and annotation of the images in this collection. The goal of this methodology is to obtain reliable ground truth data for benchmarking AIA and related tasks. For annotation, an ad-hoc vocabulary is defined and hierarchically organized. This hierarchy proved to be very useful for obtaining objective and structured annotations. Also, a soft measure for the evaluation of annotation performance is proposed, based on this hierarchy. Statistics on the segmentation and annotation processes give evidence of the reliability of the proposed approach. Visual attributes and spatial relations are also extracted from regions in segmented images. The latter feature will promote research on the use of (spatial) contextual information for AIA and image retrieval. The extended collection is publicly available and can be used to evaluate a variety of tasks besides image annotation. This resource can also serve to study the use of automatic annotations for multimedia image retrieval. This is a distinctive feature of the collection because, although there are several image annotation benchmarks, there is currently no collection that can be used to effectively evaluate the performance of annotation methods in the task they are designed for (i.e. image retrieval). We outline several applications and raise important questions that might be answered with the annotated collection, motivating research in the areas of image segmentation, annotation and retrieval, as well as in machine learning.
This paper elaborates on the benefits of using particle swarm model selection (PSMS) for building effective ensemble classification models. PSMS searches in a toolbox for the best combination of methods for preprocessing, feature selection and classification for generic binary classification tasks. Throughout the search process PSMS evaluates a wide variety of models, from which a single solution (i.e. the best classification model) is selected. Satisfactory results have been reported with the latter formulation in several domains. However, many models that are potentially useful for classification are discarded when the final model is selected. In this paper we propose to re-use such candidate models for building effective ensemble classifiers. We explore three simple formulations for building ensembles from intermediate PSMS solutions that require no computation beyond that of the traditional PSMS implementation. We report experimental results on benchmark data as well as on a data set from object recognition. Our results show that better models can be obtained with the ensemble version of PSMS, motivating further research on the combination of candidate PSMS models. Additionally, we analyze the diversity of the classification models, which is known to be an important factor for the construction of ensembles.
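The abstract does not spell out the three ensemble formulations, so the following is only a hedged sketch of one plausible way to re-use candidate models: keep the `k` candidates with the best validation score and combine their binary predictions by majority vote. The function name `ensemble_top_k`, the `(score, predictions)` input format, and the +1/-1 label encoding are assumptions made for illustration.

```python
def ensemble_top_k(candidates, k=3):
    """Majority vote over the k best-scoring candidate models.

    `candidates` is a list of (validation_score, predictions) pairs, where
    predictions is a list of +1/-1 labels for the same test samples.
    This format is a hypothetical stand-in for models evaluated during
    the search; no extra model training is performed here.
    """
    # Keep the k candidates with the highest validation score.
    top = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    n_samples = len(top[0][1])
    # Sum the +1/-1 votes per sample; ties (possible for even k) go to +1.
    votes = [sum(preds[i] for _, preds in top) for i in range(n_samples)]
    return [1 if v >= 0 else -1 for v in votes]


if __name__ == "__main__":
    evaluated = [
        (0.9, [1, -1, 1, 1]),
        (0.8, [1, 1, -1, 1]),
        (0.7, [-1, 1, -1, 1]),
        (0.2, [-1, -1, -1, -1]),
    ]
    print(ensemble_top_k(evaluated, k=3))
```

Because the predictions of every candidate are already produced during the search, a combination of this kind adds only the cost of sorting and voting, which is consistent with the abstract's claim that the ensembles need no computation beyond the standard search.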