Accurate semantic representation models are essential in text mining applications. For the text mining process to succeed, the adopted text representation must preserve the patterns of interest to be discovered. Although the traditional bag-of-words model can achieve competitive results in automatic text classification, it cannot provide satisfactory classification performance in hard settings where richer text representations are required. In this paper, we present an approach to represent document collections based on embedded representations of words and word senses. We bring together the power of word sense disambiguation and the semantic richness of word and word-sense embedded vectors to construct embedded representations of document collections. Our approach results in semantically enhanced, low-dimensional representations. We use word-sense embedded vectors to overcome the lack of interpretability that is a drawback of embedded representations. Moreover, the experimental evaluation indicates that the proposed representations yield stable classifiers with strong quantitative results, especially in semantically complex classification scenarios.
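As a rough illustration of the idea of building a document representation from word-sense embedded vectors, the following minimal sketch averages the vectors of the sense-tagged tokens in a document. It is not the authors' method: the averaging strategy, the toy 3-dimensional vectors, the sense labels such as `bank%finance`, and the names `SENSE_VECTORS`, `embed_document`, and `cosine` are all assumptions made for this example, and the word sense disambiguation step is assumed to have already been applied.

```python
import math
from typing import Dict, List, Sequence

# Toy, hand-made vectors for illustration only (not trained embeddings).
# The sense labels are hypothetical outputs of a prior disambiguation step.
SENSE_VECTORS: Dict[str, List[float]] = {
    "bank%finance": [1.0, 0.0, 0.0],
    "bank%river": [0.0, 1.0, 0.0],
    "loan": [0.8, 0.0, 0.2],
    "water": [0.0, 0.9, 0.1],
}


def embed_document(tokens: Sequence[str],
                   vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the tokens found in `vectors`.

    Unknown tokens are skipped; a document with no known token
    maps to the zero vector.
    """
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two documents that share the surface word "bank" but in different senses.
finance_doc = embed_document(["bank%finance", "loan"], SENSE_VECTORS)
river_doc = embed_document(["bank%river", "water"], SENSE_VECTORS)
similarity = cosine(finance_doc, river_doc)
```

Because each dimension of the result is a mean over sense vectors, the document vector has the same low dimensionality as the sense vectors regardless of vocabulary size, and the two senses of "bank" keep the two toy documents far apart.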
I am grateful for all the opportunities I have had and to all the people I have met. I thank my family, especially my parents Eliane and Mario, my brothers Fabio and Guilherme, and my husband Nabil, for all their love, dedication, support, and understanding! I thank my advisor Solange Rezende, a friend and source of encouragement, for her academic, professional, and personal guidance, for her attention, and for always seeking to understand the individual characteristics of each of her students! I thank Professor Maria Carolina Monard and my friends at LABIC from my first period in research, for leaving a special mark on my life that made me want to return. And I thank all the new friends at LABIC whom I met in recent years, for the exchange of knowledge and experience, for the companionship, and for the relaxed conversations at coffee time. I am grateful for the kind help at the start of my doctorate, the partnerships, the joint work, and the revisions of
Text clustering is a text mining task often used to aid the organization, knowledge extraction, and exploratory search of text collections. Automatic text clustering has become essential as the volume and variety of digital text documents increase, whether in social networks and on the Web or inside organizations. This paper explores the use of named entities as privileged information in a hierarchical clustering process in order to improve cluster quality and interpretation. We carried out an experimental evaluation on three text collections (one written in Portuguese and two written in English), and the results show that named entities can be applied as privileged information to strengthen clustering solutions in dynamic text collection scenarios.
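To make the idea of letting named entities inform a hierarchical clustering more concrete, here is a deliberately simplified sketch. It is not the procedure evaluated in the paper: it merely blends a word-based distance with an entity-based distance and runs a plain average-linkage agglomerative clustering. The Jaccard distance, the blending weight `alpha`, the toy documents, and all function names are assumptions introduced for illustration.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Doc = Dict[str, Set[str]]
Cluster = Tuple[int, ...]


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |a & b| / |a | b|, with two empty sets at distance 0."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def combined_distance(doc_a: Doc, doc_b: Doc, alpha: float) -> float:
    """Blend word and entity distances; `alpha` weights the entity view."""
    d_words = jaccard_distance(doc_a["words"], doc_b["words"])
    d_entities = jaccard_distance(doc_a["entities"], doc_b["entities"])
    return (1.0 - alpha) * d_words + alpha * d_entities


def average_linkage(left: Cluster, right: Cluster,
                    docs: List[Doc], alpha: float) -> float:
    """Mean pairwise combined distance between two clusters."""
    dists = [combined_distance(docs[i], docs[j], alpha)
             for i in left for j in right]
    return sum(dists) / len(dists)


def agglomerate(docs: List[Doc],
                alpha: float = 0.5) -> List[Tuple[Cluster, Cluster]]:
    """Merge the two closest clusters until one remains; return the merges."""
    clusters: List[Cluster] = [(i,) for i in range(len(docs))]
    merges: List[Tuple[Cluster, Cluster]] = []
    while len(clusters) > 1:
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], docs, alpha),
        )
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
        merges.append((left, right))
    return merges


# Toy collection: documents 0 and 1 share both vocabulary and an entity.
docs: List[Doc] = [
    {"words": {"election", "vote", "results"}, "entities": {"Brazil"}},
    {"words": {"election", "poll", "results"}, "entities": {"Brazil"}},
    {"words": {"match", "goal", "results"}, "entities": {"FIFA"}},
]
merges = agglomerate(docs, alpha=0.5)
```

The merge history plays the role of a dendrogram: the earliest merges join the most similar documents. Setting `alpha` to 0 ignores entities entirely, which gives a simple baseline for judging how much the entity view changes the hierarchy.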