Abstract: Most studies on authorship identification report a drop in identification accuracy when the number of authors exceeds 20-25. In this paper, we introduce a new user representation to address this problem and split classification across two layers. The paper makes at least three novel contributions. First, the two-layer approach allows authorship identification to be applied to a larger number of authors (tested with 100 authors), and it is extendable. The authors are divided into groups, each containing a smaller number of authors. Given an anonymous document, the primary layer detects the group to which the document belongs. Then, the secondary layer determines the particular author within the selected group. To extract groups that link similar authors, clustering is applied over users rather than documents. Hence, the second novelty of this paper is a new user representation that differs from the document representation. Without the proposed user representation, clustering over documents would spread an author's documents across several clusters, instead of assigning each author to a single cluster. Third, the extracted clusters are descriptive and meaningful for their users, as the dimensions have psychological backgrounds. For authorship identification, the documents are labelled with the extracted groups and fed into machine learning to build classification models that predict the group and author of a given document. The results show that the documents are highly correlated with their corresponding extracted groups, and that the proposed model can be accurately trained to determine the group and the author identity.
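The two-layer idea described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not name the features, the clustering algorithm, or the classifiers, so the sketch assumes numeric feature vectors, takes an author's profile to be the mean of that author's document vectors, uses a tiny k-means for grouping, and substitutes nearest-centroid classification for both layers. All function names are hypothetical.

```python
from collections import defaultdict
from math import dist


def mean_vector(vectors):
    """Component-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def build_author_profiles(docs):
    """docs is a list of (author, feature_vector) pairs.

    Assumption: one profile per author, taken as the mean of that
    author's document vectors.
    """
    by_author = defaultdict(list)
    for author, vec in docs:
        by_author[author].append(vec)
    return {a: mean_vector(vs) for a, vs in by_author.items()}


def cluster_authors(profiles, k, iterations=10):
    """Tiny k-means over author profiles; returns {author: group_id}.

    Clustering users (not documents) gives each author exactly one group.
    """
    authors = sorted(profiles)
    centroids = [profiles[a] for a in authors[:k]]  # deterministic seeding
    assignment = {}
    for _ in range(iterations):
        assignment = {
            a: min(range(k), key=lambda g: dist(profiles[a], centroids[g]))
            for a in authors
        }
        for g in range(k):
            members = [profiles[a] for a in authors if assignment[a] == g]
            if members:  # keep the old centroid if a group is empty
                centroids[g] = mean_vector(members)
    return assignment


def train_two_layer(docs, k):
    """Build a group-level model (layer 1) and per-group author models (layer 2)."""
    profiles = build_author_profiles(docs)
    groups = cluster_authors(profiles, k)

    # Layer 1: documents are labelled with their author's group.
    group_docs = defaultdict(list)
    for author, vec in docs:
        group_docs[groups[author]].append(vec)
    group_centroids = {g: mean_vector(vs) for g, vs in group_docs.items()}

    # Layer 2: inside each group, one centroid per author.
    authors_in_group = defaultdict(dict)
    for author, profile in profiles.items():
        authors_in_group[groups[author]][author] = profile
    return group_centroids, dict(authors_in_group)


def predict(model, vec):
    """Return (group, author) for one document feature vector."""
    group_centroids, authors_in_group = model
    group = min(group_centroids, key=lambda g: dist(vec, group_centroids[g]))
    candidates = authors_in_group[group]
    author = min(candidates, key=lambda a: dist(vec, candidates[a]))
    return group, author
```

With such a model, `predict(train_two_layer(docs, k), vec)` first narrows the search to one group and only then compares the document against the authors in that group, so the second layer never has to discriminate among all authors at once.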
This paper introduces a two-layered framework that improves authorship identification results for larger samples of bloggers compared with earlier work. Previous studies fall mainly into two categories: profile-based and instance-based methods. Each of these approaches has its advantages and limitations. The two-layered framework presented here integrates the two approaches and offers a new solution to a key problem in authorship identification, namely the drop in accuracy as the number of authors increases. The paper begins by illustrating the regular instance-based core model and the investigated features. It then introduces a new psycholinguistic profile representation of authors, presents similarity grouping extraction over profiles, and applies blogger identification using the two-layered approach. The results confirm the improvement of the proposed two-layered approach over our regular classifier, as well as over a selected baseline, for an extended number of users.
PsychoNet 1 demonstrated the feasibility of integrating a psycholinguistic taxonomy, represented by LIWC, with its semantic textual representation in the form of a commonsense ontology, represented by ConceptNet. However, PsychoNet 1 has various limitations, including the lack of a concluding context for the concept annotation. In this paper, we address most of those limitations and introduce a new, enhanced and enriched version, PsychoNet 2. PsychoNet 2 utilizes WordNet, in addition to LIWC and ConceptNet, to produce an integrated contextualized psycholinguistic ontology. The first and main contribution is that, in PsychoNet 2, each concept is annotated with the potential (most representative) contextual psycholinguistic categories, rather than with all applicable categories. The second contribution is the enrichment of LIWC through WordNet. This produced an enriched version of LIWC that may also be used independently in other applications. It also substantially enriched PsychoNet 2, as it made it possible to include additional concepts that were missing from PsychoNet 1 because the original LIWC lacked the corresponding words. A sample text classification application, for a mood prediction task, is presented to demonstrate the introduced enhancements. The results confirm the improved performance of PsychoNet 2 over PsychoNet 1.
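The difference between annotating a concept with all applicable categories and with only the most representative ones can be illustrated with a small Python sketch. It is a toy illustration under loud assumptions, not the PsychoNet 2 procedure: the abstract does not say how representative categories are selected (it mentions WordNet-based context), so the sketch simply keeps the categories shared by the most words in the concept. The lexicon entries and function names are invented for the example and are not actual LIWC or ConceptNet data.

```python
from collections import Counter

# Toy word -> categories lexicon. These entries are invented for
# illustration and are NOT actual LIWC dictionary entries.
TOY_LEXICON = {
    "happy": {"affect", "posemo"},
    "birthday": {"social", "posemo"},
    "party": {"social", "leisure"},
}


def all_categories(concept):
    """Union of every category applicable to any word in the concept."""
    categories = set()
    for word in concept.lower().split():
        categories |= TOY_LEXICON.get(word, set())
    return categories


def representative_categories(concept):
    """Keep only the categories supported by the most words.

    Assumption: a simple vote across the concept's words stands in for
    the contextual selection that the abstract attributes to PsychoNet 2.
    """
    counts = Counter()
    for word in concept.lower().split():
        counts.update(TOY_LEXICON.get(word, set()))
    if not counts:
        return set()
    top = max(counts.values())
    return {category for category, n in counts.items() if n == top}
```

For a multi-word concept, `all_categories` returns every category touched by any word, while `representative_categories` returns the narrower subset that most of the words agree on.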
Abstract: Ontologies have been widely accepted as the most advanced knowledge representation model. This paper introduces PsychoNet, a new knowledge base that links the psycholinguistic taxonomy in LIWC with its semantic textual representation in the form of a commonsense semantic ontology, represented by ConceptNet. The integration of LIWC and ConceptNet, together with the added functionality, facilitates the use of ConceptNet in psycholinguistic studies. Furthermore, it simplifies the use of ConceptNet's huge network for a specific multimedia application based on one or more key categories from LIWC, such as visual or biological applications. PsychoNet adds a new layer of complementary psycholinguistic functions to the original semantic network. Moreover, learning, whether clustering or classification, is more applicable in the developed ontology. The paper presents a sample text classification application for a mood prediction task. The results confirm the validity of the proposed network, as PsychoNet outperforms LIWC in mood prediction.