2016
DOI: 10.1609/aaai.v30i1.9924

Authorship Attribution Using a Neural Network Language Model

Abstract: In practice, training language models for individual authors is often expensive because of limited data resources. In such cases, Neural Network Language Models (NNLMs) generally outperform the traditional non-parametric N-gram models. Here we investigate the performance of a feed-forward NNLM on an authorship attribution problem with a moderate author set size and relatively limited data. We also consider how text topics impact performance. Compared with a well-constructed N-gram baseline method with Knes…
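For context, here is a minimal sketch of the kind of feed-forward (Bengio-style) NNLM the abstract describes, with attribution by per-author perplexity. The architecture, hyperparameters, and decision rule below are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of a Bengio-style feed-forward NNLM for authorship
# attribution: train one model per author, then attribute a test text to
# the author whose model assigns it the lowest perplexity. Architecture
# and hyperparameters are illustrative, not the paper's configuration.
import math

import torch
import torch.nn as nn

CONTEXT = 3  # three preceding words -> a 4-gram model


class FeedForwardNNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(CONTEXT * embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, ctx):                 # ctx: (batch, CONTEXT)
        e = self.embed(ctx).flatten(1)      # (batch, CONTEXT * embed_dim)
        return self.ff(e)                   # unnormalized next-word logits


def perplexity(model, token_ids):
    """Per-token perplexity of a tokenized text under one author's model."""
    loss_fn = nn.CrossEntropyLoss(reduction="sum")
    total, count = 0.0, 0
    with torch.no_grad():
        for i in range(CONTEXT, len(token_ids)):
            ctx = torch.tensor([token_ids[i - CONTEXT:i]])
            target = torch.tensor([token_ids[i]])
            total += loss_fn(model(ctx), target).item()
            count += 1
    return math.exp(total / max(count, 1))


def attribute(author_models, token_ids):
    """author_models: dict of author name -> trained FeedForwardNNLM."""
    return min(author_models, key=lambda a: perplexity(author_models[a], token_ids))
```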


Cited by 16 publications (6 citation statements). References 5 publications.
“…It was brought out that several features show temporal changes, increasing or decreasing over time, and that the language model for each author may differ. Also, Ge et al. [40] explored language models using Neural Network Language Models (NNLMs) and compared their performance with n-gram models (i.e., 4-gram). The NNLM-based work achieves promising results compared with the N-gram models.…”
Section: Language Models (citation type: mentioning, confidence: 99%)
“…The results indicated a 97% success rate with the Levenberg-Marquardt-based classifier. Ge et al. [40] used a feedforward neural network to create a lightweight language model that performed better than the baseline n-gram method on a limited dataset. Shrestha et al. [118] presented a new model using Convolutional Neural Networks (CNNs), which focused on the authorship attribution of short texts.…”
Section: Deep Learning Models (citation type: mentioning, confidence: 99%)
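For the short-text CNN approach mentioned in that statement, a minimal character-level classifier might look like the sketch below; the layer sizes, filter widths, and pooling scheme are assumptions, not the published architecture.

```python
# Hypothetical character-level CNN for short-text authorship attribution,
# in the spirit of the model mentioned above; layer sizes, filter widths,
# and pooling are assumptions, not the published architecture.
import torch
import torch.nn as nn


class CharCNN(nn.Module):
    def __init__(self, n_chars, n_authors, embed_dim=32, n_filters=64):
        super().__init__()
        self.embed = nn.Embedding(n_chars, embed_dim)
        # Parallel convolutions over character windows of width 3, 4, and 5.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, kernel_size=k) for k in (3, 4, 5)
        )
        self.out = nn.Linear(3 * n_filters, n_authors)

    def forward(self, char_ids):                  # char_ids: (batch, seq_len)
        e = self.embed(char_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Max-over-time pooling on each convolution's feature maps.
        pooled = [conv(e).relu().max(dim=2).values for conv in self.convs]
        return self.out(torch.cat(pooled, dim=1))  # author logits
```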
“…In order to improve accuracy, researchers are performing experiments on a variety of languages, leveraging diverse data sets, and presenting results with differing degrees of complexity. Ge et al. [17] conducted forensic analysis on a vast Urdu corpus, which was tested with Latent Dirichlet Allocation (LDA) and cosine similarity to detect textual similarity.…”
Section: Literature Review (citation type: mentioning, confidence: 99%)
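A minimal sketch of the LDA-plus-cosine-similarity pipeline that statement describes, using scikit-learn; the corpus, topic count, and vectorizer settings are placeholder assumptions.

```python
# Hypothetical sketch of the LDA + cosine-similarity pipeline mentioned
# above: represent each document by its topic distribution, then compare
# documents in topic space. Corpus and parameters are placeholders.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "first questioned document text goes here",
    "second questioned document text goes here",
]  # placeholder corpus

counts = CountVectorizer().fit_transform(docs)            # bag-of-words counts
lda = LatentDirichletAllocation(n_components=10, random_state=0)
topic_dist = lda.fit_transform(counts)                    # (n_docs, n_topics)

# Cosine similarity between the two documents' topic distributions.
sim = cosine_similarity(topic_dist[0:1], topic_dist[1:2])[0, 0]
print(f"topic-space similarity: {sim:.3f}")
```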
“…The best performing technique was a character-level CNN, which was compared to other conventional approaches. In addition, a feedforward neural network language model was applied by Ge et al. [51] to train an attribution classifier on a dataset with only a small amount of data. In comparison to n-gram baselines, the model trained a representation for each word using a 4-gram window and achieved an accuracy of 95%.…”
Section: Related Studies (citation type: mentioning, confidence: 99%)
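As a companion to the NNLM sketch earlier, here is a minimal Kneser-Ney 4-gram baseline of the kind these comparisons reference, using NLTK's nltk.lm module; the lowest-perplexity attribution rule is the same assumption as before, not a procedure the cited papers specify.

```python
# Hypothetical Kneser-Ney 4-gram baseline of the kind these comparisons
# reference, via NLTK; the lowest-perplexity attribution rule is the same
# assumption as in the NNLM sketch above.
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.util import ngrams

ORDER = 4  # 4-gram model


def train_kn_model(sentences):
    """sentences: list of token lists from one author's training text."""
    train_grams, vocab = padded_everygram_pipeline(ORDER, sentences)
    model = KneserNeyInterpolated(ORDER)
    model.fit(train_grams, vocab)
    return model


def kn_perplexity(model, tokens):
    """Perplexity of a held-out token sequence; lower = better author fit."""
    padded = list(pad_both_ends(tokens, n=ORDER))
    return model.perplexity(ngrams(padded, ORDER))
```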