Automated Authorship Attribution Using Advanced Signal Classification Techniques

Ebrahimpour, Maryam; Putniņš, Tālis J.; Berryman, Matthew J.; Allison, Andrew; Ng, Brian W.-H.; Abbott, Derek

doi:10.1371/journal.pone.0054998

Cited by 28 publications

(20 citation statements)

References 28 publications

(25 reference statements)


“…More specifically, we show that the symmetry of specific words is able to identify the writing style of distinct authors. In the context of information sciences, the authorship recognition task is relevant because it can be useful to classify literary manuscripts [28] and intercept terrorist messages [29]. Traditional features employed for stylometric analysis include simple statistics such as the average length and frequency of words [30], richness of vocabulary size [30] and burstiness indexes [7].…”

Section: Pattern Recognition Methods (mentioning)

confidence: 99%

Concentric network symmetry grasps authors' styles in word adjacency networks

Amancio, Silva, Costa

2015

EPL


Several characteristics of written texts have been inferred from statistical analysis derived from networked models. Even though many network measurements have been adapted to study textual properties at several levels of complexity, some textual aspects have been disregarded. In this paper, we study the symmetry of word adjacency networks, a well-known representation of text as a graph. A statistical analysis of the symmetry distribution performed in several novels showed that most of the words do not display symmetric patterns of connectivity. More specifically, the merged symmetry displayed a distribution similar to the ubiquitous power-law distribution. Our experiments also revealed that the studied metrics do not correlate with other traditional network measurements, such as the degree or betweenness centrality. The effectiveness of the symmetry measurements was verified in the authorship attribution task. Interestingly, we found that specific authors prefer particular types of symmetric motifs. As a consequence, the authorship of books could be accurately identified in 82.5% of the cases, in a dataset comprising books written by 8 authors. Because the proposed measurements for text analysis are complementary to the traditional approach, they can be used to improve the characterization of text networks, which might be useful for applications such as identification of topical words and information retrieval.
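The word adjacency representation mentioned in this abstract can be sketched as follows. This is a minimal illustration, assuming lowercase alphabetic tokens and no stopword removal or lemmatization; the function names are hypothetical, and the sketch computes only node degree, not the symmetry measurements the paper proposes.

```python
import re
from collections import defaultdict


def word_adjacency_network(text):
    """Build a directed graph linking each word to the words that follow it."""
    # Assumption: simple lowercase tokens; published studies often also
    # remove stopwords and lemmatize, which this sketch skips.
    words = re.findall(r"[a-z']+", text.lower())
    edges = defaultdict(set)
    for left, right in zip(words, words[1:]):
        if left != right:  # ignore immediate repetitions (self-loops)
            edges[left].add(right)
    return edges


def degrees(edges):
    """Count distinct neighbours (incoming or outgoing) of every word."""
    neighbours = defaultdict(set)
    for left, rights in edges.items():
        for right in rights:
            neighbours[left].add(right)
            neighbours[right].add(left)
    return {word: len(adjacent) for word, adjacent in neighbours.items()}


network = word_adjacency_network("the cat saw the dog and the dog saw the cat")
degree = degrees(network)
```

Degree is one of the traditional measurements the abstract contrasts with symmetry; any symmetry metric would be computed on top of a graph like `network`.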


“…Previous research on individual differences in word choice has focused on written text and function words (e.g., Ebrahimpour et al., 2013; Koppel et al., 2009; Stamatatos, 2009). Content words, such as table and sleeping, and word combinations, such as old tree, are very context dependent.…”

Section: Introduction (mentioning)

confidence: 99%

Choice and pronunciation of words: Individual differences within a homogeneous group of speakers

Hanique, Ernestus, Boves

2015

Corpus Linguistics and Linguistic Theory


This paper investigates whether individual speakers forming a homogeneous group differ in their choice and pronunciation of words when engaged in casual conversation, and if so, how they differ. More specifically, it examines whether the Balanced Winnow classifier is able to distinguish between the twenty speakers of the Ernestus Corpus of Spontaneous Dutch, who all have the same social background. To examine differences in choice and pronunciation of words, instead of characteristics of the speech signal itself, classification was based on lexical and pronunciation features extracted from hand-made orthographic and automatically generated broad phonetic transcriptions. The lexical features consisted of words and two-word combinations. The pronunciation features represented pronunciation variations at the word and phone level that are typical for casual speech. The best classifier achieved a performance of 79.9% and was based on the lexical features and on the pronunciation features representing single phones and triphones. The speakers must thus differ from each other in these features. Inspection of the relevant features indicated that, among other things, the words relevant for classification generally do not contain much semantic content, and that speakers differ not only from each other in the use of these words but also in their pronunciation.
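The lexical features described in this abstract (words and two-word combinations) could be counted roughly as below. This is a sketch only: the function name and the English sample tokens are hypothetical, and the paper's own feature extraction from Dutch transcriptions and its Balanced Winnow classifier are not reproduced here.

```python
from collections import Counter


def lexical_features(tokens):
    """Count single words and adjacent two-word combinations."""
    unigrams = Counter(tokens)
    # Pair each token with its successor to form two-word combinations.
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


# Hypothetical transcript fragment for one speaker.
tokens = "well you know I mean you know".split()
unigrams, bigrams = lexical_features(tokens)
```

Per-speaker counts like these would then be passed to a classifier as feature vectors.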


“…The approaches to authorship identification can combine accumulated knowledge from the theory of image recognition, mathematical statistics and probability theory, neural networks, cluster analysis, Markov chains, and others [6][7][8][9][10][11]. Paper [6] studies the state of the problem today; it is noted that if there are texts by 3-4 authors in the training and testing samples, trained classifiers confidently demonstrate up to 85 % of the accuracy of identification of authorship of a text in the test sample.…”

Section: Literature Review and Problem Statement (mentioning)

confidence: 99%

Identification of authorship of Ukrainian-language texts of journalistic style using neural networks

Lupei, Міца, Repariuk, et al.

2020

EEJET


This paper studies the problem of developing an effective method for identifying the authorship of texts, using publications by well-known Ukrainian journalists as material. Most existing methods require text preprocessing, which adds cost to solving the task. When the number of candidate authors can be kept small, such an approach is often excessive. Another drawback of existing approaches is that the vast majority of them were applied to texts in other languages and did not take the specifics of the Ukrainian language into account. It was therefore decided to develop an approach that identifies the author of a Ukrainian-language text without preprocessing and with high accuracy, and to establish which types of artificial neural networks give the minimum error for Ukrainian publicists. The developed method uses a feedforward multilayer perceptron, a supervised learning algorithm, HashingVectoriser vectorization, and the Adam optimizer. It was found that with a small number of training iterations (4-5), the artificial neural network achieves fairly high accuracy in identifying the authorship of journalistic texts and a fairly small error. More than 1000 text fragments by three Ukrainian authors were used. The experiments established that applying the developed approach to the task makes it possible to achieve fairly high results. For texts containing at least 500 characters, accuracy reaches 91 %, while the maximum number of training iterations of the artificial neural network does not exceed 15. These results were achieved primarily through an effective choice of the vectorization method at the preparatory stage and of the structure of the artificial neural network. Keywords: authorship identification, text analysis, artificial neural networks, multilayer perceptron, text vectorization
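The vectorization step named in this abstract (HashingVectoriser) relies on the hashing trick, which maps raw text to a fixed-length vector without building a vocabulary first. The sketch below illustrates the idea with the standard library only. It is an assumption-laden stand-in: the function name and `n_features` value are hypothetical, CRC32 is swapped in as the hash, and this is neither the authors' code nor scikit-learn's implementation.

```python
import re
import zlib


def hash_vectorize(text, n_features=16):
    """Map raw text to a fixed-length count vector via the hashing trick."""
    vector = [0] * n_features
    # \w+ matches Unicode word characters, so Cyrillic text needs no
    # language-specific preprocessing.
    for token in re.findall(r"\w+", text.lower()):
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector


vec = hash_vectorize("слово за словом, слово в слово")
```

Vectors of this kind would then serve as input to a classifier such as the multilayer perceptron the abstract describes; distinct tokens can collide in the same slot, which is the usual trade-off of feature hashing.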
