2022
DOI: 10.3390/math10020277

Graph-Based Siamese Network for Authorship Verification

Abstract: In this work, we propose a novel approach to the authorship identification task in a cross-topic and open-set scenario. Authorship verification is the task of determining whether or not two texts were written by the same author. We model the documents as graphs, and a graph neural network then extracts relevant features from these graph representations. We present three strategies for representing the texts as graphs, based on the co-occurrence of the POS labels of words. We propose a Siamese Network…
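The abstract describes the pipeline only at a high level. As a minimal sketch of one plausible reading of the graph-construction step, the snippet below builds a POS co-occurrence graph with a sliding window; the window size, the spaCy tagger, and the Counter-based edge weights are all assumptions, and the paper's three concrete strategies are not reproduced here.

from collections import Counter

import spacy  # assumes an installed model: python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def pos_cooccurrence_graph(text, window=2):
    # Nodes are POS tags; edge weights count how often two tags
    # co-occur within `window` token positions of each other.
    tags = [tok.pos_ for tok in nlp(text) if not tok.is_space]
    edges = Counter()
    for i, tag in enumerate(tags):
        for j in range(i + 1, min(i + window + 1, len(tags))):
            edges[tuple(sorted((tag, tags[j])))] += 1
    return set(tags), edges

nodes, edges = pos_cooccurrence_graph("The quick brown fox jumps over the lazy dog.")
print(sorted(edges.items(), key=lambda kv: -kv[1])[:5])

In a Siamese setup, two such graphs (one per document) would be encoded by the same graph neural network and compared in embedding space.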

Cited by 10 publications (5 citation statements)
References 27 publications
“…One of the difficulties in comparing prior work is the use of different performance metrics. Some examples are accuracy (Altakrori et al., 2021; Stamatatos, 2018; Jafariakinabad and Hua, 2022; Fabien et al., 2020; Saedi and Dras, 2021; Zhang et al., 2018; Barlas and Stamatatos, 2020), F1 (Murauer and Specht, 2021), C@1 (Bagnall, 2015), recall (Lagutina, 2021), precision (Lagutina, 2021), macro-accuracy (Bischoff et al., 2020), AUC (Bagnall, 2015; Pratanwanich and Lio, 2014), R@8 (Rivera-Soto et al., 2021), and the unweighted average of F1, F0.5u, C@1, and AUC (Manolache et al., 2021; Kestemont et al., 2021; Tyo et al., 2021; Futrzynski, 2021; Peng et al., 2021; Bönninghoff et al., 2021; Boenninghoff et al., 2020; Embarcadero-Ruiz et al., 2022; Weerasinghe et al., 2021).…”
Section: Metrics (mentioning)
confidence: 99%
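Since C@1 and F0.5u are less standard than accuracy or AUC, here is a minimal sketch of both, following their published definitions as used in the PAN shared tasks: predictions are scores in [0, 1], and a score of exactly 0.5 marks a non-answer. This is an illustration, not the official PAN evaluator, which may differ in edge-case handling.

def c_at_1(true_y, pred_y, threshold=0.5):
    # Correct answers are rewarded; leaving a problem unanswered
    # (score exactly 0.5) is better than answering it wrongly.
    n = len(pred_y)
    nc = sum(1 for t, p in zip(true_y, pred_y)
             if p != threshold and (p > threshold) == (t > threshold))
    nu = sum(1 for p in pred_y if p == threshold)
    return (nc + nu * nc / n) / n

def f_05_u(true_y, pred_y, threshold=0.5):
    # F0.5 (precision-weighted) where non-answers count as false negatives.
    tp = sum(1 for t, p in zip(true_y, pred_y) if p > threshold and t == 1)
    fp = sum(1 for t, p in zip(true_y, pred_y) if p > threshold and t == 0)
    fn = sum(1 for t, p in zip(true_y, pred_y) if p < threshold and t == 1)
    nu = sum(1 for p in pred_y if p == threshold)
    return 1.25 * tp / (1.25 * tp + 0.25 * (fn + nu) + fp)

truth = [1, 1, 0, 0]
preds = [0.9, 0.5, 0.2, 0.5]  # 0.5 marks "don't know"
print(c_at_1(truth, preds), f_05_u(truth, preds))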
“…Table 6 shows the BigBird Cross-Encoder performance using these training datasets compared to the official results of the PAN20/21 challenge top participant systems (Bevendorff et al., 2021). These include hybrid neural-probabilistic, neural network-based, logistic regression, and graph-based Siamese network systems (Boenninghoff et al., 2020, 2021; Weerasinghe and Greenstadt, 2020; Embarcadero-Ruiz et al., 2021). Note here the systems submitted by the same team are not necessarily the same across PAN20 and PAN21 because some systems used for the PAN20 closed-set challenge relied on fandom information.…”
Section: BigBird (mentioning)
confidence: 99%
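For context on the comparison above, the following is a hedged sketch of what a BigBird cross-encoder for authorship verification looks like: both texts pass through one encoder jointly, so attention can compare them directly, and BigBird's sparse attention makes the long joint sequence affordable. The checkpoint name and label convention are assumptions, the classification head is untrained here (fine-tuning on verification pairs would be required), and the cited system's training setup is not reproduced.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "google/bigbird-roberta-base", num_labels=2)  # label 1 = same author (assumed)

def same_author_prob(text_a, text_b):
    # Encode the pair jointly, separated by special tokens, and read
    # the softmax probability of the "same author" class.
    inputs = tokenizer(text_a, text_b, truncation=True,
                       max_length=4096, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()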
“…The second column of Table 1 presents the number of words per language sub-collection, totaling 58,061,996 for these 7 languages, while the third column contains the number of tokens, totaling 73,692,461. For the purpose of this experiment, we produced four document representations for each novel, each in the form of vertical text: (1) words (the vertical original text of the novel), (2) lemmas (the vertical lemmatized text), (3) PoS tags (each token in the verticalized text is replaced by its PoS tag), and (4) masked text, where tokens with the PoS tags ADJ, NOUN, PROPN, ADV, VERB, AUX, NUM, SYM, and X are substituted with their PoS tag, tokens with the tags DET and PRON are substituted with their lemma, and the others (ADP, CCONJ, INTJ, PART, PUNCT, SCONJ) remain unchanged, as inspired by [35].…”
Section: Dataset (mentioning)
confidence: 99%
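The masking scheme in this quote is compact enough to pin down in code. Below is a minimal sketch of the four vertical-text representations, assuming UD-style tags and (word, lemma, PoS) triples from some tagger; the function and constant names are hypothetical.

MASK_WITH_POS = {"ADJ", "NOUN", "PROPN", "ADV", "VERB", "AUX", "NUM", "SYM", "X"}
MASK_WITH_LEMMA = {"DET", "PRON"}
# Remaining tags (ADP, CCONJ, INTJ, PART, PUNCT, SCONJ) stay unchanged.

def representations(tokens):
    # tokens: list of (word, lemma, pos) triples; returns the four vertical texts.
    masked = [p if p in MASK_WITH_POS else l if p in MASK_WITH_LEMMA else w
              for w, l, p in tokens]
    return {"word": [w for w, _, _ in tokens],
            "lemma": [l for _, l, _ in tokens],
            "pos": [p for _, _, p in tokens],
            "masked": masked}

toks = [("The", "the", "DET"), ("cats", "cat", "NOUN"),
        ("sleep", "sleep", "VERB"), (".", ".", "PUNCT")]
for name, seq in representations(toks).items():
    print(name, " ".join(seq))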
“…Recently, however, for some highly inflected languages, most frequent lemmas emerged as a better alternative to most frequent words [39]. The PoS tags and the document representation with masked words, where PoS labels are used to mask a predefined set of PoS classes, also achieved good results for specific problems [35]. In the evaluation of this experiment we used the following document representations as the secondary baseline methods: most frequent words, lemmas, PoS trigrams, and PoS-masked bigrams (D_word, D_lemma, D_pos, and D_masked).…”
Section: Baseline (mentioning)
confidence: 99%
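As a rough illustration of the secondary baselines named above, here is a short sketch of a most-frequent-unit profile (usable with words or lemmas) and PoS trigram extraction; the vocabulary size and the relative-frequency normalization are assumptions, not the cited setup.

from collections import Counter

def mfw_profile(units, vocab):
    # Relative frequency of each unit from a fixed most-frequent vocabulary.
    counts = Counter(units)
    total = max(len(units), 1)
    return [counts[u] / total for u in vocab]

def pos_trigrams(pos_tags):
    # Overlapping trigrams over the PoS-tag sequence of a document.
    return list(zip(pos_tags, pos_tags[1:], pos_tags[2:]))

corpus_units = ["the", "cat", "the", "dog", "a", "cat"]
vocab = [u for u, _ in Counter(corpus_units).most_common(3)]  # top-k units
print(mfw_profile(["the", "dog", "runs"], vocab))
print(pos_trigrams(["DET", "NOUN", "VERB", "PUNCT"]))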