2013
DOI: 10.1093/llc/fqt047

Comparative evaluation of term selection functions for authorship attribution

Abstract: Different computational models have been proposed to automatically determine the most probable author of a disputed text (authorship attribution). These models can be viewed as special approaches in the text categorization domain. In this perspective, in a first step we need to determine the most effective features (words, punctuation symbols, part-of-speech, bigram of words, etc.) to discriminate between different authors. To achieve this, we can consider different independent feature-scoring selection functi…

Cited by 43 publications (26 citation statements)
References 26 publications
“…Those terms are selected for the disputed text. For determining the value of k , previous studies have shown that a value between 200 and 300 tends to provide the best performance (Burrows, ; Savoy, ). This reduced number represents a huge difference compared to the 100,000 features used by Koppel and Winter () or compared to the features set size employed in the best systems employed in PAN CLEF 2014.…”
Section: Simple Verification Algorithm
Citation type: mentioning
confidence: 99%
“…This strategy has the drawback of ignoring some stylistic features such as POS distribution, complex sentence construction measures, or other type‐token ratios. On the other hand, simpler text representation approaches have the advantage of simplicity, have proven to be efficient (Burrows, ; Hoover, ; Savoy, ), and can be understood by the final user. After an attribution has been proposed by the system, the final user may require a justification (e.g., in a court decision).…”
Section: Simple Verification Algorithm
Citation type: mentioning
confidence: 99%
“…In the current study, we use only the k most frequent word‐types (with k = 300) to compute the intertextual distance to reflect the stylistic information. The relative frequencies of these very frequent terms, mainly composed of functional words, tend to represent the fingerprint of each particular author and have been found effective in various authorship attribution studies (Burrows, ; Juola, ; Zhao & Zobel, ; Savoy, ).…”
Section: Text Clustering Based On Stylistic Considerations
Citation type: mentioning
confidence: 99%
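The idea in the statement above (keep only the k most frequent word-types, then compare texts through their relative frequencies) can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the regex tokenizer, the function names, and the use of a plain Manhattan (L1) distance as a stand-in for the "intertextual distance" are all assumptions made here for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def top_k_word_types(texts, k=300):
    """Return the k most frequent word-types over all texts (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(k)]


def relative_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary term in one text."""
    tokens = tokenize(text)
    if not tokens:
        return [0.0] * len(vocabulary)
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def manhattan_distance(profile_a, profile_b):
    """L1 distance between two frequency profiles (assumed stand-in distance)."""
    return sum(abs(a - b) for a, b in zip(profile_a, profile_b))


if __name__ == "__main__":
    corpus = ["the cat and the dog", "the bird and a fish"]
    vocabulary = top_k_word_types(corpus, k=2)
    profiles = [relative_frequencies(text, vocabulary) for text in corpus]
    print(vocabulary, manhattan_distance(profiles[0], profiles[1]))
```

With k in the low hundreds, the vocabulary is dominated by function words, which is the property the quoted passage relies on.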
“…This limit of 200 seems subjective. A recent study (Savoy, ) shows, however, that considering between 200 to 500 most frequent terms tends to produce the highest performance levels. Moreover, in the current case, some of the epistles are rather short (and some have fewer than 200 distinct words, for example, Philemon , 2 John ).…”
Section: Authorship Attribution and Clustering Experiments
Citation type: mentioning
confidence: 98%
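The constraint mentioned in the last statement, that some texts contain fewer distinct words than the chosen limit, is easy to check before fixing k. The snippet below is only a diagnostic sketch under assumptions: the tokenizer and the function name `texts_below_limit` are hypothetical, and the statement does not say how the citing authors handled such short texts.

```python
import re


def distinct_word_types(text):
    """Set of distinct lowercase word tokens in a text (simplifying assumption)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def texts_below_limit(named_texts, k=200):
    """Map each text name to its distinct word count when that count is below k."""
    below = {}
    for name, text in named_texts.items():
        n_types = len(distinct_word_types(text))
        if n_types < k:
            below[name] = n_types
    return below


if __name__ == "__main__":
    # Toy placeholder texts, not real data.
    sample = {"short": "a b c a", "long": "a b c d e f"}
    print(texts_below_limit(sample, k=4))
```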