2016
DOI: 10.1007/978-3-319-30671-1_29

Who Wrote the Web? Revisiting Influential Author Identification Research Applicable to Information Retrieval

Citations: cited by 31 publications (34 citation statements)
References: 25 publications
“…15 It works by representing each text as a vector of space-separated character n-gram counts, and comparing repeatedly sampled subvectors of known-author texts and snippets using cosine similarity. We use as a starting point the code from a reproducibility study [40], but have modified it to improve efficiency. (See Appendix B.2 for more details.…”
Section: Inference Mechanisms (mentioning)
confidence: 99%
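The mechanism quoted above (character n-gram count vectors, repeated sampling of subvectors, cosine similarity) can be illustrated with a minimal sketch. This is not the code of the citing paper or of the reproducibility study; the function names, the n-gram size of 4, the number of rounds, and the sampled fraction are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=4):
    """Count overlapping character n-grams in a text (n=4 is an assumed default)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b, features):
    """Cosine similarity of two count vectors restricted to `features`."""
    dot = sum(a[f] * b[f] for f in features)
    norm_a = math.sqrt(sum(a[f] ** 2 for f in features))
    norm_b = math.sqrt(sum(b[f] ** 2 for f in features))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sampled_scores(snippet, known_texts, n=4, rounds=100, fraction=0.5, seed=0):
    """Hypothetical helper: for each candidate author, return the share of
    rounds in which that author's known text was the most similar to the
    snippet on a random subset of the n-gram features."""
    rng = random.Random(seed)
    snippet_vec = char_ngrams(snippet, n)
    known_vecs = {a: char_ngrams(t, n) for a, t in known_texts.items()}
    vocab = sorted(set(snippet_vec).union(*known_vecs.values()))
    if not vocab:
        # Texts shorter than n characters yield no features to sample.
        return {a: 0.0 for a in known_vecs}
    k = max(1, int(len(vocab) * fraction))
    wins = Counter()
    for _ in range(rounds):
        features = rng.sample(vocab, k)  # one sampled subvector per round
        best = max(known_vecs,
                   key=lambda a: cosine(snippet_vec, known_vecs[a], features))
        wins[best] += 1
    return {a: wins[a] / rounds for a in known_vecs}
```

Calling `sampled_scores(snippet, {"author_a": text_a, "author_b": text_b})` returns one score per candidate; an attribution would typically be accepted only when the top score clears a threshold, which this sketch leaves to the caller.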
“…There are a few hyperparameters to the method (e.g. size of character n-grams, number of character n-grams in the feature vector); for the most part we use the hyperparameter settings of the replication we used as our starting point 17 [40], which were set on the basis of the empirical analysis of [27]. Only the minimum text length for training is changed to 400, given the length of our texts.…”
Section: B2 Inference Mechanism Implementation Details (mentioning)
confidence: 99%
“…The corpus included chat lines of potential pedophiles with the purpose of investigating the robustness of the best-performing systems also from this perspective (i.e., identifying the age of the pedophiles). Age classes included a gap in between: 10s (13-17), 20s (23-27), 30s (33-48). Results in both languages and in both subtasks were below 70% accuracy.…”
Section: Author Profiling (mentioning)
confidence: 99%
“…Author identification still poses a challenging empirical problem in fields related to information and computer science, but the underlying methods are nowadays also increasingly used as an auxiliary technology in more applied domains, such as literary studies or forensic linguistics. These communities crucially rely on trustworthy, transparent benchmark initiatives that reliably establish the state of the art in the field [17]. Author identification is concerned with the automated identification of the individual(s) who authored an anonymous document on the basis of text-internal properties related to language and writing style [9,12,27].…”
Section: Author Identification (mentioning)
confidence: 99%
“…These benchmarks have had a significant impact on the community. In a recent large-scale reproducibility study on authorship attribution, they were employed to reimplement and reproduce the 15 most influential approaches from the past two decades, evaluating them on the standardised datasets [26]. The study finds that some of the approaches proposed early on are still competitive with the most recent contributions.…”
Section: Digital Text Forensics For Identification (mentioning)
confidence: 99%