2013
DOI: 10.1371/journal.pone.0066813

Language Individuation and Marker Words: Shakespeare and His Maxwell's Demon

Abstract: Background: Within the structural and grammatical bounds of a common language, all authors develop their own distinctive writing styles. Whether the relative occurrence of common words can be measured to produce accurate models of authorship is of particular interest. This work introduces a new score that helps to highlight such variations in word occurrence, and is applied to produce models of authorship of a large group of plays from the Shakespearean era. Methodology: A text corpus containing 55,055 unique words…

Cited by 16 publications (16 citation statements). References 28 publications.
“…In order to highlight the individual and most clearly identifiable characteristics of each cluster generated by the MST-kNN agglomerative algorithm, a new score introduced in the areas of Computational Linguistics [19] and Bioinformatics [20] was computed. In other words, the CM1 score is used in this study to find the most salient features for each cluster.…”
Section: Methods (mentioning)
confidence: 99%
“…This difference is moderated by the range of values observed in the members of all the other clusters, Y (which has a greater set of samples), rather than by the combined standard deviation of the specific cluster, X, and all the other clusters, Y, together. For specific details of this score we refer to the study published by Marsden et al. [19].…”
Section: Methods (mentioning)
confidence: 99%
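The quoted passage only describes the CM1 score verbally: the difference between a cluster's mean feature value and that of all remaining samples, divided by the range observed in the remaining samples rather than by a pooled standard deviation. A minimal Python sketch of that reading follows; the function name `cm1_score` and the data layout are illustrative assumptions, not the authors' implementation, which is detailed in Marsden et al. [19].

```python
import numpy as np

def cm1_score(x_values, y_values):
    """Sketch of the CM1 marker score as described in the quoted passage.

    x_values: feature values (e.g., relative word frequencies) for the
              cluster of interest, X.
    y_values: feature values for all remaining samples, Y.

    The difference of means is divided by the range observed in Y,
    rather than by a pooled standard deviation.
    """
    x = np.asarray(x_values, dtype=float)
    y = np.asarray(y_values, dtype=float)
    y_range = y.max() - y.min()
    if y_range == 0:
        return 0.0  # no variation in Y; score treated as 0 in this sketch
    return (x.mean() - y.mean()) / y_range

# Example: relative frequencies of one word in cluster X versus the rest (Y);
# a large absolute score flags the word as a salient marker for X.
print(cm1_score([0.012, 0.015, 0.011], [0.004, 0.006, 0.005, 0.007]))
```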
“…We compared the performance of the most successful setup of FSPMA (MA3 algorithm with MineMink and MST) with nine univariate and multivariate feature selection methods. Among the univariate supervised methods chosen are chi-square feature selection (Chi), information gain ratio (GainRatio), information gain (IG), ReliefF, symmetrical uncertainty (SU) from WEKA, and the CM1 score. Among the multivariate supervised methods, we selected CFS, consistency subset feature selection (Consistency) from WEKA, and the (α, β)-k-feature set.…”
Section: Computational Results (mentioning)
confidence: 99%
“…Among the univariate supervised methods chosen are chi-square feature selection (Chi), information gain ratio (GainRatio), information gain (IG), ReliefF, symmetrical uncertainty (SU) from WEKA, and the CM1 score [39]. Among the multivariate supervised methods, we selected CFS, consistency subset feature selection (Consistency) from WEKA, and the (α, β)-k-feature set [40]. We implemented the (α, β)-k-feature set and CM1 score methods, and for all other methods, we adopted the WEKA implementation using their default configurations.…”
Section: Comparison With Other Feature Selection Methods (mentioning)
confidence: 99%
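The comparison quoted above was run with WEKA implementations and the authors' own code. As a rough, hedged analogue only, the sketch below ranks features with two univariate scores, chi-square and a mutual-information estimate standing in for information gain, using scikit-learn; the synthetic data, class labels, and library choice are assumptions for illustration, not the citing study's setup.

```python
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(100, 6)).astype(float)  # e.g., word counts per document
y = rng.integers(0, 2, size=100)                       # two authorship classes

chi2_scores, _ = chi2(X, y)            # analogue of WEKA's Chi
ig_scores = mutual_info_classif(X, y)  # rough analogue of information gain

# Each method produces its own ranking of the same feature set,
# which is how such univariate selectors are typically compared.
for name, scores in [("chi-square", chi2_scores), ("mutual info", ig_scores)]:
    ranking = np.argsort(scores)[::-1]
    print(name, "ranking:", ranking)
```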
“…Researchers have debated the merits of culling word lists according to various rules as opposed to using all the words within a given category [9], [17], [18]. In previous research, we demonstrated that such considerations can potentially reflect the authors' individuality and style [19]. In contrast, this research is carried out considering all the words, including stop words.…”
Section: Introduction (mentioning)
confidence: 95%