In authorship attribution one assigns texts from an unknown author to one of two or more candidate authors by comparing the disputed texts with texts known to have been written by the candidate authors. In authorship verification one decides whether a text or a set of texts could have been written by a given author. These two problems are usually treated separately. By assuming an open-set classification framework for the attribution problem, which contemplates the possibility that none of the candidate authors is the unknown author, the verification problem becomes a special case of the attribution problem. Here both problems are posed as a formal Bayesian multinomial model selection problem and are given a closed-form solution that is tailored for categorical data, naturally incorporates text length and dependence in the analysis, and copes well with settings with a small number of training texts. The approach to authorship verification is illustrated by exploring whether a court ruling sentence could have been written by the judge who signs it, and the approach to authorship attribution is illustrated by revisiting the authorship attribution of the Federalist papers and through a small simulation study.
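The open-set idea can be sketched with a conjugate Dirichlet-multinomial model, which is one common way of obtaining closed-form posterior model probabilities for categorical counts. This is only an illustration under stated assumptions, not the model of the paper: the symmetric Dirichlet prior, the equal prior probabilities for all hypotheses, and the names `log_dirichlet_multinomial`, `attribution_posterior` and `prior_strength` are all hypothetical choices made here.

```python
from math import exp, lgamma

NONE = "none of the candidates"  # label for the open-set hypothesis (assumed name)


def log_dirichlet_multinomial(counts, alpha):
    """Log marginal probability of `counts` when the multinomial
    probabilities follow a Dirichlet(alpha) distribution."""
    n = sum(counts)
    a0 = sum(alpha)
    out = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)  # multinomial coefficient
    out += lgamma(a0) - lgamma(a0 + n)
    out += sum(lgamma(a + c) - lgamma(a) for a, c in zip(alpha, counts))
    return out


def attribution_posterior(disputed, training, prior_strength=1.0):
    """Posterior probability of each hypothesis about the disputed counts.

    `training` maps each candidate author to the category counts of the
    texts known to be theirs. Assumes equal prior probabilities for all
    hypotheses, including the open-set one.
    """
    base = [prior_strength] * len(disputed)
    log_ml = {}
    for author, counts in training.items():
        # Predictive distribution after updating the prior with the known texts.
        updated = [a + c for a, c in zip(base, counts)]
        log_ml[author] = log_dirichlet_multinomial(disputed, updated)
    # Open-set hypothesis: the disputed text comes from an unseen author,
    # so only the prior informs the category probabilities.
    log_ml[NONE] = log_dirichlet_multinomial(disputed, base)
    top = max(log_ml.values())
    weights = {h: exp(v - top) for h, v in log_ml.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


if __name__ == "__main__":
    known = {"A": [50, 30, 20], "B": [20, 30, 50]}
    print(attribution_posterior([26, 14, 10], known))
```

Passing a single candidate in `training` turns the same function into a verification check, since the only alternatives left are that author and the open-set hypothesis.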
The zero-truncated inverse Gaussian-Poisson model, obtained by first mixing the Poisson model assuming its expected value has an inverse Gaussian distribution and then truncating the model at zero, is very useful when modelling frequency count data. A Bayesian analysis based on this statistical model is implemented on the word frequency counts of various texts, and its validity is checked by exploring the posterior distribution of the Pearson errors and by implementing posterior predictive consistency checks. The analysis based on this model is useful because it allows one to use the posterior distribution of the model mixing density as an approximation of the posterior distribution of the density of the word frequencies of the vocabulary of the author, which is useful to characterize the style of that author. The posterior distribution of the expectation and of measures of the variability of that mixing distribution can be used to assess the size and diversity of the author's vocabulary. An alternative analysis is proposed based on the inverse Gaussian-zero truncated Poisson mixture model, which is obtained by switching the order of the mixing and the truncation stages. Even though this second model fits some of the word frequency data sets more accurately than the first model, in practice the analysis based on it is not as useful because it does not allow one to estimate the word frequency distribution of the vocabulary.
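The difference between the two orderings of the mixing and truncation stages can be made concrete numerically. The sketch below is an illustration only: it uses the standard inverse Gaussian density with mean `mu` and shape `lam`, a plain midpoint rule in place of any closed-form expression, and function names and parameter values that are assumptions made here rather than anything taken from the paper.

```python
from math import exp, expm1, lgamma, log, pi, sqrt


def inverse_gaussian_pdf(x, mu, lam):
    """Inverse Gaussian density with mean mu and shape lam, for x > 0."""
    return sqrt(lam / (2 * pi * x ** 3)) * exp(-lam * (x - mu) ** 2 / (2 * mu ** 2 * x))


def poisson_pmf(r, x):
    """Poisson probability of count r when the expected value is x > 0."""
    return exp(-x + r * log(x) - lgamma(r + 1))


def integrate(g, upper=60.0, n_steps=3000):
    """Midpoint rule on (0, upper); crude, but it never evaluates g at 0."""
    step = upper / n_steps
    return step * sum(g((i + 0.5) * step) for i in range(n_steps))


def mixed_poisson(r, mu, lam):
    """Untruncated inverse Gaussian-Poisson mixture probability of count r."""
    return integrate(lambda x: poisson_pmf(r, x) * inverse_gaussian_pdf(x, mu, lam))


def truncated_mixture(r, mu, lam):
    """Mix first, then truncate at zero: P(r) / (1 - P(0)) for r >= 1."""
    return mixed_poisson(r, mu, lam) / (1.0 - mixed_poisson(0, mu, lam))


def mixture_of_truncated(r, mu, lam):
    """Truncate the Poisson at zero first, then mix over its expected value."""
    return integrate(
        lambda x: poisson_pmf(r, x) / -expm1(-x) * inverse_gaussian_pdf(x, mu, lam)
    )


if __name__ == "__main__":
    mu, lam = 2.0, 3.0  # arbitrary illustrative values
    for r in range(1, 6):
        print(r, truncated_mixture(r, mu, lam), mixture_of_truncated(r, mu, lam))
```

Both functions define probability distributions on the counts 1, 2, 3, and so on, but they assign different probabilities to the same count, which is why switching the order of the two stages yields a genuinely different model.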
The statistical analysis of the heterogeneity of the style of a text often leads to the analysis of contingency tables of ordered rows. When multiple authorship is suspected, one can explore that heterogeneity through either a change-point analysis of these rows, consistent with sudden changes of author, or a cluster analysis of them, consistent with authors contributing exchangeably, without taking order into consideration. Here an analysis is proposed that strikes a compromise between change-point and cluster analysis by incorporating the fact that parts close together are more likely to belong to the same author than parts far apart. The approach is illustrated by revisiting the authorship attribution of Tirant lo Blanc.
The analysis of word frequency count data can be very useful in authorship attribution problems. Zero-truncated generalized inverse Gaussian-Poisson mixture models are very helpful in the analysis of these kinds of data because their model-mixing density estimates can be used as estimates of the density of the word frequencies of the vocabulary. It is found that this model provides excellent fits for the word frequency counts of very long texts, where the truncated inverse Gaussian-Poisson special case fails because it does not allow for the large degree of over-dispersion in the data. The role played by the three parameters of this truncated GIG-Poisson model is also explored. Our second goal is to compare the fit of the truncated GIG-Poisson mixture model with the fit of the model that results from switching the order of the mixing and truncation stages. A heuristic interpretation of the mixing distribution estimates obtained under this alternative GIG-truncated Poisson mixture model is also provided.
Key words: categorical data, generalized inverse Gaussian, mixture model, Poisson mixture, stylometry, truncated model, truncated mixture, word frequency.
We propose a statistical analysis of the heterogeneity of literary style in a set of texts that simultaneously uses different stylometric characteristics, such as word length and the frequency of function words. The data set consists of several tables with the same number of rows, with the i-th row of all tables corresponding to the i-th text. The analysis proposed clusters the rows of all these tables simultaneously into groups with homogeneous style, based on a finite mixture of sets of multinomial models, one set for each table. Unlike the usual heuristic cluster analysis approaches, our method naturally incorporates the text size, the discrete nature of the data, and the dependence between categories in the analysis. The model is checked and chosen with the help of posterior predictive checks, together with closed-form expressions for the posterior probability that each of the models considered is appropriate. This is illustrated through an analysis of the heterogeneity in Shakespeare's plays, and by revisiting the authorship attribution problem of Tirant lo Blanc.
Key words: Authorship, Cluster analysis, Multinomial distribution.
Authors: Marti Font (Professor, marti.font@upc.edu), Xavier Puig (Professor, xavier.puig@upc.edu) and Josep Ginebra (Professor, josep.ginebra@upc.edu).
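The idea of clustering the rows of several count tables at once with a finite mixture of multinomial models can be sketched as follows. This is not the Bayesian analysis described in the abstract: as a simpler stand-in it fits the mixture by maximum likelihood with the EM algorithm, and the function name, the round-robin starting assignment, the `smoothing` constant and the toy data are all assumptions made for illustration.

```python
from math import exp, log


def em_multinomial_mixture(tables, n_clusters, n_iter=50, smoothing=0.5):
    """Cluster texts with a mixture of products of multinomials, fitted by EM.

    tables[t][i] is the count vector of text i in table t, so every table
    has one row per text. Returns resp, where resp[i][k] is the estimated
    probability that text i belongs to cluster k.
    """
    n_texts = len(tables[0])
    # Deterministic start (an assumption): round-robin hard assignment.
    resp = [[1.0 if i % n_clusters == k else 0.0 for k in range(n_clusters)]
            for i in range(n_texts)]
    for _ in range(n_iter):
        # M-step: smoothed cluster weights and per-table category probabilities.
        raw = [smoothing + sum(r[k] for r in resp) for k in range(n_clusters)]
        weights = [w / sum(raw) for w in raw]
        probs = []  # probs[t][k][j]: probability of category j of table t in cluster k
        for table in tables:
            per_cluster = []
            for k in range(n_clusters):
                sums = [smoothing + sum(resp[i][k] * table[i][j] for i in range(n_texts))
                        for j in range(len(table[0]))]
                per_cluster.append([s / sum(sums) for s in sums])
            probs.append(per_cluster)
        # E-step: responsibilities from the joint log-likelihood over all tables.
        for i in range(n_texts):
            logp = [log(weights[k])
                    + sum(c * log(probs[t][k][j])
                          for t, table in enumerate(tables)
                          for j, c in enumerate(table[i]))
                    for k in range(n_clusters)]
            top = max(logp)
            w = [exp(v - top) for v in logp]
            resp[i] = [v / sum(w) for v in w]
    return resp


if __name__ == "__main__":
    # Toy data: six texts, a word-length table and a function-word table.
    word_length = [[60, 30, 10], [58, 32, 10], [62, 28, 10],
                   [20, 30, 50], [22, 28, 50], [18, 32, 50]]
    function_words = [[70, 30], [68, 32], [72, 28],
                      [30, 70], [32, 68], [28, 72]]
    for row in em_multinomial_mixture([word_length, function_words], n_clusters=2):
        print([round(p, 3) for p in row])
```

Because the likelihood of each text is a product of multinomial terms, longer texts automatically carry more weight in the assignment, which reflects the point made above about incorporating text size and the discrete nature of the data.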