Almost any conceivable authorship attribution problem can be reduced to one fundamental problem: whether a pair of (possibly short) documents were written by the same author. In this article, we offer an (almost) unsupervised method for solving this problem with surprisingly high accuracy. The main idea is to use repeated feature subsampling methods to determine if one document of the pair allows us to select the other from among a background set of "impostors" in a sufficiently robust manner.
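The main idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the character 4-gram features, cosine similarity, the 50% feature fraction, the 100 iterations, and the 0.5 decision threshold are all assumptions chosen for illustration, and the names `char_ngrams`, `cosine`, `impostor_score`, and `same_author` are hypothetical. In each round a random subset of features is drawn, and the pair scores a point only if the second document is closer to the first than every impostor is.

```python
import random
from collections import Counter
from math import sqrt


def char_ngrams(text, n=4):
    """Count overlapping character n-grams (an assumed feature type)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b, features):
    """Cosine similarity restricted to the sampled feature subset."""
    dot = sum(a[f] * b[f] for f in features)
    norm_a = sqrt(sum(a[f] ** 2 for f in features))
    norm_b = sqrt(sum(b[f] ** 2 for f in features))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def impostor_score(x, y, impostors, iterations=100, feature_fraction=0.5, seed=0):
    """Fraction of rounds in which y beats every impostor as the match for x."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    vx, vy = char_ngrams(x), char_ngrams(y)
    v_imps = [char_ngrams(t) for t in impostors]
    vocab = sorted(set(vx) | set(vy) | set().union(*v_imps))
    k = max(1, int(len(vocab) * feature_fraction))
    wins = 0
    for _ in range(iterations):
        feats = rng.sample(vocab, k)  # repeated feature subsampling
        sim_xy = cosine(vx, vy, feats)
        if all(sim_xy > cosine(vx, vi, feats) for vi in v_imps):
            wins += 1
    return wins / iterations


def same_author(x, y, impostors, threshold=0.5, **kwargs):
    """Decide 'same author' when the score reaches an assumed threshold."""
    return impostor_score(x, y, impostors, **kwargs) >= threshold
```

A call such as `same_author(doc_a, doc_b, impostor_texts)` returns a boolean. In practice the threshold would be tuned on pairs with known authorship, and the impostor set would be drawn from texts of comparable genre and length.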
We discuss a real-world application of a recently proposed machine learning method for authorship verification. Authorship verification is considered an extremely difficult task in computational text classification, because it does not assume that the correct author of an anonymous text is included in the candidate authors available. To determine whether 2 documents have been written by the same author, the verification method discussed uses repeated feature subsampling and a pool of impostor authors. We use this technique to attribute a newly discovered Latin text from antiquity (the Compendiosa expositio) to Apuleius. This North African writer was one of the most important authors of the Roman Empire in the 2nd century and authored one of the world's first novels. This attribution has profound and wide-reaching cultural value, because it has been over a century since a new text by a major author from antiquity was discovered. This research therefore illustrates the rapidly growing potential of computational methods for studying the global textual heritage.