Given an unsegmented multi-author text, we wish to automatically separate out distinct authorial threads. We present a novel, entirely unsupervised method that achieves strong results on multiple testbeds, including those for which authorial threads are topically identical. Unlike previous work, our method requires no specialized linguistic tools and can be easily applied to any text.
Abstract. An increasing number of valuable data sources, advances in Internet of Things and Big Data technologies, and the availability of a wide range of machine learning algorithms offer new potential to deliver analytical services to citizens and urban decision makers. However, there is still a gap in combining the current state of the art into an integrated framework that would help reduce development costs and enable new kinds of services. In this chapter, we show what such an integrated Big Data analytical framework for Internet of Things and Smart City applications could look like. The contributions of this chapter are threefold: (1) we provide an overview of Big Data and Internet of Things technologies, including a summary of their relationships, (2) we present a case study in the smart grid domain that illustrates the high-level requirements for such an analytical Big Data framework, and (3) we present an initial version of such a framework, mainly addressing the volume and velocity challenges. The findings presented in this chapter are extended results from the EU-funded project BIG and the German-funded project PEC.
We introduce a new measure on linguistic features, called stability, which captures the extent to which a language element, such as a word or a syntactic construct, is replaceable by semantically equivalent elements. This measure may be perceived as quantifying the degree of available "synonymy" for a language item. We show that frequent but unstable features are especially useful as discriminators of an author's writing style.

Introduction

Often we wish to find linguistic markers that distinguish the writing style of a particular author or class of authors. We seek features that are typically used variably by different authors but are used consistently by any given author, or at least by the author whose writing we wish to distinguish. Obviously, these markers will differ from author to author. In this paper, we wish to identify the pool of potentially useful linguistic features from which markers might fruitfully be chosen. To be precise, we do not seek features that distinguish a particular author, but rather author-independent criteria for ranking features worth considering when seeking distinguishing characteristics of any given author.

Consider some examples. If a particular author were found to use the word awful more frequently than other authors, this would certainly be worth noting. After all, the word bad is generally a reasonable, and more common, alternative to awful. The fact that our author chooses to use the word awful frequently therefore likely reflects a deliberate stylistic choice that we might profitably exploit for identifying the author's writing. What about the word touchdown? The frequency of use of touchdown is a rather poor feature for identifying an author's style, for two distinct reasons. First, there is no alternative word for expressing the concept touchdown, so its use does not reflect stylistic choice. Second, touchdown is tightly tied to a particular topic, so its frequency of use in one corpus is unlikely to reflect the frequency with which it will be used in other documents, which might concern other topics. By contrast, awful is commonly used across most topics and has plausible alternatives. Some words satisfy one of these two criteria but not both. For example, perspire offers a plausible alternative (sweat) but is not frequently used across a broad range of topics. On the other hand, words like blue and ten are used across topics but do not have common alternatives.

To summarize, we seek linguistic features whose use might reflect deliberate stylistic choice by a given author. Such features will tend to be used across topics and will offer plausible linguistic alternatives. The first criterion is relatively easy to approximate by checking overall frequency of use across different topic areas. We will focus on a formal definition of the second criterion: how can we determine the extent to which a particular linguistic feature offers alternatives? Although the examples we gave are all words, the criteria we define will ...
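The excerpt above motivates two criteria, cross-topic frequency and the availability of alternatives, without giving the formal definition of stability. The following sketch is not the paper's measure; purely for illustration, it assumes WordNet synonym counts as a proxy for how replaceable a word is and document frequency over a topic-diverse corpus as a proxy for cross-topic usage.

```python
# Illustrative sketch only: NOT the paper's stability measure.
# Assumed proxies: WordNet synonymy ~ "availability of alternatives",
# document frequency over a topic-diverse corpus ~ "used across topics".
from collections import Counter
from nltk.corpus import wordnet as wn


def synonym_count(word: str) -> int:
    """Distinct WordNet lemmas that could stand in for `word` (replaceability proxy)."""
    lemmas = {l.name().lower() for s in wn.synsets(word) for l in s.lemmas()}
    lemmas.discard(word.lower())
    return len(lemmas)


def doc_frequency(word: str, documents: list[list[str]]) -> float:
    """Fraction of documents (ideally spanning many topics) that contain `word`."""
    return sum(1 for doc in documents if word in doc) / max(len(documents), 1)


def candidate_style_markers(documents: list[list[str]],
                            min_df: float = 0.5, min_syn: int = 3):
    """Frequent-but-replaceable words: high cross-topic document frequency and
    many available alternatives, the kind of feature the text above argues is
    worth examining as a potential style discriminator."""
    vocab = Counter(w for doc in documents for w in doc)
    return [w for w in vocab
            if doc_frequency(w, documents) >= min_df and synonym_count(w) >= min_syn]
```

On the running examples, such a proxy would tend to favor awful (frequent across topics, with alternatives like bad) over touchdown (topic-bound) and over blue or ten (frequent but with few everyday alternatives), though a lexicon-based proxy is crude and will not always match these intuitions.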
Given a multi-author document, we use unsupervised methods to identify distinct authorial threads. Although this problem is of great practical interest for security and forensic reasons, as well as for commercial purposes, this paper is, to the best of our knowledge, the first presentation of a general-purpose method for solving it.
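The two abstracts above state the decomposition problem but not the procedure. Purely as an illustrative baseline, and not the method these papers present, authorial decomposition is often framed as clustering chunks of the document by topic-neutral stylistic features such as function-word counts; a minimal sketch of that framing, with hypothetical parameter choices, follows.

```python
# Illustrative baseline only, not the papers' algorithm: cluster chunks of a
# multi-author document by function-word usage to hypothesize authorial threads.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Small, hypothetical list of topic-neutral function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "was",
                  "for", "with", "as", "but", "not", "which", "on", "by"]


def decompose(sentences: list[str], n_authors: int = 2, chunk_size: int = 30):
    """Group consecutive sentences into chunks, represent each chunk by its
    function-word counts, and cluster the chunks into `n_authors` groups."""
    chunks = [" ".join(sentences[i:i + chunk_size])
              for i in range(0, len(sentences), chunk_size)]
    X = CountVectorizer(vocabulary=FUNCTION_WORDS).fit_transform(chunks)
    labels = KMeans(n_clusters=n_authors, n_init=10, random_state=0).fit_predict(X)
    return list(zip(chunks, labels))  # each chunk paired with its cluster label
```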
Abstract. We show how an Arabic-language religious-political document can be automatically classified according to the ideological stream and organizational affiliation that it represents. Tests show that our methods achieve near-perfect accuracy.