A new method for the recognition of meaningful changes in social state based on transformations of the linguistic content in Arabic newspapers is suggested. The detected alterations of the linguistic material in Arabic newspapers play an indicator role. The currently proposed approach acts in an “online” fashion and uses pre-trained vector representations of Arabic words. After a pre-processing stage, the words in the issues’ texts are substituted by vectors obtained within a word embedding methodology. The approach typifies the consistent linguistic template by the similarity of the embedded vectors. A change in the distributions of the issue-grounded samples indicates a difference in the underlying newspaper template. A two-step procedure implements the concept, where the first step compares the similarity distribution of the current issue versus the union of ones corresponding to several of its predecessors. A repeating under-sampling approach accompanied by a two-sample test stabilizes the sampling and returns a collection of the resultant p-values. In the second stage, the entropy of these sets is sequentially calculated, such that the change points of the time series obtained in this way indicate the changes in the newspaper content. Numerical experiments provided on the following issues of several Arabic newspapers published in the Arab Spring period demonstrate the high reliability of the method.
The paper proposes a new approach for intrinsic plagiarism detection, based on a new unique method, which enables identifying style changes in a text using novel chronology-based similarity measures. A model for finding significant deviations in the style across a given document is constructed aiming to indicate text parts which are suspected to be written by co-authors, or to be devoted to a different thematic, or to be a plagiarism. We consider each segment as "result of the text evolution" provided by its predecessors in the text. Resting upon this evolution standpoint, the metric evaluating dissimilarity between two given segments is introduced, and a text is clustered using this measure aiming to turn out disparity of the text. We also propose a new clustering procedure involving an embedding of data in an Euclidean space with subsequent clustering using the K-means approach. The obtained results demonstrate high ability of the method.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.