We are presenting a working system for automated news analysis that ingests an average total of 7600 news articles per day in five languages. For each language, the system detects the major news stories of the day using a group-average unsupervised agglomerative clustering process. It also tracks, for each cluster, related groups of articles published over the previous seven days, using a cosine of weighted terms. The system furthermore tracks related news across languages, in all language pairs involved. The cross-lingual news cluster similarity is based on a linear combination of three types of input: (a) cognates, (b) automatically detected references to geographical place names and (c) the results of a mapping process onto a multilingual classification system. A manual evaluation showed that the system produces good results.
We present a tool that extracts person names from multilingual news collections and matches name variants referring to the same person. A novel feature is the matching of name variants across languages and writing systems, including names written with the Greek, Cyrillic and Arabic writing system. Due to our highly multilingual setting, we use an internal standard representation for name representation and matching, instead of adopting the traditional bilingual approach to transliteration. This work is part of a news analysis system that clusters an average of 15,000 news articles per day to detect related news within the same and across different languages.After giving some background on name transliteration and referring to related work (Section Background and related work), we describe tools to identify names in text (Section Proper name recognition) and the mechanism to merge name variants, including those written in Cyrillic, Arabic, and Greek script (Section Detecting and merging name variants). This is followed by evaluation results (Section Evaluation). Table 1: Overview of a recognised person name in nine languages, showing various orthographies for the same person. The words in italics show the recognised trigger word(s).
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.