With the widespread and increased consumption of online news, there is a rising need for automated analysis of news text. Topic models have proven to be useful tools for unsupervised discovery of topics from large amounts of text, including news media texts. Topics produced by a topic model are often represented as probability-weighted word lists, and these are expected to correspond to semantic topics, that is, semantic concepts representable by a topic model. However, because the quality of topics varies and not all topics correspond to semantic topics in practice, much research effort has been devoted to automated evaluation of topic models. One class of popular and effective methods focuses on topic coherence as a measure of a topic's semantic interpretability and its correspondence to a semantic topic. Existing topic coherence methods calculate the coherence score based on the semantic similarity of topic-related words. However, news media texts revolve around specific news stories, giving rise to many contingent and transient topics for which topic-related words tend to be semantically unrelated. Consequently, the coherence of many news media topics cannot be detected by state-of-the-art word-based coherence measures. In this paper, we propose a novel class of topic coherence methods that estimate topic coherence based on topic documents rather than topic words. We evaluate the proposed methods on two datasets containing topics manually labeled for document-based coherence, derived from US and Croatian news text corpora. Our best-performing document-based coherence measure achieves an AUC score above 0.8, substantially outperforming a strong baseline method and state-of-the-art word-based coherence methods. We also demonstrate that there may be benefit in combining word- and document-based coherence measures. Lastly, we demonstrate the usefulness of document-based coherence measures for automated topic discovery from news media texts.
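To make the idea of scoring a topic by its documents rather than its words concrete, here is a minimal Python sketch. It is not the method proposed in the paper, which the abstract does not specify; it simply assumes that a topic's top documents are available as plain strings and scores the topic by the mean pairwise cosine similarity of their bag-of-words vectors. The function names and the whitespace tokenization are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Lowercased whitespace tokens with their counts (assumed tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def document_coherence(top_documents):
    """Mean pairwise cosine similarity of a topic's top documents.

    Hypothetical stand-in for a document-based coherence score:
    higher values mean the topic's documents resemble one another.
    """
    vectors = [bag_of_words(doc) for doc in top_documents]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Documents about one news story versus documents about unrelated stories.
same_story = ["storm hits coast", "storm damages coast towns", "coast storm warning"]
unrelated = ["storm hits coast", "bank raises rates", "team wins final"]
print(document_coherence(same_story) > document_coherence(unrelated))
```

A score of this kind depends only on how similar the topic's documents are to each other, so it can stay high for a transient news story whose top words are semantically unrelated.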
Domestic cattle contribute enormously to human societies and, among other essential benefits, are the economically most important domestic animal in the world today. To expand existing knowledge on cattle domestication and mitogenome diversity, we performed a comprehensive complete mitogenome analysis of the species (802 sequences, 114 breeds). A large sample was collected in South‐east Europe, an important agricultural gateway to Europe during Neolithization and a region rich in cattle biodiversity. We found 1725 polymorphic sites (810 singletons, 853 parsimony‐informative sites and 57 indels), 701 unique haplotypes, a haplotype diversity of 0.9995 and a nucleotide diversity of 0.0015. In addition to the dominant T3 and several rare haplogroups (Q, T5, T4, T2 and T1), we identified a maternal line in Austrian Murbodner cattle that carries the surviving aurochs mitochondrial haplotype P1, which diverged prior to the Neolithization process. This is convincing evidence for rare female‐mediated adaptive introgression of wild aurochs into domesticated cattle in Europe. We revalidated the existing haplogroup classification and provided Bayesian phylogenetic inference with a more precise estimated divergence time than previously available. Occasionally, classification based on partial mitogenomes was not reliable; for example, some individuals with haplogroups P and T5 were not recognized based on D‐loop information. Bayesian skyline plot estimates (median) show that the earliest population growth began before domestication in cattle with haplogroup T2, followed by Q (~10.0–9.5 kyBP), whereas cattle with T3 (~7.5 kyBP) and T1 (~3.0–2.5 kyBP) expanded later. Overall, our results support the existence of interactions between aurochs and cattle during domestication and dispersal of cattle in the past, contribute to the conservation of maternal cattle diversity and enable functional analyses of the surviving aurochs P1 mitogenome.
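The two summary statistics quoted above, haplotype diversity and nucleotide diversity, can be illustrated with a short Python sketch. It uses the standard textbook definitions (haplotype diversity as n/(n-1) times one minus the sum of squared haplotype frequencies, and nucleotide diversity as the mean number of pairwise differences per site). It assumes aligned sequences of equal length and ignores indels, so it is not a reconstruction of the authors' pipeline; the toy sequences are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def haplotype_diversity(sequences):
    """Haplotype diversity: n / (n - 1) * (1 - sum of squared frequencies)."""
    n = len(sequences)
    if n < 2:
        raise ValueError("at least two sequences are required")
    counts = Counter(sequences)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


def nucleotide_diversity(sequences):
    """Mean pairwise differences per site over all sequence pairs.

    Assumes the sequences are aligned and of equal length (no gap handling).
    """
    n = len(sequences)
    if n < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(sequences, 2)
    )
    n_pairs = n * (n - 1) // 2
    return total_diffs / (n_pairs * length)


# Toy alignment (hypothetical data): three distinct haplotypes among four sequences.
toy = ["AAAA", "AAAT", "AAAA", "TAAT"]
print(haplotype_diversity(toy), nucleotide_diversity(toy))
```

With 701 unique haplotypes among 802 sequences, almost every pair of sequences differs, which is why the reported haplotype diversity is so close to 1 while the per-site nucleotide diversity remains small.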
We present a data structure for storing huge natural language data sets that is very efficient in terms of both space and access speed. The structure is described as an LZ (Lempel-Ziv) compressed linked list trie and goes a step beyond the directed acyclic word graph in automata compression. We are using the structure to store DELAF, a huge French lexicon with syntactical, grammatical and lexical information associated with each word. The compressed structure can be produced in O(N) time using suffix trees for finding repetitions in the trie, but for large data sets space requirements are more prohibitive than time, so suffix arrays are used instead, with compression time complexity O(N log N) for all but the largest data sets.
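As a rough illustration of the uncompressed starting point, the following Python sketch shows a linked list trie in which each node stores one character, a pointer to its first child and a pointer to its next sibling. The LZ compression step itself (finding and replacing repeated substructures) is not shown, since the abstract gives no details of it; the class and attribute names are assumptions made for this example.

```python
class Node:
    """One trie node in first-child / next-sibling (linked list) form."""

    __slots__ = ("char", "child", "sibling", "final")

    def __init__(self, char):
        self.char = char
        self.child = None     # first node of the child list
        self.sibling = None   # next node in the same child list
        self.final = False    # True if a stored word ends here


class LinkedListTrie:
    def __init__(self):
        self.root = Node("")

    def insert(self, word):
        node = self.root
        for ch in word:
            nxt = node.child
            # Walk the sibling list looking for the character.
            while nxt is not None and nxt.char != ch:
                nxt = nxt.sibling
            if nxt is None:
                # Not found: prepend a new node to the child list.
                nxt = Node(ch)
                nxt.sibling = node.child
                node.child = nxt
            node = nxt
        node.final = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.child
            while node is not None and node.char != ch:
                node = node.sibling
            if node is None:
                return False
        return node.final


trie = LinkedListTrie()
for w in ["chat", "chats", "chien"]:
    trie.insert(w)
print("chats" in trie, "cha" in trie)
```

Laid out sequentially, such a node list becomes a long symbol sequence in which repeated runs of nodes can be detected, which is where suffix trees or suffix arrays would come in for the compression described above.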
An efficient algorithm for trie compression has already been described. Here we present its practical value and demonstrate its superiority over other methods of lexicon compression in terms of space savings. Apart from simple lexicons, a compressed trie can, with some additional processing, be used as a component in the compact representation of simple static databases. We present the potential of the algorithm in compressing natural language dictionaries.
Background: Identification of genes or even nucleotides that are responsible for quantitative and adaptive trait variation is a difficult task due to the complex interdependence between a large number of genetic and environmental factors. The polymorphism of the mitogenome is one of the factors that can contribute to quantitative trait variation. However, the effects of the mitogenome have not been comprehensively studied, since large numbers of mitogenome sequences and recorded phenotypes are required to reach adequate power of analysis. Current research in our group focuses on acquiring the necessary mitochondrial sequence information and analysing its influence on the phenotype of a quantitative trait. To facilitate these tasks we have produced software for processing pedigrees that is optimised for maternal lineage analysis.
Results: We present MaGelLAn 1.0 (maternal genealogy lineage analyser), a suite of four Python scripts (modules) that is designed to facilitate the analysis of the impact of mitogenome polymorphism on quantitative trait variation by combining molecular and pedigree information. MaGelLAn 1.0 is primarily used to: (1) optimise the sampling strategy for molecular analyses; (2) identify and correct pedigree inconsistencies; and (3) identify maternal lineages and assign the corresponding mitogenome sequences to all individuals in the pedigree, this information being used as input to any of the standard software for quantitative genetic (association) analysis. In addition, MaGelLAn 1.0 allows computing the mitogenome (maternal) effective population sizes and probability of mitogenome (maternal) identity that are useful for conservation management of small populations.
Conclusions: MaGelLAn is the first tool for pedigree analysis that focuses on quantitative genetic analyses of mitogenome data. It is designed to significantly reduce the effort of handling and preparing large pedigrees for processing the information linked to maternal lines. The software source code, along with the manual and the example files, can be downloaded at http://lissp.irb.hr/software/magellan-1-0/ and https://github.com/sristov/magellan.
Electronic supplementary material: The online version of this article (doi:10.1186/s12711-016-0242-9) contains supplementary material, which is available to authorized users.
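The idea of identifying maternal lineages and propagating a sampled mitogenome to every individual in the same lineage can be sketched in a few lines of Python. This is not MaGelLAn's code or interface; it assumes a pedigree given as a hypothetical dictionary mapping each individual to its dam (or None when the dam is unknown), and all function names and identifiers are invented for illustration.

```python
def founder_dam(individual, dam_of):
    """Follow dam links upward until an individual with no recorded dam."""
    seen = set()
    current = individual
    while dam_of.get(current) is not None:
        if current in seen:
            # An individual cannot be its own maternal ancestor.
            raise ValueError(f"pedigree loop detected at {current!r}")
        seen.add(current)
        current = dam_of[current]
    return current


def assign_haplotypes(dam_of, sampled):
    """Give every individual the haplotype sequenced in its maternal lineage.

    dam_of:  {individual: dam or None}  (assumed input format)
    sampled: {individual: haplotype label} for sequenced animals
    """
    individuals = set(dam_of) | {d for d in dam_of.values() if d is not None}
    lineage = {ind: founder_dam(ind, dam_of) for ind in individuals}

    lineage_haplotype = {}
    for ind, hap in sampled.items():
        founder = lineage[ind]
        known = lineage_haplotype.setdefault(founder, hap)
        if known != hap:
            # Two samples from one maternal line should share a mitogenome.
            raise ValueError(f"conflicting haplotypes in lineage {founder!r}")

    return {ind: lineage_haplotype.get(lineage[ind]) for ind in individuals}


# Hypothetical pedigree: two founder dams, one sequenced descendant.
pedigree = {"F1": None, "D1": "F1", "D2": "D1", "S1": "F1", "F2": None, "D3": "F2"}
print(assign_haplotypes(pedigree, {"D2": "T3"}))
```

Because the mitogenome is inherited through the dam, one sequenced animal per maternal line is enough in this sketch to label the whole line, which also suggests why choosing which animals to sequence matters for the sampling strategy. Conflicting haplotypes within one line or loops in the dam links are raised as errors here, as simple examples of pedigree inconsistencies.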