This paper reviews the Wikipedias of the smallest languages for which a Wikipedia is available. We compute the most frequent shared articles (by automatically extracting their translation links), categories, and other features. By analysing these data and tentatively aligning them with the literature, we aim to understand which interests could connect endangered languages. The discussion addresses which needs for an emerging digital infrastructure these results could point to, as well as the crucial question of how representative these data are for small and endangered languages.
The challenge of POS tagging and lemmatization in morphologically rich languages is examined by comparing German and Latin. We start by defining an NLP evaluation roadmap to model the combination of tools and resources guiding our experiments. We focus on what a practitioner can expect when using state-of-the-art solutions. These solutions are then compared with old(er) methods and implementations for coarse-grained POS tagging, as well as fine-grained (morphological) POS tagging (e.g. case, number, mood). We examine to what degree recent advances in tagger development have improved accuracy – and at what cost, in terms of training and processing time. We also conduct in-domain vs. out-of-domain evaluation. Out-of-domain evaluation is particularly pertinent because the distribution of data to be tagged will typically differ from the distribution of data used to train the tagger. Pipeline tagging is then compared with a tagging approach that acknowledges dependencies between inflectional categories. Finally, we evaluate three lemmatization techniques.
Computer philology pursues two goals for corpora of ancient manuscripts: first, making an edition, that is, roughly speaking, a single text version representing the whole corpus, which contains variation introduced through copy errors and other processes; and second, producing a stemma. A stemma is a graph-based visualization of the copy history with manuscripts as nodes and copy events as edges. Its root, the so-called archetype, is the supposed original text or urtext from which all subsequent copies are made. Our main contribution is to present one of the first computational approaches to automatic archetype reconstruction and to introduce the first text-based evaluation for automatically produced archetypes. We compare a philologically generated archetype with one generated by bioinformatic software.
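The definition of a stemma as a graph with manuscripts as nodes and copy events as edges can be illustrated with a minimal sketch. The manuscript names and the helper functions `build_stemma` and `find_archetype` are hypothetical and chosen only for illustration; the sketch assumes an uncontaminated tradition (every copy has exactly one exemplar) and does not represent the reconstruction method of the paper.

```python
from collections import defaultdict


def build_stemma(copy_events):
    """Map each exemplar to the list of manuscripts copied from it."""
    children = defaultdict(list)
    for exemplar, copy in copy_events:
        children[exemplar].append(copy)
    return dict(children)


def find_archetype(copy_events):
    """Return the single manuscript that was never copied from another one."""
    exemplars = {exemplar for exemplar, _ in copy_events}
    copies = {copy for _, copy in copy_events}
    roots = exemplars - copies
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found %d" % len(roots))
    return next(iter(roots))


# Hypothetical tradition: each pair is one copy event (exemplar, copy).
copy_events = [("archetype", "A"), ("archetype", "B"), ("A", "C")]
stemma = build_stemma(copy_events)
root = find_archetype(copy_events)
```

Here the root has two children, so this toy stemma has a root bifurcation.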
We investigate parts of the mathematical foundations of stemmatology, the science of reconstructing the copying history of manuscripts. After Joseph Bédier became suspicious in 1928 about the large number of root bifurcations he found in reconstructed stemmata, Paul Maas replied in 1937 with a mathematical argument: the proportion of root-bifurcating stemmata among all possible stemmata is so large that finding them in abundance should not arouse suspicion. While Maas' argument was based on one example with a tradition of three surviving manuscripts, we show in this paper that for the whole class of trees corresponding to Maasian reconstructed stemmata, and likewise for the class of trees corresponding to complete historical manuscript genealogies, root bifurcation is a priori the most expected root degree type. We do this by providing a combinatorial formula for the number of possible so-called Greg trees according to their root degree (Flight, 1990). Additionally, for complete historical manuscript trees (regardless of loss), which coincide mathematically with rooted labeled trees, we provide formulas for root degrees and derive the asymptotic degree distribution. We find that root bifurcations are extremely numerous in both kinds of trees. Therefore, while other studies have previously shown that root bifurcations are to be expected for true stemmata, we extend this finding to all three philologically relevant types of trees that have been discussed at length to date.
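For the rooted labeled tree case only, counting by root degree can be sketched as follows. The sketch uses the classical Prüfer-sequence argument (a vertex of degree k appears k - 1 times in the sequence), which is an assumption of this illustration and not necessarily the derivation or the root degree convention used in the paper; it does not cover Greg trees. The function names are hypothetical.

```python
from itertools import product
from math import comb


def root_degree_counts(n):
    """Count rooted labeled trees on n >= 2 nodes by root degree k.

    Closed form from the Pruefer-sequence argument:
    n * C(n-2, k-1) * (n-1)**(n-1-k) for k = 1 .. n-1.
    """
    return {
        k: n * comb(n - 2, k - 1) * (n - 1) ** (n - 1 - k)
        for k in range(1, n)
    }


def root_degree_counts_bruteforce(n):
    """Same counts by enumerating every Pruefer sequence and every root."""
    counts = {k: 0 for k in range(1, n)}
    for sequence in product(range(n), repeat=n - 2):
        for root in range(n):
            # Degree of a vertex is 1 plus its occurrences in the sequence.
            counts[1 + sequence.count(root)] += 1
    return counts


def root_degree_shares(n):
    """Proportion of rooted labeled trees on n nodes with each root degree."""
    total = n ** (n - 1)  # Cayley's formula for rooted labeled trees
    return {k: c / total for k, c in root_degree_counts(n).items()}
```

The brute-force variant is only feasible for small n, but it gives an independent check of the closed form; `root_degree_shares(n)` can then be evaluated for larger n to inspect how the shares behave as the tradition grows.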