This paper is a contribution to the discussion on compiling computational lexical resources from conventional dictionaries. It describes the theoretical as well as practical problems that are encountered when reusing a conventional dictionary for compiling a lexical-semantic resource in the form of a wordnet. More specifically, it describes the methodological issues of compiling a wordnet for Danish, DanNet, from a monolingual basis, and not, as is often seen, by applying the translational expansion method with Princeton WordNet as the English source. Thus, we apply as our basis a large, corpus-based printed dictionary of modern Danish. Using this approach, we discuss the issues of readjusting inconsistent and/or underspecified hyponymy hierarchies taken from the conventional dictionary, sense distinctions as opposed to the synonym sets of wordnets, generating semantic wordnet relations on the basis of sense definitions, and finally, supplementing missing or implicit information.
The paper examines the relationship between users' lookup logs in an online dictionary and the corpus frequency of words. The study was prompted by reflections that arose during routine dictionary work and can be summed up in one question: how do we keep a corpus-based dictionary up to date? Should the next word to be included in the dictionary be the one that follows the most recently included word on the corpus frequency list? Or should it be the word that users most often look up unsuccessfully? To arrive at suitable criteria, the authors analysed the lookup logs of users of a Danish dictionary for the period 2009 to 2012 and compared the list of most frequently looked-up words with their corpus frequency. By examining users' lookup habits, the authors sought answers to the following questions: Are there words in the dictionary that users never look up? If so, can any meaningful patterns be observed on the basis of their corpus frequency: do the words belong to the same part of speech, are they very frequent or very rare, do they occur in a particular frequency range? The paper finds that corpus frequency is a good criterion for the 20,000 most frequent headwords, whereas for less frequent words other methods must be added, including a review of user lookups, and the judgement of lexicographers is also of great importance.
We investigate a method of updating a Danish monolingual dictionary with new semantic information on already included lemmas in a systematic way, based on the hypothesis that the variation in bigrams over time in a corpus might indicate changes in the meaning of one of the words. The method combines corpus statistics with manual annotations. The first step consists in measuring the collocational change in a homogeneous newswire corpus with texts from a 14-year time span, 2005 through 2018, by calculating all the statistically significant bigrams. These are then applied to a new version of the corpus that is split into one sub-corpus per year. We then collect all the bigrams that do not appear at all in the first three years, but appear at least 20 times in the following 11 years. The output, a dataset of 745 bigrams considered to be potentially new in Danish, is double-annotated and, depending on the annotations and the inter-annotator agreement, either discarded or divided into groups of relevant data for further investigation. We then carry out a more thorough lexicographical study of the bigrams in order to determine the degree to which they support the identification of new senses and lead to revised sense inventories for at least one of the words. Furthermore, we study the relation between the revisions carried out, the annotation values and the degree of inter-annotator agreement. Finally, we compare the resulting updates of the dictionary with Cook et al. (2013), and discuss whether the method might lead to a more consistent way of revising and updating the dictionary in the future.
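The filtering step in this abstract (bigrams absent from the first three years but occurring at least 20 times in the following 11) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `count_bigrams` and `emerging_bigrams`, the per-year dictionary layout, and the assumption that sentences are already tokenised are all hypothetical, and the set of statistically significant candidate bigrams is taken as given rather than computed.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Set, Tuple

Bigram = Tuple[str, str]


def count_bigrams(sentences: Iterable[Sequence[str]]) -> Counter:
    """Count adjacent token pairs in pre-tokenised sentences (hypothetical helper)."""
    counts: Counter = Counter()
    for tokens in sentences:
        # zip pairs each token with its right-hand neighbour
        counts.update(zip(tokens, tokens[1:]))
    return counts


def emerging_bigrams(
    counts_by_year: Dict[int, Counter],
    candidates: Set[Bigram],
    baseline_years: Iterable[int],
    later_years: Iterable[int],
    min_later_count: int = 20,
) -> Dict[Bigram, int]:
    """Keep candidates with zero baseline occurrences and enough later ones.

    `candidates` stands in for the statistically significant bigrams from the
    first step; how significance is measured is not specified here.
    """
    baseline = list(baseline_years)
    later = list(later_years)
    empty: Counter = Counter()
    selected: Dict[Bigram, int] = {}
    for bigram in candidates:
        # A Counter returns 0 for unseen keys, so missing bigrams are handled.
        baseline_total = sum(counts_by_year.get(y, empty)[bigram] for y in baseline)
        later_total = sum(counts_by_year.get(y, empty)[bigram] for y in later)
        if baseline_total == 0 and later_total >= min_later_count:
            selected[bigram] = later_total
    return selected
```

With the time span described above, `baseline_years` would be `range(2005, 2008)` and `later_years` would be `range(2008, 2019)`; the returned dictionary maps each retained bigram to its total count in the later years, which could then be handed to annotators.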