Proceedings of the 13th Linguistic Annotation Workshop 2019
DOI: 10.18653/v1/w19-4009

Harmonizing Different Lemmatization Strategies for Building a Knowledge Base of Linguistic Resources for Latin

Abstract: The interoperability between lemmatized corpora of Latin and other resources that use the lemma as indexing key is hampered by the multiple lemmatization strategies that different projects adopt. In this paper we discuss how we tackle the challenges raised by harmonizing different lemmatization criteria in a project that aims to connect linguistic resources for Latin using the Linked Data paradigm. The paper introduces the architecture supporting an open-ended, lemma-based Knowledge Base, built to make textual…

Cited by 5 publications (5 citation statements)
References 14 publications
“…The fact that OntoLex-Lemon forms are allowed to have multiple written representations is a particularly helpful feature for a language which is attested across circa 25 centuries and in a wide spectrum of genres, and which is, moreover, characterised by a substantial amount of spelling variation. Harmonising different lemmatisation solutions adopted by corpora and NLP tools, however, requires practitioners to deal with other kinds of variation as well [117]. In the case of words with multiple inflectional paradigms or forms which may be interpreted as either autonomous words or inflected forms of a main lemma (such as participles, or adverbs built from adjectives: see e.g.…”
Section: Lila (2018-2023); type: mentioning
confidence: 99%
“…Accordingly, electronic editions in TEI/XML do not normally qualify as Linked Data, even if they use and provide resolvable URIs (TEI pointers). 117 The annotation of rather than within TEI documents, however, has been pursued by Pelagios/Pleiades, a community interested in the annotation of historical documents and maps with geographical identifiers and other forms of geoinformation (though this does not yet run to linguistic annotations). One result of these efforts is the development of a specialised editor called Recogito, and its extension to TEI/XML.…”
Section: Introduction and Overview; type: mentioning
confidence: 99%
“…To those treebanks we add also the corpus of the Latin works of Dante Alighieri (13/14th century), distributed as part of the Dante Search project. 24 All four resources include lemmatisation, which we use to connect the corpus tokens to the lemmas in LiLa following the procedure presented in Mambrini and Passarotti (2019). Once that the tokens in the annotated texts are linked to the LiLa lemmas, we use the SPARQL query language to extract information about the derivational morphemes attested in each corpus.…”
Section: Outside Derivational Data; type: mentioning
confidence: 99%
“…Once that the tokens in the annotated texts are linked to the LiLa lemmas, we use the SPARQL query language to extract information about the derivational morphemes attested in each corpus. While some lemmatised resources, like the IT-TB and the works of Dante, are already accessible via a dedicated endpoint provided by LiLa, 25 virtually any other lemmatised corpus can be linked and searched using local files with the methodology described in Mambrini and Passarotti (2019); the results reported here for PROIEL and LLCT were obtained by querying local files.…”
Section: Outside Derivational Data; type: mentioning
confidence: 99%
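
The citation statements above describe linking lemmatised corpus tokens to the lemmas of a knowledge base whose entries may carry several written representations. The following is a minimal illustrative sketch of that general idea, not the procedure of Mambrini and Passarotti (2019): the lemma bank, the `example.org` URIs, the `normalize` rule (lowercasing and merging j/i and v/u) and all function names are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical lemma bank. URIs are invented placeholders, not real identifiers.
# Each entry may list several written representations (spelling variants).
LEMMA_BANK = [
    {"uri": "http://example.org/lemma/1", "pos": "NOUN",
     "written_reps": ["sulphur", "sulfur", "sulpur"]},
    # Two homographic verbs (e.g. dico, dicere vs. dico, dicare).
    {"uri": "http://example.org/lemma/2", "pos": "VERB", "written_reps": ["dico"]},
    {"uri": "http://example.org/lemma/3", "pos": "VERB", "written_reps": ["dico"]},
]


def normalize(form):
    """Assumed normalisation: lowercase and merge j/i and v/u spellings."""
    return form.lower().replace("j", "i").replace("v", "u")


def build_index(bank):
    """Map (normalised written representation, POS) to the set of lemma URIs."""
    index = defaultdict(set)
    for entry in bank:
        for rep in entry["written_reps"]:
            index[(normalize(rep), entry["pos"])].add(entry["uri"])
    return index


def link_token(index, lemma, pos):
    """Return the sorted candidate lemma URIs for a corpus lemma and its POS.

    An empty list means no match; more than one URI means the lemma string
    is ambiguous and would need further disambiguation.
    """
    return sorted(index.get((normalize(lemma), pos), set()))


index = build_index(LEMMA_BANK)
unambiguous = link_token(index, "Sulfur", "NOUN")
ambiguous = link_token(index, "dico", "VERB")
unmatched = link_token(index, "sulfur", "VERB")
```

In this sketch a spelling variant resolves to a single entry, a homograph returns several candidates rather than a forced choice, and an unknown lemma/POS pair returns an empty list, so the three outcomes can be handled separately downstream.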