We present the first open-source tool for annotating morphosyntactic tense, mood and voice for English, French and German verbal complexes. The annotation is based on a set of language-specific rules, which are applied to dependency trees and leverage information about the lemmas, morphological properties and POS tags of the verbs. Our tool has an average accuracy of about 76%. The tense, mood and voice features are useful both as features in computational modeling and for corpus-linguistic research.
Introduction
Natural language employs tense and aspect, among other devices such as temporal adverbials, to locate situations in time and to describe their temporal structure (Deo, 2012). The tool presented here addresses the automatic annotation of morphosyntactic tense, i.e., the tense-aspect combinations expressed in the morphology and syntax of verbal complexes (VCs). VCs are sequences of verbal tokens within a verbal phrase. We address German, French and English, whose morphology and syntax also encode information on mood and voice. Morphosyntactic tenses do not always correspond to semantic tense (Deo, 2012). For example, the morphosyntactic tense of the English sentence "He is leaving at noon." is present progressive, while the semantic tense is future. In the remainder of this paper, we use the term tense to refer to the morphological tense and aspect information encoded in finite verbal complexes.
Corpus-linguistic research, as well as automatic modeling of mono- and cross-lingual use of tense, mood and voice, will profit strongly from a reliable automatic method for identifying these clausal features. They may, for instance, be used to classify texts with respect to the epoch or region in which they were produced, or to assign texts to a specific author. Moreover, in cross-lingual research, tense, mood and voice have been used to model the translation of tense between different language pairs (Santos, 2004; Loáiciga et al., 2014; Ramm and Fraser, 2016). Identifying the morphosyntactic tense is also a necessary prerequisite for identifying the semantic tense in synthetic languages such as English, French or German (Reichart and Rappoport, 2010).
The extracted tense-mood-voice (TMV) features may also be useful for training models in computational linguistics, e.g., for modeling temporal relations (Costa and Branco, 2012; UzZaman et al., 2013).
As illustrated by the examples in Figure 1, relevant information for determining TMV is given by syntactic dependencies and, in part, by the part-of-speech (POS) tags output by analyzers such as Mate (Bohnet and Nivre, 2012). However, the parser's output is not sufficient for determining TMV features; morphological features and lexical information need to be taken into account as well. Learning TMV features from an annotated corpus would be an alternative; however, to the best of our knowledge, no such large-scale corpora exist.
A sentence may contain more than one VC, and the tokens belonging to a VC are not always contiguous in the sentence (see VCs A...