The paper describes the approaches taken by the NTNU team to the SemEval 2014 Semantic Textual Similarity shared task. The solutions combine measures based on lexical soft cardinality and character n-gram feature representations with lexical distance metrics from TakeLab's baseline system. The final NTNU system is based on bagged support vector machine regression over the datasets from previous shared tasks and shows highly competitive performance, being the best system on three of the datasets and third best overall (on weighted mean over all six datasets).
The paper describes the improvement of the rule-based Constraint Grammar (CG) Oslo-Bergen Tagger (OBT) by the addition of a statistical module. It is in the nature of CG taggers to leave some words ambiguous between different readings, due to a lack of coverage by the linguistics-based rules. Such ambiguities are often a problem for applications that use the tagger, among them the Norwegian Newspaper Corpus. Our statistical module not only removes part of speech (PoS) and morphological ambiguities, but also disambiguates lemmas. We show how this new system, referred to as OBT+stat, in a straightforward manner combines the strengths of the linguistic knowledge-based CG approach with data-driven methods. The result is a high-performing, fully disambiguating PoS/morphological tagger and lemmatizer with very satisfactory evaluation results.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.