We present a comprehensive analysis of approaches for discovering links in document collections. We classify link discovery approaches with respect to the type of knowledge being used: the text of a document, its title, and already existing links. Using an evaluation dataset derived from Wikipedia, we find that link-based approaches outperform all other approaches if they can draw knowledge from a very large number of already existing links. Simulating other document collections with fewer links, we show that text-based approaches yield better results. Furthermore, we argue that knowledge from Wikipedia cannot necessarily be applied to other domains, such as corporate intranets. Thus, we conclude that text-based approaches are the best choice for reliable link discovery in arbitrary document collections.
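As a concrete illustration of the simplest purely text- and title-based strategy, here is a minimal sketch under our own assumptions (the class TitleMatchLinker and its methods are hypothetical and not one of the systems evaluated in the paper): a link is proposed wherever the title of another document occurs verbatim in a document's text.

    import java.util.*;

    // Minimal sketch of a title-based link discovery heuristic (hypothetical, for illustration only):
    // propose a link wherever the text of a document contains the title of another document.
    public class TitleMatchLinker {

        // Maps a lower-cased target document title to its document id (insertion order preserved).
        private final Map<String, String> titleIndex = new LinkedHashMap<>();

        public void addTargetDocument(String docId, String title) {
            titleIndex.put(title.toLowerCase(Locale.ROOT), docId);
        }

        // Returns the ids of all target documents whose title occurs verbatim in the given text.
        public Set<String> discoverLinks(String text) {
            String lowerText = text.toLowerCase(Locale.ROOT);
            Set<String> links = new LinkedHashSet<>();
            for (Map.Entry<String, String> e : titleIndex.entrySet()) {
                if (lowerText.contains(e.getKey())) {
                    links.add(e.getValue());
                }
            }
            return links;
        }

        public static void main(String[] args) {
            TitleMatchLinker linker = new TitleMatchLinker();
            linker.addTargetDocument("doc-42", "link discovery");
            linker.addTargetDocument("doc-7", "corporate intranet");
            System.out.println(linker.discoverLinks(
                "Link discovery in a corporate intranet differs from Wikipedia."));
            // prints [doc-42, doc-7]
        }
    }

Such a text-based heuristic needs no existing links at all, which is why it transfers to sparsely linked collections, at the cost of missing links whose anchor text differs from the target title.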
DKPro Keyphrases is a keyphrase extraction framework based on UIMA. It offers a wide range of state-of-the-art keyphrase extraction approaches. At the same time, it is a workbench for developing new extraction approaches and evaluating their impact. DKPro Keyphrases is publicly available under an open-source license. The accompanying listing fragment (the tail of an experiment-setup example; its opening lines, which declare further dimensions of the parameter space and create the candidate selection task, are not included here) wires the preprocessing, filtering, and ranking tasks into a DKPro Lab batch run:

    // Opening of the original listing truncated (further dimensions, candidate selection task).
    ParameterSpace params = new ParameterSpace(
        ...,
        Dimension.create("evalType", EvaluatorType.Lemma));

    // Create the tasks and connect each task's output to the input of its consumer.
    Task preprocessingTask = new PreprocessingTask();
    Task filteringTask = new KeyphraseFilteringTask();
    candidateSelectionTask.addImport(preprocessingTask, PreprocessingTask.OUTPUT, KeyphraseFilteringTask.INPUT);
    Task keyphraseRankingTask = new KeyphraseRankingTask();
    keyphraseRankingTask.addImport(filteringTask, KeyphraseFilteringTask.OUTPUT, KeyphraseRankingTask.INPUT);

    // Assemble the batch, attach the evaluation report, and run all parameter combinations.
    BatchTask batch = new BatchTask();
    batch.setParameterSpace(params);
    batch.addTask(preprocessingTask);
    batch.addTask(candidateSelectionTask);
    batch.addTask(keyphraseRankingTask);
    batch.addReport(KeyphraseExtractionReport.class);
    Lab.getInstance().run(batch);
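A note on the execution model assumed by this listing (general DKPro Lab behavior, not spelled out in the abstract): the ParameterSpace enumerates the combinations of its declared Dimensions, the BatchTask re-executes the wired tasks for each combination, and the registered report collects the results, which is what makes the framework usable as a workbench for comparing extraction approaches.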
A core assumption of keyphrase extraction is that a concept is more important if it is mentioned more often in a document. However, in languages like German that form long noun compounds, frequency counts can be misleading, as concepts "hidden" inside compounds are not counted. We hypothesize that decompounding before counting term frequencies may lead to better keyphrase extraction. We identify two effects of decompounding: (i) enhanced frequency counts, and (ii) more keyphrase candidates. We create two German evaluation datasets to test our hypothesis and analyze the effect of additional decompounding on keyphrase extraction.
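To make effect (i) concrete, here is a toy sketch under our own assumptions (a greedy longest-match splitter over a tiny hand-made dictionary; it is not the decompounding method or word list used in the paper): after splitting, the constituents of a compound such as "Fahrradwegnetz" also contribute to the counts of "Fahrrad" and "Weg".

    import java.util.*;

    // Toy illustration of how decompounding changes term-frequency counts for German compounds.
    public class DecompoundingCounts {

        // Tiny hand-made dictionary of constituent words (lower-cased).
        private static final Set<String> DICTIONARY = new HashSet<>(Arrays.asList(
            "fahrrad", "weg", "netz"));

        // Greedy left-to-right longest-match split; returns the word itself if no full split is found.
        static List<String> split(String word) {
            List<String> parts = new ArrayList<>();
            String rest = word.toLowerCase(Locale.GERMAN);
            while (!rest.isEmpty()) {
                String match = null;
                for (int end = rest.length(); end > 0; end--) {
                    String prefix = rest.substring(0, end);
                    if (DICTIONARY.contains(prefix)) { match = prefix; break; }
                }
                if (match == null) {
                    return Collections.singletonList(word.toLowerCase(Locale.GERMAN));
                }
                parts.add(match);
                rest = rest.substring(match.length());
            }
            return parts;
        }

        public static void main(String[] args) {
            String[] tokens = { "Fahrradwegnetz", "Fahrrad", "Weg" };
            Map<String, Integer> plain = new TreeMap<>();
            Map<String, Integer> decompounded = new TreeMap<>();
            for (String t : tokens) {
                plain.merge(t.toLowerCase(Locale.GERMAN), 1, Integer::sum);
                for (String part : split(t)) {
                    decompounded.merge(part, 1, Integer::sum);
                }
            }
            System.out.println("without decompounding: " + plain);
            // {fahrrad=1, fahrradwegnetz=1, weg=1}
            System.out.println("with decompounding:    " + decompounded);
            // {fahrrad=2, netz=1, weg=2}
        }
    }

The split constituents could likewise be added to the set of keyphrase candidates, which corresponds to effect (ii).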
In this paper, we investigate the difference between word and sense similarity measures and present means to convert a state-of-the-art word similarity measure into a sense similarity measure. To evaluate the new measure, we create a dedicated sense similarity dataset and re-rate an existing word similarity dataset using two different sense inventories, WordNet and Wikipedia. We find that word-level measures are not able to differentiate between different senses of a word, while sense-level measures actually increase correlation when shifting to sense similarities. Sense-level similarity measures improve when evaluated against the re-rated, sense-aware gold standard, while the correlation of word-level measures decreases.
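One simple way to lift a word-level measure to the sense level is sketched below under our own assumptions (the class SenseSimilarity and the exact-match word measure are placeholders, and this is not necessarily the conversion proposed in the paper): represent each sense by the content words of its gloss and average the pairwise word similarities between the two glosses.

    import java.util.*;
    import java.util.function.BiFunction;

    // Sketch of lifting a word-level similarity measure to the sense level:
    // represent each sense by its gloss words and average the pairwise word similarities.
    public class SenseSimilarity {

        // wordSim is any word-level measure returning a value in [0, 1].
        static double senseSim(Set<String> glossA, Set<String> glossB,
                               BiFunction<String, String, Double> wordSim) {
            if (glossA.isEmpty() || glossB.isEmpty()) {
                return 0.0;
            }
            double sum = 0.0;
            for (String a : glossA) {
                for (String b : glossB) {
                    sum += wordSim.apply(a, b);
                }
            }
            return sum / (glossA.size() * glossB.size());
        }

        public static void main(String[] args) {
            // Placeholder word-level measure: exact string match (stands in for a real measure).
            BiFunction<String, String, Double> wordSim = (a, b) -> a.equals(b) ? 1.0 : 0.0;

            // Two senses of "bank", each represented by a few hand-picked gloss words.
            Set<String> bankFinance = new HashSet<>(Arrays.asList("money", "institution", "deposit"));
            Set<String> bankRiver   = new HashSet<>(Arrays.asList("river", "slope", "land"));
            Set<String> lender      = new HashSet<>(Arrays.asList("money", "institution", "loan"));

            System.out.printf("bank(finance) vs lender: %.2f%n", senseSim(bankFinance, lender, wordSim));
            System.out.printf("bank(river)   vs lender: %.2f%n", senseSim(bankRiver, lender, wordSim));
            // prints 0.22 and 0.00
        }
    }

Because the two senses of "bank" get different gloss representations, the lifted measure can assign them different similarities to "lender", which a purely word-level measure cannot do.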