One of the primary outputs of the scientific enterprise is data, but many institutions, such as libraries, that are charged with preserving and disseminating scholarly output have largely ignored this form of documentation of scholarly activity. This paper focuses on a particularly troublesome class of data, termed "dark data". Dark data is not carefully indexed and stored, so it becomes nearly invisible to scientists and other potential users and is therefore more likely to remain underutilized and eventually be lost. The paper discusses how concepts from long-tail economics can be used to understand potential solutions for better curation of this data. It describes why this data is critical to scientific progress, some of its properties, and some of the social and technical barriers to managing it properly. Many potentially useful institutional, social, and technical solutions are under development and are introduced in the last sections of the paper, but these solutions are largely unproven and require additional research and development.
To automatically convert legacy taxonomic descriptions into Extensible Markup Language (XML) format, the authors designed a machine-learning-based approach. In this project, three corpora of taxonomic descriptions were selected to test the hypothesis that domain knowledge and conventions automatically induced from semistructured corpora (i.e., base corpora) are useful for improving markup performance on other, less structured and quite different corpora (i.e., evaluation corpora). The "structuredness" of the three corpora was carefully measured. Based on the structuredness measures, two of the corpora were used as base corpora and one as the evaluation corpus. Three series of experiments were carried out with MARTT (markuper of taxonomic treatments), a system the authors developed, to evaluate the effectiveness of different methods of using the n-gram semantic class association rules, the element relative position probabilities, and a combination of the two types of knowledge mined from the automatically marked-up base corpora. The experimental results showed that the knowledge induced from the base corpora was more reliable than that learned from the training examples alone, and that the n-gram semantic class association rules were effective in improving markup performance, especially on elements with sparse training examples. The authors also identify a number of challenges for any automatic markup system using taxonomic descriptions.
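The idea of mining n-gram association rules from a marked-up base corpus can be illustrated with a minimal sketch. This is not the MARTT implementation: the abstract does not give the algorithm, and the sketch below simplifies "semantic class" rules to plain word n-grams associated with element labels. The function names (`ngrams`, `mine_rules`), the support and confidence thresholds, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mine_rules(labeled_segments, max_n=2, min_support=2, min_confidence=0.8):
    """Induce n-gram -> element-label rules from (text, label) pairs.

    A rule is kept when the n-gram occurs with its most frequent label in
    at least `min_support` segments and that label accounts for at least
    `min_confidence` of the segments containing the n-gram.
    (Hypothetical thresholds, chosen only for this example.)
    """
    gram_counts = Counter()
    gram_label_counts = defaultdict(Counter)
    for text, label in labeled_segments:
        tokens = text.lower().split()
        # Count each n-gram at most once per segment.
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        for gram in seen:
            gram_counts[gram] += 1
            gram_label_counts[gram][label] += 1

    rules = {}
    for gram, total in gram_counts.items():
        label, hits = gram_label_counts[gram].most_common(1)[0]
        confidence = hits / total
        if hits >= min_support and confidence >= min_confidence:
            rules[gram] = (label, confidence)
    return rules


# Toy "base corpus": description segments already marked up with elements.
base_corpus = [
    ("leaves ovate to lanceolate", "leaf"),
    ("leaves alternate , petiolate", "leaf"),
    ("flowers white , fragrant", "flower"),
    ("flowers solitary or paired", "flower"),
    ("stems erect , branched", "stem"),
]

rules = mine_rules(base_corpus)
for gram, (label, confidence) in sorted(rules.items()):
    print(" ".join(gram), "->", label, round(confidence, 2))
```

In this toy corpus the comma occurs under three different labels, so it never reaches the confidence threshold and yields no rule, while "leaves" and "flowers" each map reliably to one element. Rules of this kind could then supplement a classifier on elements that have few training examples in the evaluation corpus, which is the effect the abstract reports.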
In this paper, we explore an approach to making better use of semistructured documents in information retrieval in the domain of biology. Using machine learning techniques, we make the inherent structures of these documents explicit through XML markup. This markup has great potential to improve task performance in specimen identification and the usability of online floras and faunas.