CERMINE is a comprehensive open-source system for extracting structured metadata from scientific articles in a born-digital form. The system is based on a modular workflow, whose loosely coupled architecture allows for individual component evaluation and adjustment, enables effortless improvements and replacements of independent parts of the algorithm and facilitates future architecture expanding. The implementations of most steps are based on supervised and unsupervised machine learning techniques, which simplifies the procedure of adapting the system to new document layouts and styles. The evaluation of the extraction workflow carried out with the use of a large dataset showed good performance for most metadata types, with the average F score of 77.5 %. CERMINE system is available under an open-source licence and can be accessed at http:// cermine.ceon.pl. In this paper, we outline the overall workflow architecture and provide details about individual steps implementations. We also thoroughly compare CERMINE to similar solutions, describe evaluation methodology and finally report its results.
B Dominika Tkaczyk
Directory of open access repositories:http://www.opendoar.org
Registry of open access repositories: http://roar.eprints.orgRegistry of open access repository material archiving policies:
We present a comprehensive system for extracting metadata from scholarly articles. In our approach the entire document is inspected, including headers and footers of all the pages as well as bibliographic references. The system is based on a modular workflow which allows for evaluation, unit testing and replacement of individual components. The workflow is optimized towards processing of born-digital documents, but may accept scanned document images as well. The machinelearning approaches we have chosen for solving individual tasks increase the ability to adapt to new document layouts and formats. The evaluation tests we have performed showed good results of the individual implementations and the entire metadata extraction process.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.