2015
DOI: 10.1007/s10032-015-0249-8

CERMINE: automatic extraction of structured metadata from scientific literature

Abstract: CERMINE is a comprehensive open-source system for extracting structured metadata from scientific articles in a born-digital form. The system is based on a modular workflow, whose loosely coupled architecture allows for individual component evaluation and adjustment, enables effortless improvements and replacements of independent parts of the algorithm and facilitates future architecture expanding. The implementations of most steps are based on supervised and unsupervised machine learning techniques, which simp…

Cited by 150 publications (144 citation statements)
References: 26 publications
“…For this study, we processed documents in two formats: PDF (confirmed cases of AP) and LaTeX source code (NTCIR-11 MathIR Task dataset). We used GROBID to obtain bibliographic references from documents in both formats because the tool achieved excellent results for extracting header metadata, citations, and references [4,42]. Since GROBID cannot recognize mathematical formulae, we semi-automatically invoked InftyReader 6 to convert the PDFs for confirmed cases of AP to an intermediate LaTeX format.…”
Section: Preprocessing (citation type: mentioning)
confidence: 99%
“…In a recent survey and evaluation of several non-commercial reference parsing tools, Tkaczyk et al (2018) found that the three best-performing ones all use a CRF approach: GROBID (Lopez, 2009), CERMINE (Tkaczyk et al, 2015) and ParsCit (Councill et al, 2008). All three benefit from task-specific tuning using extra annotated data, with GROBID showing the best off-the-shelf results.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…The large majority of full-texts collected by the system are PDF files, a format well suited for printing and human reading, but less tractable by machines. For this reason, the full-text collection workflow includes a final phase designed to automatically extract structured metadata from such PDF files using CERMINE [2]. The extracted fulltexts are then stored in dedicated caches that are accessible by the OpenAIRE Information Inference System.…”
Section: Full-text Aggregation (citation type: mentioning)
confidence: 99%