A Text Extraction Software Benchmark Based on a Synthesized Dataset

Duretec, Kresimir; Rauber, Andreas; Becker, Christoph

doi:10.1109/jcdl.2017.7991565

Cited by 3 publications

(1 citation statement)

References 21 publications


“…Of these three tools, Grobid performed better, which is also evident from our experimental results. The study by Duretec et al [28] presented the evaluation of Tika, DocToText, and Xpdf tools. Among these tools, Tika achieved 58% accuracy in extracting text from PDF documents, in orderly extraction, which is close to our experimental result.…”

Section: Background Study (mentioning)

confidence: 99%

Sentence Boundary Extraction from Scientific Literature of Electric Double Layer Capacitor Domain: Tools and Techniques

et al. 2022


Given the growth of scientific literature on the web, particularly in materials science, acquiring data precisely from the literature has become more important. Materials information systems, or chemical information systems, play an essential role in discovering data, materials, or synthesis processes from the existing scientific literature. Processing and understanding the natural language of scientific literature is the backbone of these systems, which depend heavily on appropriate textual content, meaning complete, meaningful sentences drawn from a large body of text. Detecting where a sentence begins and ends and extracting it correctly is called sentence boundary extraction. Accurate extraction of sentence boundaries from PDF documents is essential for readability and natural language processing. This study therefore provides a comparative analysis of tools for extracting text from PDF documents that are available as Python libraries or packages and are widely used by the research community. The main objective is to find the technique that most reliably extracts correct sentences from PDF files as text. The performance of PyPDF2, pdfminer.six, PyMuPDF, pdftotext, Tika, and Grobid is reported in terms of precision, recall, F1 score, run time, and memory consumption. The natural language processing (NLP) tools NLTK, spaCy, and Gensim are used to identify sentence boundaries. Of all the techniques studied, Grobid combined with spaCy achieved the highest F1 score, 93%, and consumed the least memory, 46.13 MB.
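The abstract above reports precision, recall, and F1 for extracted sentences but does not say how an extracted sentence is matched against the reference. The following is a minimal sketch of one plausible scoring scheme, assuming exact matching after whitespace normalization; the function name `sentence_prf` and the matching rule are illustrative assumptions, not the cited study's method.

```python
from collections import Counter


def sentence_prf(extracted, reference):
    """Score extracted sentences against reference sentences.

    Assumption: a sentence counts as correct only if it matches a
    reference sentence exactly once runs of whitespace are collapsed.
    """
    def normalize(sentence):
        return " ".join(sentence.split())

    extracted_counts = Counter(normalize(s) for s in extracted)
    reference_counts = Counter(normalize(s) for s in reference)

    # Multiset intersection: each reference sentence can be matched once.
    true_positives = sum((extracted_counts & reference_counts).values())
    n_extracted = sum(extracted_counts.values())
    n_reference = sum(reference_counts.values())

    precision = true_positives / n_extracted if n_extracted else 0.0
    recall = true_positives / n_reference if n_reference else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical data: the second extracted item merges two reference
# sentences, which is a typical sentence boundary error.
reference = ["A b.", "C d.", "E f."]
extracted = ["A  b.", "C d. E f."]
precision, recall, f1 = sentence_prf(extracted, reference)
```

Under this rule a merged or split sentence earns no partial credit, so the scores are strict; a token-level or boundary-offset comparison would be a reasonable alternative.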



An automated Psychometric Analyzer based on Sentiment Analysis and Emotion Recognition for healthcare

Vij, Pruthi

2018

Procedia Computer Science


Situation-Oriented Databases: Processing Office Documents

Mironov, Gusarenko, Yusupova

2022

Modeling, Optimization and Information Technology (Моделирование, оптимизация и информационные технологии)


This article discusses the application of a situation-oriented approach to the problem of extracting semantic information from office documents. Office documents created by vector graphics editors and word processors are reviewed. Semantic information can be extracted because such documents are based on open XML formats that external programs can process. The article considers document processing based on a situation-oriented database into which Word documents are programmatically loaded as XML files extracted from ZIP archives. In the situation-oriented database, an office document can be presented as a virtual document that is mapped both onto XML files and onto the ZIP archive containing them. This applies not only to text documents but also to graphic documents that have an internal XML representation, which enables processing of documents in Office Open XML and Open Document Format. The article discusses various aspects of identifying and finding the necessary information during document processing by means of standard constructs such as bookmarks, key phrases, and text labels. Models and algorithms for extracting the required information are examined. Examples of the practical use of this approach in distance learning for university students are given. In addition, an example of extracting metadata of scientific publications from the Open Journal Systems publishing system is considered.

[Translated from the Russian abstract:] The article considers an approach to building document-oriented web applications based on situation-oriented databases. Applications built on situation-oriented databases address the problems of extracting and processing semantic information from office documents. Earlier studies dealt with filling in office documents; this study considers methods for extracting information from graphic documents and text documents created in common office suites. Creating and applying such methods is made possible by the XML-based internal representation of office documents and the ability to process that content programmatically. The processing of XML files in situation-oriented databases is considered, where Word documents are programmatically loaded as XML files extracted from ZIP archives. Once loaded, the documents can be represented as virtual documents, or as a set of such documents combined into a virtual data array and mapped onto real XML data or ZIP archives containing XML files. The methods developed and applied work for both graphic and text documents. The article also considers methods for finding and identifying the required data fragments inside a document during processing, based on standard descriptions in bookmarks, key phrases, and text labels. Models and algorithms for extracting the required information are discussed and demonstrated with practical examples involving a system for remote completion of course projects by students. In addition to the examples from the educational process, the extraction of metadata of scientific publications from the international publishing system Open Journal Systems is considered.
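The abstract above describes loading Word documents as XML files extracted from ZIP archives and locating data by bookmarks. The sketch below illustrates that general idea with Python's standard library only; it is not the situation-oriented database mechanism the article presents. The helper names `read_bookmarks` and `make_sample_docx`, the bookmark name `student_name`, and the sample text are all hypothetical, and the sample archive contains only `word/document.xml`, so it is not a complete, openable Word file.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by Office Open XML text documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_bookmarks(docx_bytes):
    """Map each bookmark name to the text of the paragraph holding it.

    Simplification (assumption): a bookmark is treated as covering its
    whole paragraph; real bookmarks may span less or more than that.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    found = {}
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        for mark in paragraph.iter(f"{{{W_NS}}}bookmarkStart"):
            found[mark.get(f"{{{W_NS}}}name")] = text
    return found


def make_sample_docx():
    """Build a tiny in-memory archive with one bookmarked paragraph."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
        '<w:bookmarkStart w:id="0" w:name="student_name"/>'
        "<w:r><w:t>Jane Doe</w:t></w:r>"
        '<w:bookmarkEnd w:id="0"/>'
        "</w:p></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


bookmarks = read_bookmarks(make_sample_docx())
```

Reading the archive member directly, rather than unpacking to disk, loosely mirrors the idea of mapping a virtual document onto the ZIP archive and its XML content.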
