Enhancing document structure analysis using visual analytics

Stoffel, Andreas; Spretke, David; Kinnemann, Henrik; Keim, Daniel A.

doi:10.1145/1774088.1774091

Cited by 19 publications

(16 citation statements)

References 15 publications

“…Andreas Stoffel, from the Department of Computer and Information Science, University of Konstanz, Germany, participated with a trainable system [9], [10] for the analysis of PDF documents based on the PDFBox library. After initial column and reading-order detection, logical classification is performed on the line level.…”

Section: E Stoffel's System (mentioning)

confidence: 99%

ICDAR 2013 Table Competition

Göbel, Hassan, Oro, et al. 2013

2013 12th International Conference on Document Analysis and Recognition

Table understanding is a well studied problem in document analysis, and many academic and commercial approaches have been developed to recognize tables in several document formats, including plain text, scanned page images and born-digital, object-based formats such as PDF. Despite the abundance of these techniques, an objective comparison of their performance is still missing. The Table Competition held in the context of ICDAR 2013 is our first attempt at objectively evaluating these techniques against each other in a standardized way, across several input formats. The competition independently addresses three problems: (i) table location, (ii) table structure recognition, and (iii) these two tasks combined. We received results from seven academic systems, which we have also compared against four commercial products. This paper presents our findings.

“…The Information retrieval system was integrated with a text editor in order to find similar documents. They analyzed the extraction of logical structure from different text document formats [4], [5] and also from source code documents. The second area was the extraction of semantically coherent blocks of text from documents [6].…”

Section: Related Work (mentioning)

confidence: 99%

A Novel Approach for Document Retrieval System with User Preferences

Kaur, Bhatla 2014

IJCA

This paper proposes a method for document retrieval systems. A document retrieval system finds information matching given criteria by comparing text records (documents) against user queries. The results generated by an information retrieval system must reflect user preferences. Each user has their own perspective and cultural context for each word, or may be searching for a highly specific, focused topic. Probabilistic ranking based on graphical Bayesian statistics is combined with the Kuhn-Munkres algorithm to group similar documents. The probabilistic-ranking-based Kuhn-Munkres algorithm uses a graphical model, namely Bayesian statistics with Bayes' theorem, to estimate the probability of documents and return more relevant results.

“…Partitioning scholarly documents comes under a wide research problem known as logical structure extraction (LSE) of semistructured documents, and is not the focus of this work. Fortunately, there are efficient LSE solutions addressed in recent literature (Burget, 2007; Luong et al, 2010; Ratté et al, 2007; Stoffel et al, 2010). In this work, we employ Luong's LSE (Luong et al, 2010) developed by the National University of Singapore (NUS), and available for free use or adaptation within other tools under the Lesser GNU Public License (LGPL)…”

Section: Related Work (mentioning)

confidence: 99%

“…Structural components are subject to interpretation by the reader, but also can be identified automatically using LSE methods (Burget, 2007; Luong et al, 2010; Ratté et al, 2007; Stoffel et al, 2010). In this work, we use SectLabel tool (Luong et al, 2010) to extract the logical structure of scientific publications.…”

Section: Segmentation Of Scientific Publications (mentioning)

confidence: 99%

“…These segments of scholarly documents, commonly referred to in the literature as logical structure, can be extracted (Anjewierden, 2001; Bounhas & Slimani, 2010; Burget, 2007; Councill, Giles, & Kan, 2008; Hagen, Harald, Ngen, & Petra Saskia, 2004; K.H. Lee, Choy, & Cho, 2003; Li & Ng, 2004; Luong, Nguyen, & Kan, 2010; Nguyen & Luong, 2010; Ratté, Njomgue, & Ménard, 2007; Stoffel, Spretke, Kinnemann, & Keim, 2010; Wang, Jin, Wang, Wang, & Gao, 2005; Witt et al, 2010; K. Zhang, Wu, & Li, 2006), and can be used to improve document indexing (Bounhas & Slimani, 2010), to represent the semantic content of scientific publications (Luong, Nguyen, & Kan, 2010; Ratté et al, 2007), to extract key phrases and terminologies (Bounhas & Slimani, 2010; Nguyen & Luong, 2010), and to improve document summarization (Teufel & Moens, 2002). To improve indexing of semistructured documents, for instance, a method of terms weighting is applied according to their structural occurrences (or their positions in different segments of the document), instead of using the whole document as in flat weighting methods (Bounhas & Slimani, 2010; de Moura, Fernandes, Ribeiro‐Neto, da Silva, & Gonçalves, 2010).…”

Section: Introduction (mentioning)

confidence: 99%

Using structural information and citation evidence to detect significant plagiarism cases in scientific publications

Alzahrani, Palade, Salim, et al. 2011

J. Am. Soc. Inf. Sci.

In plagiarism detection (PD) systems, two important problems should be considered: the problem of retrieving candidate documents that are globally similar to a document q under investigation, and the problem of side-by-side comparison of q and its candidates to pinpoint plagiarized fragments in detail. In this article, the authors investigate the usage of structural information of scientific publications in both problems, and the consideration of citation evidence in the second problem. Three statistical measures namely Inverse Generic Class Frequency, Spread, and Depth are introduced to assign a degree of importance (i.e., weight) to structural components in scientific articles. A term-weighting scheme is adjusted to incorporate component-weight factors, which is used to improve the retrieval of potential sources of plagiarism. A plagiarism screening process is applied based on a measure of resemblance, in which component-weight factors are exploited to ignore less or nonsignificant plagiarism cases. Using the notion of citation evidence, parts with proper citation evidence are excluded, and remaining cases are suspected and used to calculate the similarity index. The authors compare their approach to two flat-based baselines, TF-IDF weighting with a Cosine coefficient, and shingling with a Jaccard coefficient. In both baselines, they use different comparison units with overlapping measures for plagiarism screening. They conducted extensive experiments using a dataset of 15,412 documents divided into 8,657 source publications and 6,755 suspicious queries, which included 18,147 plagiarism cases inserted automatically. Component-weight factors are assessed using precision, recall, and F-measure averaged over a 10-fold cross-validation and compared using the ANOVA statistical test.
Results from structural-based candidate retrieval and plagiarism detection are evaluated statistically against the flat baselines using paired t-tests on 10-fold cross-validation runs, which demonstrate the efficacy achieved by the proposed framework. An empirical study on the system's response shows that structural information, unlike existing plagiarism detectors, helps to flag significant plagiarism cases, improve the similarity index, and provide human-like plagiarism screening results.
