Abstract: Can a high-performance document image recognition system be built without detailed knowledge of the application? Having benefited from the statistical machine-learning revolution of the last twenty years, our architectures rely less on hand-crafted special-case rules and more on models trained on labeled-sample data sets. But urgent questions remain. When we can't collect (and label) enough real training data, does it help to complement them with data synthesized using generative models? Is it ever completely safe to rely on synthetic data? If we can't manage to train (or craft) a single complete, near-perfect, application-specific "strong" model to drive recognition, can we make progress by combining several imperfect or incomplete "weak" models? Can recognition carried out jointly over weak models perform optimally while still running fast? Can a recognizer automatically pick a strong model of its input? Must we always pre-train models for every kind ("style") of input expected, or can a recognizer adapt to unknown styles? Can weak models adapt autonomously, growing stronger and driving accuracy higher, without any human intervention? Can one model "criticize", and then proceed to correct, other models, even while it is being criticized and corrected in turn by them? After twenty-five years of research on these questions we have partial answers, many in the affirmative: in addition to promising laboratory demonstrations, we can take pride in successful applications. I'll illustrate the evolution of the state of the art with concrete examples, and point out open problems. (Based on work by and with T. Pavlidis, T.