2006
DOI: 10.1007/11669487_38

Digitizing a Million Books: Challenges for Document Analysis

Abstract: This paper describes the challenges for the document image analysis community in building large digital libraries with diverse document categories. The challenges are identified from the experience of the ongoing activities toward digitizing and archiving one million books. A smooth workflow has been established for archiving a large quantity of books, with the help of efficient image processing algorithms. However, much more research is needed to address the challenges arising out of the diversity of the content in …

Cited by 30 publications
(18 citation statements)
References 6 publications
“…The consequent shift from paper to digital support for documents provided new opportunities for their management and processing activities, solving the problems of duplication and sharing that seriously affected legacy (paper) documents. However, making document production easy and cheap, it also introduced new problems, consisting in a huge amount of available documents (and in an associated decrease of their content quality) (Sankar, 2006). The well-known information overload problem, indeed, consists of the users' difficulty in accessing interesting information in large and heterogeneous repositories.…”
Section: The Document Processing Domain (mentioning)
confidence: 99%
“…Scanned document images contain a large number of artifacts, which are cleaned on a large scale using a semi-automatic process [3], by using various image processing operations. Owing to the variation in quality across the images, a single setup of image processing parameters would not be suitable for all.…”
Section: Issues In Scanning (mentioning)
confidence: 99%
“…These include the Universal Digital Library (UDL) [1], Digital Library of India (DLI) [2], Google Books, etc. [3]. Much effort is being put into the digitisation of massive quantities of documents.…”
Section: Introduction (mentioning)
confidence: 99%
“…These include Google Books, the Universal Digital Library (UDL) [1] and the Digital Library of India (DLI) [2,25]. A large percentage of these collections are in non-Latin scripts.…”
Section: Introduction (mentioning)
confidence: 99%