Libraries, private and public, offer valuable resources to library patrons. As of today, the only way to locate information archived exclusively in libraries is through their catalogs. Library patrons, however, often find it difficult to formulate a proper query, which requires using specific keywords assigned to different fields of the desired library catalog records, to obtain relevant results. Improperly formulated queries often yield irrelevant results or no results at all. This negative experience with existing library systems turns library patrons away from querying library catalogs directly; instead, they first perform their searches with Web search engines and, upon obtaining initial information (e.g., titles, subject headings, or authors) on the desired library materials, they then query library catalogs. This search strategy is evidence of the failure of today's library systems. To solve this problem, we propose an enhanced library system that allows partial similarity matching of (a) tags defined by ordinary users at a folksonomy site that describe the content of books and (b) unrestricted keywords specified by an ordinary library patron in a query to search for relevant library catalog records. The proposed library system allows patrons to post a query Q using commonly used words and ranks the retrieved results according to their degrees of resemblance to Q, while keeping the query processing time comparable to that achieved by current library search engines.
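The idea of ranking catalog records by how closely their folksonomy tags resemble the words of a query Q can be illustrated with a minimal sketch. It is not the system proposed above: the similarity measure (Python's `difflib.SequenceMatcher`), the averaging scheme, the function names, and the sample records are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Stand-in partial similarity in [0, 1] (assumption: the abstract
    does not specify the measure actually used)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_score(query_words: list[str], tags: list[str]) -> float:
    """Average, over the query words, of the best-matching tag similarity."""
    if not query_words or not tags:
        return 0.0
    best_matches = [max(word_similarity(q, t) for t in tags) for q in query_words]
    return sum(best_matches) / len(query_words)


def rank_records(query: str, records: dict[str, list[str]]) -> list[tuple[float, str]]:
    """Return (score, title) pairs sorted by decreasing resemblance to the query."""
    words = query.split()
    scored = [(record_score(words, tags), title) for title, tags in records.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical catalog records with user-assigned tags.
records = {
    "Book A": ["wizards", "magic", "boarding-school"],
    "Book B": ["cooking", "recipes", "baking"],
}

ranking = rank_records("wizard school", records)
for score, title in ranking:
    print(f"{score:.3f}  {title}")
```

Because the match is partial, the query word "wizard" still scores highly against the tag "wizards", so a patron does not need to know the exact vocabulary stored in the catalog.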
As the digitization of historical documents, such as newspapers, becomes more common, archive patrons' need for accurate digital text from those documents increases. Building on our earlier work, the contributions of this paper are: 1. demonstrating the applicability of novel methods for correcting optical character recognition (OCR) errors on disparate data sets, including a new synthetic training set, 2. enhancing the correction algorithm with novel features, and 3. assessing the data requirements of the correction learning method. First, we correct errors using conditional random fields (CRF) trained on synthetic training data sets in order to demonstrate the applicability of the methodology to unrelated test sets. Second, we show the strength of lexical features from the training sets on two unrelated test sets, yielding a relative reduction in word error rate (WER) on the test sets of 6.52%. New features capture the recurrence of hypothesis tokens and yield an additional relative reduction in WER of 2.30%. Further, we show that only 2.0% of the full training corpus of over 500,000 feature cases is needed to achieve correction results comparable to those using the entire training corpus, effectively reducing the complexity of both the training process and the learned correction model.
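For readers unfamiliar with the metric, the following minimal sketch shows how a word error rate and a relative reduction in it are commonly computed. It uses the standard word-level edit distance definition of WER; the function names and the sample strings are hypothetical and are not taken from the paper's data or evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error rate removed by the improved system."""
    return (baseline - improved) / baseline


# Illustrative strings only (assumed, not from the paper's test sets).
truth = "the quick brown fox"
raw_ocr = "the quikc brown fax"
corrected = "the quick brown fax"

wer_raw = word_error_rate(truth, raw_ocr)
wer_corrected = word_error_rate(truth, corrected)
reduction = relative_reduction(wer_raw, wer_corrected)
print(f"WER before: {wer_raw:.2f}, after: {wer_corrected:.2f}, "
      f"relative reduction: {reduction:.0%}")
```

A relative reduction is measured against the baseline error rate rather than in absolute percentage points, which is how figures such as 6.52% and 2.30% above should be read.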