Xi S. Guo scite author profile

This paper describes using RDF/RDFS/XML to create and navigate a metadata model of relationships among entities in text. The metadata we create is roughly an order of magnitude smaller than the content being modeled, it provides the end-user with context sensitive information about the hyperlinked entities in focus. These entities at the core of the model are originally found and resolved using a combination of information extraction and record linkage techniques. The RDF/RDFS metadata model is then used to "look ahead" and navigate to related information. An RDF aware frontend web application streamlines the presentation of information to the end user.

show abstract

Online duplicate document detection

Conrad

Guo

Schriber

2003

View full text Add to dashboard Cite

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close matches. Our goal in this work is to investigate the phenomenon and determine one or more approaches that minimize its impact on search results. Recent work has focused on using some form of signature to characterize a document in order to reduce the complexity of document comparisons. A representative technique constructs a 'fingerprint' of the rarest or richest features in a document using collection statistics as criteria for feature selection. One of the challenges of this approach, however, arises from the fact that in production environments, collections of documents are always changing, with new documents, or new versions of documents, arriving frequently, and other documents periodically removed. When an enterprise proceeds to freeze a training collection in order to stabilize the underlying repository of such features and its associated collection statistics, issues of coverage and completeness arise. We show that even with very large training collections possessing extremely high feature correlations before and after updates, underlying fingerprints remain sensitive to subtle changes. We explore alternative solutions that benefit from the development of massive meta-collections made up of sizable components from multiple domains. This technique appears to offer a practical foundation for fingerprint stability. We also consider mechanisms for updating training collections while mitigating signature instability.Our research is divided into three parts. We begin with a study of the distribution of duplicate types in two broadranging news collections consisting of approximately 50 million documents. We then examine the utility of document signatures in addressing identical or nearly identical duplicate documents and their sensitivity to collection updates. Finally, we investigate a flexible method of characterizing and comparing documents in order to permit the identification of non-identical duplicates. This method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Xi S. Guo

Database Selection Using Actual Physical and Acquired Logical Collection Resources in a Massive Domain-specific Operational Environment

Creation of an expert witness database through text mining

Fast tagging of medical terms in legal text

A web application using RDF/RDFS for metadata navigation

Online duplicate document detection

Contact Info

Product

Resources

About