The Web is a huge source of information, and one of the main problems facing users is finding documents which correspond to their requirements. Apart from the problem of thematic relevance, the documents retrieved by search engines do not always meet users' expectations: a document may be too general, or conversely too specialized, or of a different type from what the user is looking for, and so forth. We think that adding metadata to pages can considerably improve the process of searching for information on the Web. This article presents a possible typology for Web sites and pages, as well as a method for propagating metadata values, based on the study of the Web graph and more specifically on cocitation in this graph.

Introduction

The role of the search engines available on the Web is to retrieve, in the minimum amount of time, the most relevant pages on a given subject. They use traditional information-retrieval techniques, particularly for representing documents and queries and for matching the two. The aim is twofold: to find relevant Web pages and then rank them according to relevance. Search engines come up against two major difficulties. The first, which is well known when searching with uncontrolled vocabulary as is the case with full-text search, concerns language-based issues such as synonymy and polysemy, which lead to either noise or silence. The second is directly related to the heterogeneous nature of the Web. In contrast to databases built on homogeneous corpora, that is, sets of selected documents assembled by the same authority and sharing common properties (collections of scientific articles, patents, etc.), the Web is a forum of free expression that develops in an anarchic manner. It is disorganized and contains resources that are totally heterogeneous in language, subject, level, type, target audience, and the like.
In such a world, quite apart from the problem of thematic relevance, it is difficult to find resources which correspond to the need (Gravano, 2000). Take the example of a Spanish student and a Spanish researcher, both of whom are looking for information on nuclear physics. The first will look for papers in Spanish at a fairly basic level, while the second will look for scientific articles, probably written in English, and possibly also calls for papers or other documents relating to his or her scientific activity.

Along with many others, we think that the use of metadata could greatly improve information retrieval on the Web (Marchiori, 1998). We are aware that we cannot count on all resource authors to correctly assign the proper metadata values, because this requires time, skill, and objectivity. To obtain a uniform and systematic description of resources, assigning metadata values should be the work of an information retrieval system, done in the same way as documentation professionals carry out cataloging and indexing tasks. Because the manual application of metadata values is very costly, and given the ...
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Given the great heterogeneity of the World Wide Web, using metadata on the search-engine side seems a promising track for information retrieval. However, because manual qualification at Web scale is not feasible, this track is little followed. We propose a semi-automatic method for propagating metadata. In a first step, homogeneous corpora are extracted. In our study we used the following properties: the authority type, the site type, the information type, and the page type. This first step is realized by a clustering method that uses a similarity measure based on the cocitation frequency between pages. Given the cluster hierarchy, the second step selects a reduced number of documents to be manually qualified and propagates the assigned metadata values to the other documents belonging to the same cluster. A qualitative evaluation and a preliminary study of the scalability of this method are presented.

Context

None of the available search engines seems to take the heterogeneity of Web resources into account. All of them are based on a "semantic" representation of the documents, just like traditional information retrieval systems: they represent only the subject and no other aspect. However, contrary to traditional document databases, the Web is an uncontrolled information repository, so the retrieved resources are heterogeneous from many points of view: their subject of course, but also their type, their language, their level, their target audience, etc. Users, who have many needs and many expectations, are therefore not always satisfied by the result sets returned by search engines.

We think that metadata are, on the Web as in the library world, a means to address this heterogeneity problem. The HTML standard allows internal metadata to be embedded in Web pages with the <meta> tag, though this tag is little used because it is not well known by authors.
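The cocitation-based similarity underlying the first step can be sketched as follows. The article does not specify its exact normalisation, so the Jaccard-style ratio below (citing pages shared by both documents, over pages citing either) is an illustrative assumption, as are the helper names:

```python
from collections import defaultdict

def build_cited_by(links):
    """Invert a link graph: map each target page to the set of pages citing it.
    `links` maps each citing page to the set of pages it links to."""
    cited_by = defaultdict(set)
    for source, targets in links.items():
        for target in targets:
            cited_by[target].add(source)
    return cited_by

def cocitation_similarity(cited_by, a, b):
    """Normalised cocitation frequency between pages a and b: the share of
    citing pages that cite both, among those citing at least one (a
    Jaccard-style ratio; the original normalisation is not specified)."""
    citers_a = cited_by.get(a, set())
    citers_b = cited_by.get(b, set())
    union = citers_a | citers_b
    return len(citers_a & citers_b) / len(union) if union else 0.0

# Toy link graph: s1 and s2 both link to p1 and p2, so p1 and p2 are cocited.
links = {
    "s1": {"p1", "p2"},
    "s2": {"p1", "p2"},
    "s3": {"p2", "p3"},
}
cited_by = build_cited_by(links)
print(cocitation_similarity(cited_by, "p1", "p2"))  # 2 common citers / 3 total
```

Pairs with high similarity are then grouped by a hierarchical clustering step, yielding the cluster hierarchy the second step works on.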
On the other hand, these metadata are often misused, either through lack of practice or objectivity on the part of honest authors, or diverted from their intended use to gain better visibility on the Web by those who master them. That is why most search engines disregard their content in their algorithms. To obtain a systematic and uniform qualification of documents, we think that metadata values should be assigned on the search-engine side (external metadata). However, manual qualification of the whole Web seems impossible, because the number of documents to qualify would make the cost far too large. Only automatic or semi-automatic methods are therefore conceivable.

Our qualification method

Our semi-automatic method for characterizing Web pages consists of two steps. In the first step, homogeneous corpora are extracted. This step is fully automatic: it consists of a clustering method that uses a similarity measure based on the co-citation frequency between pages. Given the cluster hierarchy, the second step selects a reduced number of documents to be manually qualified and propagates the given metadata...
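The propagation in the second step can be sketched as below, under stated assumptions: the cluster and label data structures are our own illustrative choices, and the metadata fields (`page_type`, `site_type`) are drawn from the property list given earlier, not from any schema the article defines:

```python
def propagate_metadata(clusters, manual_labels):
    """Spread manually assigned metadata to every page of the same cluster.

    `clusters` maps a cluster id to the list of its member pages.
    `manual_labels` maps each manually qualified page (the cluster's
    representative(s)) to a metadata dict such as
    {"page_type": "scientific article", "site_type": "laboratory"}.
    Pages in clusters with no qualified member get an empty dict.
    """
    page_metadata = {}
    for pages in clusters.values():
        # Merge the labels of every manually qualified page in this cluster.
        merged = {}
        for page in pages:
            merged.update(manual_labels.get(page, {}))
        # Propagate the merged values to all members of the cluster.
        for page in pages:
            page_metadata[page] = dict(merged)
    return page_metadata

clusters = {0: ["p1", "p2"], 1: ["p3"]}
labels = {"p1": {"page_type": "scientific article"}}
metadata = propagate_metadata(clusters, labels)
print(metadata["p2"])  # inherits p1's metadata values
```

The cost saving comes from the ratio between clusters and pages: only one or a few representatives per cluster need human qualification, while every member page receives the values.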