Given the large heterogeneity of the World Wide Web, using metadata on the search-engine side seems a promising approach for information retrieval. However, because manual qualification at Web scale is not feasible, this approach is rarely followed. We propose a semi-automatic method for propagating metadata. In a first step, homogeneous corpora are extracted. In our study we used the following properties: the authority type, the site type, the information type, and the page type. This first step is realized by a clustering that uses a similarity measure based on the co-citation frequency between pages. Given the cluster hierarchy, the second step selects a reduced number of documents to be manually qualified and propagates the given metadata values to the other documents belonging to the same cluster. A qualitative evaluation and a preliminary study of the scalability of this method are presented.
Context

None of the available search engines seems to take into account the heterogeneity of Web resources. All of them are based on a "semantic" representation of the documents, just like traditional information retrieval systems: they represent only the subject and no other aspect. However, contrary to traditional document databases, the Web is an uncontrolled information repository. The retrieved resources are therefore heterogeneous from many points of view: their subject of course, but also their type, their language, their level, their intended audience, etc. Consequently, users, who have many different needs and expectations, are not always satisfied by the result sets returned by search engines.

We think that metadata are, on the Web as in the library world, a means of addressing this heterogeneity problem. The HTML standard allows authors to embed internal metadata in Web pages with the meta tag. However, the use of this tag is not widespread, because it is not well known to authors. On the other hand, these metadata are often misused: either through a lack of practice or objectivity by honest authors, or diverted from their intended use by those who master them in order to gain better visibility on the Web. That is why most search engines ignore their content in their algorithms. In order to obtain a systematic and uniform qualification of documents, we think that metadata should be valued on the search-engine side (external metadata). However, a manual qualification of the whole Web seems impossible, because the number of documents to qualify would make the cost far too large. So only automatic or semi-automatic methods are conceivable.
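As an illustration, such internal metadata take the form of name/content pairs in the page head, which a crawler can read with standard tools. The sketch below is not from the paper; it uses Python's html.parser from the standard library, and the page content and property names are hypothetical.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect the name/content pairs declared with the meta tag."""
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.metadata[attrs["name"].lower()] = attrs["content"]

# Hypothetical page: an author-supplied description and keywords.
page = ('<html><head>'
        '<meta name="description" content="A survey of Web metadata">'
        '<meta name="keywords" content="metadata, search engines">'
        '</head></html>')

extractor = MetaExtractor()
extractor.feed(page)
print(extractor.metadata)
# {'description': 'A survey of Web metadata',
#  'keywords': 'metadata, search engines'}
```

As discussed above, nothing guarantees that such author-supplied values are present, objective, or honest, which is precisely why an external, engine-side qualification is preferable.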
Our qualification method

Our semi-automatic method to characterize Web pages is composed of two steps. The first step extracts homogeneous corpora. This step is fully automatic: it consists of a clustering method that uses a similarity measure based on the co-citation frequency between pages. Given the cluster hierarchy, the second step selects a reduced number of documents to be manually qualified and propagates the given metadata values to the other documents belonging to the same cluster.
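To make the two steps concrete, here is a minimal Python sketch, not the paper's implementation. The first function computes the co-citation frequency between pages (the number of pages that link to both members of a pair), which serves as the similarity measure for the clustering step; the second function propagates the metadata values obtained by manually qualifying one representative document to the rest of its cluster. The toy link graph, the choice of representative, and the qualify callback are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def cocitation_frequencies(out_links):
    """out_links maps each citing page to the set of pages it links to.
    Returns a dict mapping an unordered pair of pages to its co-citation
    frequency, i.e. how many pages cite both members of the pair."""
    freq = defaultdict(int)
    for cited_pages in out_links.values():
        for a, b in combinations(sorted(cited_pages), 2):
            freq[(a, b)] += 1
    return freq

def propagate_metadata(clusters, qualify):
    """clusters is a list of page lists produced by the clustering step;
    qualify(page) is the manual step returning the metadata values (e.g.
    authority type, site type, information type, page type) for one page.
    Those values are copied to every other page of the same cluster."""
    qualified = {}
    for cluster in clusters:
        representative = cluster[0]        # assumption: first page stands in
        values = qualify(representative)   # manual qualification
        for page in cluster:
            qualified[page] = values
    return qualified

# Hypothetical toy link graph: p1 and p2 are co-cited by two pages.
links = {"c1": {"p1", "p2"}, "c2": {"p1", "p2", "p3"}}
print(dict(cocitation_frequencies(links)))
# {('p1', 'p2'): 2, ('p1', 'p3'): 1, ('p2', 'p3'): 1}
```

A hierarchy over these similarities can then be built with any agglomerative clustering procedure (for instance scipy.cluster.hierarchy applied to a distance matrix derived from the frequencies). The paper does not commit to a particular rule for selecting the documents to qualify within a cluster, so taking the first page above is purely illustrative.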