The Web poses itself as the largest data repository ever available in the history of humankind. Major efforts have been made in order to provide efficient access to relevant information within this huge repository of data. Although several techniques have been developed to the problem of Web data extraction, their use is still not spread, mostly because of the need for high human intervention and the low quality of the extraction results. In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites. Our approach is based on a highly efficient tree structure analysis that produces very effective results. We have tested our approach with several important Brazilian on-line news sites and achieved very precise results, correctly extracting 87.71% of the news in a set of 4088 pages distributed among 35 different sites.
The current boom of the Web is associated with the revenues originated from on-line advertising. While search-based advertising is dominant, the association of ads with a Web page (during user navigation) is becoming increasingly important. In this work, we study the problem of associating ads with a Web page, referred to as content-targeted advertising, from a computer science perspective. We assume that we have access to the text of the Web page, the keywords declared by an advertiser, and a text associated with the advertiser's business. Using no other information and operating in fully automatic fashion, we propose ten strategies for solving the problem and evaluate their effectiveness. Our methods indicate that a matching strategy that takes into account the semantics of the problem (referred to as AAK for "ads and keywords") can yield gains in average precision figures of 60% compared to a trivial vector-based strategy. Further, a more sophisticated impedance coupling strategy, which expands the text of the Web page to reduce vocabulary impedance with regard to an advertisement, can yield extra gains in average precision of 50%. These are first results. They suggest that great accuracy in content-targeted advertising can be attained with appropriate algorithms.
No abstract
In social networks such as Orkut, www.orkut.com, a large portion of the user queries refer to names of other people. Indeed, more than 50% of the queries in Orkut are about names of other users, with an average of 1.8 terms per query. Further, the users usually search for people with whom they maintain relationships in the network. These relationships can be modelled as edges in a friendship graph, a graph in which the nodes represent the users. In this context, search ranking can be modelled as a function that depends on the distances among users in the graph, more specifically, of shortest paths in the friendship graph. However, application of this idea to ranking is not straightforward because the large size of modern social networks (dozens of millions of users) prevents efficient computation of shortest paths at query time. We overcome this by designing a ranking formula that strikes a balance between producing good results and reducing query processing time. Using data from the Orkut social network, which includes over 40 million users, we show that our ranking, augmented by this new signal, produces high quality results, while maintaining query processing time small.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.