An Agent-Based Focused Crawling Framework for Topic- and Genre-Related Web Document Discovery

Pappas, Nikolaos; Katsimpras, Georgios; Stamatatos, Efstathios

doi:10.1109/ictai.2012.75

Cited by 7 publications

(7 citation statements)

References 17 publications

“…The measurement of execution time that is calculated as the time passed from starting of execution until the agents reach the predefined threshold of crawled webpages [162]. If there is no way to measure the actual number of webpages available, the total number of webpages collected by each crawler is used as metric [172]. Precision (Relevance) is judged by the human inspection which is biased and inconsistent.…”

Section: Performance Metrics For Focused Web Crawler (mentioning)

confidence: 99%

A survey of Web crawlers for information retrieval

Kumar, Bhatia, Rattan. 2017. WIREs Data Min & Knowl

Performance of any search engine relies heavily on its Web crawler. Web crawlers are the programs that get webpages from the Web by following hyperlinks. These webpages are indexed by a search engine and can be retrieved by a user query. In the area of Web crawling, we still lack an exhaustive study that covers all crawling techniques. This study follows the guidelines of systematic literature review and applies it to the field of Web crawling. We used the standard procedure of carrying out a systematic literature review on 248 studies from a total of 1488 articles published in 12 leading journals and other premier conferences and workshops. Existing literature about the Web crawler is classified into different key subareas. Each subarea is further divided according to the techniques being used. We analyzed the distribution of various articles using multiple criteria and depicted conclusions. Various studies that use open source Web crawlers are also reported. We have highlighted future areas of research. We call for an increased awareness in various fields of the Web crawler and identify how techniques from other domains can be used for crawling the Web. Limitations and recommendations for future are also discussed. WIREs Data Mining Knowl Discov 2017, 7:e1218. doi: 10.1002/widm.1218

This article is categorized under: Algorithmic Development > Web Mining; Fundamental Concepts of Data and Knowledge > Information Repositories; Fundamental Concepts of Data and Knowledge > Motivation and Emergence of Data Mining

“…Some of the other work on seed URL extraction and topic mapping are (i) Pappas et al [19] identified topics using dynamic seed URLs and evaluated topic relevance. In this work, the identification of seed URLs is manual and does not confirm representation of all subtopics of a topic.…”

Section: Literature Survey (mentioning)

confidence: 99%

Fine Grained Approach for Domain Specific Seed URL Extraction

Sanagavarapu, Sarangi, Y et al. 2018. Proceedings of the 51st Hawaii International Conference on System Sciences

Domain Specific Search Engines are expected to provide relevant search results. Availability of enormous number of URLs across subdomains improves relevance of domain specific search engines. The current methods for seed URLs can be systematic ensuring representation of subdomains. We propose a fine grained approach for automatic extraction of seed URLs at subdomain level using Wikipedia and Twitter as repositories. A SeedRel metric and a Diversity Index for seed URL relevance are proposed to measure subdomain coverage. We implemented our approach for 'Security-Information and Cyber' domain and identified 34,007 Seed URLs and 400,726 URLs across subdomains. The measured Diversity index value of 2.10 conforms that all subdomains are represented, hence, a relevant 'Security Search Engine' can be built. Our approach also extracted more URLs (seed and child) as compared to existing approaches for URL extraction.

“…The idea is that, given a query, up-to-date relevant documents can be retrieved from various domains and web-genres by following the path of a focused crawler, but also in a real-time manner. For the purposes of our system, [13] is especially suitable. It is an agent-based focused crawling framework that is able to retrieve topic- and genre-related web documents in an automated and real-time manner.…”

Section: Discovery Of Topic-related Web Documents (mentioning)

confidence: 99%

“…The Linkscore_T and Linkscore_G are relevance scores based on topic and genre accordingly; and they are computed by using link analysis techniques (see [13]).…”

Section: Discovery Of Topic-related Web Documents (mentioning)

confidence: 99%

Distinguishing the Popularity between Topics: A System for Up-to-Date Opinion Retrieval and Mining in the Web

Pappas, Katsimpras, Stamatatos. 2013. Computational Linguistics and Intelligent Text Processing

Abstract. The constantly increasing amount of opinionated texts found in the Web had a significant impact in the development of sentiment analysis. So far, the majority of the comparative studies in this field focus on analyzing fixed (offline) collections from certain domains, genres, or topics. In this paper, we present an online system for opinion mining and retrieval that is able to discover up-to-date web pages on given topics using focused crawling agents, extract opinionated textual parts from web pages, and estimate their polarity using opinion mining agents. The evaluation of the system on real-world case studies, demonstrates that is appropriate for opinion comparison between topics, since it provides useful indications on the popularity based on a relatively small amount of web pages. Moreover, it can produce genre-aware results of opinion retrieval, a valuable option for decision-makers.
