2016
DOI: 10.5121/avc.2016.3301
Survey of Web Crawling Algorithms


Cited by 4 publications (5 citation statements)
References 17 publications (14 reference statements)
“…Through the mining of web crawler algorithms, various possibilities are verified, including breadth-first (search the neighbors at the same level), depth-first (traverse to the bottom from the root node), URL ordering (queue), page-rank (importance based on the number of backlinks or citations), online page importance (importance of a page in a website), largest sites first (websites with the largest number of pages), page request—HTTP or the dynamic, customized site map (applicable to deal with updates on already visited pages), and filtering (query-based approach) [ 7 , 80 , 81 ]. In some of these algorithms, keywords are accepted as the search query, and all relevant URLs fulfilling that search query are returned.…”
Section: Discussion
confidence: 99%
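The breadth-first and URL-ordering (queue) strategies named in this excerpt can be sketched briefly. This is an illustrative assumption, not code from the cited survey: the in-memory `LINKS` graph stands in for real HTTP fetching and link extraction, and the `bfs_crawl` name is made up.

```python
from collections import deque

# Hypothetical in-memory link graph; a real crawler would fetch each
# URL over HTTP and extract its outlinks from the page markup.
LINKS = {
    "https://example.com/":  ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": [],
}

def bfs_crawl(seed, max_pages=100):
    """Breadth-first crawl: visit all neighbours at the current level
    before descending, with a FIFO queue providing the URL ordering."""
    queue = deque([seed])
    seen = {seed}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:   # avoid revisiting pages
                seen.add(link)
                queue.append(link)
    return visited
```

Swapping the `deque` for a stack (`pop()` from the end) would turn the same loop into the depth-first variant the excerpt also mentions.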
“…The process of getting information from web pages can be done through web crawling processes and through the Really Simple Syndication (RSS) format. Some web crawling methods such as By HTTP Get Request and Dynamic Web Page and By the use of filters are the most preferred methods [4]. In addition, the RSS format is a form of content syndication from Extensible Markup Language (XML) based websites that can also be used [5].…”
Section: Theoretical Basis and Related Work
confidence: 99%
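The RSS path described in this excerpt can be sketched with the standard library; as an assumption, a hardcoded RSS 2.0 string replaces a live feed, and the `parse_rss` helper is hypothetical rather than taken from the citing paper.

```python
import xml.etree.ElementTree as ET

# Minimal RSS 2.0 document used as a stand-in for a fetched feed.
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""

def parse_rss(xml_text):
    """Extract (title, link) pairs from the <item> elements of an
    RSS 2.0 channel, which is itself XML-based syndication."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]
```

In practice the string would come from an HTTP GET request against the feed URL, which connects this path back to the crawling methods the excerpt prefers.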
“…The application of the Naïve Bayes classification method is done by applying the Bayes theorem which is formulated by equation (4).…”
Section: Training Using the SVM, KNN, and Naïve Bayes
confidence: 99%
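The classification step this excerpt mentions rests on Bayes' theorem, P(C|x) ∝ P(x|C)·P(C), combined with the naive independence assumption across features. The citing paper's equation (4) is not reproduced here; the toy priors and likelihoods below are made-up illustrative values.

```python
from math import prod

def naive_bayes_score(prior, likelihoods):
    """Unnormalised posterior under the naive independence assumption:
    P(C | x1..xn) is proportional to P(C) * product of P(xi | C)."""
    return prior * prod(likelihoods)

def classify(priors, likelihoods_per_class):
    """Pick the class with the highest unnormalised posterior; the
    shared evidence term P(x) cancels in the comparison."""
    scores = {c: naive_bayes_score(priors[c], likelihoods_per_class[c])
              for c in priors}
    return max(scores, key=scores.get)
```
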
“…In this work we focus on tasks (b) and (c). To understand the detailed working of crawlers, see [2,3,4,5,6].…”
Section: Introduction
confidence: 99%