International Conference on Electrical, Electronic and Computer Engineering, 2004. ICEEC '04.
DOI: 10.1109/iceec.2004.1374396
More effective, efficient, and scalable web crawler system architecture

Cited by 2 publications (2 citation statements)
References 6 publications
“…The pages are retrieved by the web crawler, which follows the links available on each page. Each page is sent to the parser, a major component of the crawling technology, which checks whether relevant information has been retrieved [5]. The relevant content is then indexed by the indexer [6] and stored for later use.…”
Section: Architecture of a Crawler
confidence: 99%
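The pipeline this citation statement describes (fetch a page, follow its links, parse for relevance, index relevant content) can be sketched in a few lines. The sketch below is an illustration only, not the architecture proposed in the cited paper; the `is_relevant` filter, the in-memory index, and the breadth-first frontier are assumed placeholders.

```python
# Minimal sketch of the crawler -> parser -> indexer pipeline described
# above. is_relevant and the in-memory index are illustrative assumptions,
# not components taken from the cited paper.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects hyperlinks found in a fetched page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def is_relevant(html):
    # Placeholder relevance check; a real parser component would apply
    # topic filters or content analysis here (assumption).
    return "crawler" in html.lower()


def crawl(seed, max_pages=10):
    frontier = [seed]          # URLs waiting to be fetched
    seen = set()
    index = {}                 # indexer: stores relevant content for later use

    while frontier and len(index) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue           # skip unreachable or non-HTTP URLs

        parser = LinkParser()
        parser.feed(html)
        # Follow the links available on the retrieved page.
        frontier.extend(urljoin(url, link) for link in parser.links)

        if is_relevant(html):
            index[url] = html  # indexed and stored for later use

    return index
```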
“…The nature of the web is to link multiple resources as hyperlinks among them and, following that analogy, the process of reaching an end resource is done by crawling the interconnected nodes. Historically, search engines have been fed by multiple web crawlers [4,6,9] that automatically track and follow the hyperlinks in web content, creating a database of entries that are usually formatted into a human-readable view to be presented to humans. This adds overhead to the automatic retrieval of content from search engines, as their results usually have to be analyzed and parsed from a markup language; in addition, navigation through their content is often handled dynamically by JavaScript code in the form of AJAX calls [5], which requires a sort of human intervention, such as scrolling down the content or clicking on certain regions of the view, adding extra layers of complexity to the task of crawling those web sites.…”
Section: Introduction
confidence: 99%
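A concrete illustration of the parsing overhead this statement mentions: recovering structured (title, URL) entries from a human-readable, markup-formatted results page. This is a minimal sketch under an assumed markup shape; note that content a page injects later through JavaScript/AJAX calls never appears in the statically fetched HTML, which is exactly why such pages demand the extra handling the authors describe.

```python
# Minimal sketch: turning a markup-formatted results page back into
# structured (title, URL) entries. The markup shape is an assumption;
# anything loaded dynamically via AJAX will not be present in static HTML.
from html.parser import HTMLParser


class ResultExtractor(HTMLParser):
    """Extracts (anchor text, href) pairs from static HTML."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.results.append(("".join(self._text).strip(), self._href))
            self._href = None


extractor = ResultExtractor()
extractor.feed('<ul><li><a href="https://example.org">Example entry</a></li></ul>')
print(extractor.results)  # [('Example entry', 'https://example.org')]
```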