2016
DOI: 10.1007/s10579-016-9359-2

Accurate and efficient general-purpose boilerplate detection for crawled web corpora

citations
Cited by 32 publications
(23 citation statements)
references
References 10 publications
“…A brief discussion of the benefits for BI of our approach as well as a sketch of the algorithm is presented in section 3.1. Differently from approaches like that in [30] our approach is not technology-specific and, thus, it is not supposed to become deprecated with the evolution of web technologies. Moreover, integrating our approach with our crawling ordering described in section 2.2, it is possible to reduce the storage requirements of crawling still enabling to restore the original webpages.…”
Section: Filtering (mentioning)
confidence: 97%
“…The system uses a variety of text functions, such as the ratio of text to label, the ratio of anchor text to text, and the density of title, keywords. In addition, the noise element detection of MLP (Multi-Layer Perceptron) is also carried out [27]. The results show that the linguistic features of sentence length play a key role in noise classification.…”
Section: B Page Noise Reduction (mentioning)
confidence: 99%
“…The post-processing modules are re-used from the previously developed texrex software (Schäfer and Bildhauer, 2012; Schäfer, 2016b; Schäfer, 2016a). The walker documents the progression of the RW in a short and a long file format.…”
Section: Built-in Processing and Output Formats (mentioning)
confidence: 99%