Proceedings of the 24th International Conference on World Wide Web 2015
DOI: 10.1145/2736277.2741659

Automatic Web Content Extraction by Combination of Learning and Grouping

Abstract: Web pages consist of not only actual content, but also other elements such as branding banners, navigational elements, advertisements, copyright notices, etc. This noisy content is typically not related to the main subjects of the webpages. Identifying the part of actual content, or clipping web pages, has many applications, such as high quality web printing, e-reading on mobile devices and data mining. Although there are many existing methods attempting to address this task, most of them can either work only on certai…

Cited by 29 publications (13 citation statements)
References 23 publications
“…Next, the feature DOM tree is created, and finally, the feature DOM tree is used for noise detection. Enhanced DOM tree and context features are also tested in [37] [38] [39] for noise detection.…”
Section: B Page Noise Reduction (mentioning)
confidence: 99%
“…Wu et al. [47] proposed a machine learning model using DOM tree node features such as position, area, font, text and tag properties to select and group content-related nodes and their children. In their recent paper, Vogels et al. [42] presented an algorithm combining a hidden Markov model and convolutional neural networks (CNNs).…”
Section: Related Work 61 Content Extraction (mentioning)
confidence: 99%
“…Web content extraction is very well investigated in the literature [1,2,3,4,5,6,7,8,9,10,11,12]. Many of these approaches apply techniques based on certain heuristics, machine learning or site-specific solutions such as rule-based content extraction, DOM tree parsing, text graph, link graph or vision-based models, NLP features such as N-grams, or shallow text features such as number of tokens, average sentence length and so on.…”
Section: Related Work (mentioning)
confidence: 99%
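
The shallow text features named in the statement above (number of tokens, average sentence length) can be illustrated with a minimal Python sketch. The function name `shallow_text_features`, the whitespace tokenization, and the punctuation-based sentence split are assumptions made for illustration; they are not taken from the cited papers.

```python
import re


def shallow_text_features(block: str) -> dict:
    """Compute simple per-block text features (illustrative, hypothetical helper)."""
    # Rough sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", block.strip()) if s]
    # Whitespace tokenization; a real system would likely use a proper tokenizer.
    tokens = block.split()
    avg_len = len(tokens) / len(sentences) if sentences else 0.0
    return {
        "num_tokens": len(tokens),
        "num_sentences": len(sentences),
        "avg_sentence_length": avg_len,
    }


if __name__ == "__main__":
    print(shallow_text_features("Web pages contain noise. Main content is what readers want."))
```

Feature values like these would typically be computed per text block and passed to a classifier; which features and classifier to use depends on the specific method.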
“…In this work, a set of relevant features is selected for each text block in the HTML document and then, using a Support Vector Machine (SVM) classifier, each text block is classified as either a content block or a non-content block. Most recently, Wu et al. [12] formulated the content identification problem as a DOM tree node selection problem. Using multiple features from DOM node properties, a machine learning model is trained and a set of candidate nodes is selected based on the learning model.…”
Section: Related Work (mentioning)
confidence: 99%