2017
DOI: 10.5626/jcse.2017.11.2.39

Main Content Extraction from Web Pages Based on Node Characteristics

Abstract: Main content extraction of web pages is widely used in search engines, web content aggregation and mobile Internet browsing. However, a mass of irrelevant information such as advertisement, irrelevant navigation and trash information is included in web pages. Such irrelevant information reduces the efficiency of web content processing in content-based applications. The purpose of this paper is to propose an automatic main content extraction method of web pages. In this method, we use two indicators to describe…

Cited by 13 publications (6 citation statements)
References 14 publications
“…There are a lot of different website layouts 4 that define a website's structure. Relevant and irrelevant images are included in these layouts.…”
Section: Motivation
Type: mentioning
confidence: 99%
“…This structure is called an RDF is a W3C standard for data interchange on the Web. For more information, see https://www.w3.org/RDF/ 4 Web layouts consist of patterns that rule the structure of the document. For more examples about layouts, see https://www.w3schools.com/css/css_website_layout.asp and https://www.w3schools.com/css/css_templates.asp.…”
Section: Motivation
Type: mentioning
confidence: 99%
“…Once extracted, this news can be adapted to support different tasks; news extraction can be applied to generate news highlight sentences that capture the main topic within a news article [4]. Also, the fast and effective extraction of content from webpages could be used to adapt webpages for small screen devices [5] and as proclaimed [6], if we can extract the relevant content of a webpage rapidly, many semantic applications such as search engines can be developed by leveraging this.…”
Section: Introduction
Type: mentioning
confidence: 99%
“…Obtaining these layouts has become very crucial for text processing applications such as search engines, sentiment analysis, recommendation systems, trend detection/monitoring, and e-commerce market monitoring. Many studies [1][2][3][4] in the literature focus on extracting parts of the title, summary, main text from web pages automatically. Some studies [5,6] are about obtaining the review part automatically.…”
Section: Introduction
Type: mentioning
confidence: 99%
“…CSS selectors or XML Path Language (XPath) are both capable of finding element/s containing data on this tree for the extraction process. Uzun et al [8] compare three different well-known .NET parsers, including HAP 1 , AngleSharp 2 and Microsoft (MS) HTMLDocument 3 to extract data from web pages. They use XPath patterns for this task.…”
Section: Introduction
Type: mentioning
confidence: 99%