Main Content Extraction from Web Pages Based on Node Characteristics

Liu, Qingtang; Shao, Mingwang; Wu, Linjing; Zhao, Gang; Guilin, Fan; Li, Jun

doi:10.5626/jcse.2017.11.2.39

Cited by 13 publications

(6 citation statements)

References 14 publications

“…There are a lot of different website layouts 4 that define a website's structure. Relevant and irrelevant images are included in these layouts.…”

Section: Motivation (mentioning)

confidence: 99%

“…This structure is called an RDF is a W3C standard for data interchange on the Web. For more information, see https://www.w3.org/RDF/ 4 Web layouts consist of patterns that rule the structure of the document. For more examples about layouts, see https://www.w3schools.com/css/css_website_layout.asp and https://www.w3schools.com/css/css_templates.asp.…”

Section: Motivation (mentioning)

confidence: 99%

Automatically Discovering Relevant Images From Web Pages

et al. 2020

“…Once extracted, this news can be adapted to support different tasks; news extraction can be applied to generate news highlight sentences that capture the main topic within a news article [4]. Also, the fast and effective extraction of content from webpages could be used to adapt webpages for small screen devices [5] and as proclaimed [6], if we can extract the relevant content of a webpage rapidly, many semantic applications such as search engines can be developed by leveraging this.…”

Section: Introduction (mentioning)

confidence: 99%

A Novel Approach to News Archiving from Newswires

Muhammad-Bello

Lukman

Salim

2021

Communications in Computer and Information Science

A news archive is the core operational tool a media relations team depends on in order to effectively feed a data-hungry organization. An established approach to news archiving is the use of a relational database. As a consequence, integrating search engines that support full-text search is practically impossible due to the strict data schema that is defined in relational database systems. Therefore, there is a need for news archives that support full-text search with relevance ranking of news. In this paper, an approach that supports full-text search is proposed. The process starts by crawling newswire websites for news items that are relevant with respect to some predefined keywords and extracting them. Then, they are stored in a data structure known as an inverted index, which supports full-text search, aggregation, and relevance ranking of search results. Search results are ranked and returned to a user in order of decreasing relevance to the search term. We implemented a software solution in Java, using the jsoup library for HTML parsing and Elasticsearch as the search engine. We tested our solution on nine newswires using ten keywords and were able to retrieve a total of 42 relevant news items matching seven keywords. The approach proposed in this paper, when compared to the manual approach, performed better in terms of retrieval speed and accuracy. We conclude that three main components are important in a good digital archive: relevance, extraction, and search. This work is an integration of a good relevance marking technique, an extraction method, and a search engine.
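
The inverted index mentioned in this abstract can be illustrated with a minimal sketch. It is written in Python rather than the Java, jsoup, and Elasticsearch stack the abstract names, and the function names (tokenize, build_index, search) and the raw term-frequency scoring are assumptions made for illustration only; they are not the paper's implementation, and a production search engine scores results differently.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, query):
    """Return doc ids ordered by decreasing score.

    The score is the summed frequency of the query terms in each
    document (an illustrative assumption). Ties are broken by doc id.
    """
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked]


# Hypothetical news snippets, keyed by a made-up document id.
docs = {
    "a": "Flood warning issued as flood waters rise",
    "b": "Flood relief funds approved",
    "c": "Sports results from the weekend",
}
index = build_index(docs)
print(search(index, "flood"))
```

Because lookup goes from term to documents, a query only touches the postings of its own terms, which is what makes full-text search over the archive cheap compared with scanning every stored article.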

“…Obtaining these layouts has become very crucial for text processing applications such as search engines, sentiment analysis, recommendation systems, trend detection/monitoring, and e-commerce market monitoring. Many studies [1][2][3][4] in the literature focus on extracting parts of the title, summary, main text from web pages automatically. Some studies [5,6] are about obtaining the review part automatically.…”

Section: Introduction (mentioning)

confidence: 99%

“…CSS selectors or XML Path Language (XPath) are both capable of finding element/s containing data on this tree for the extraction process. Uzun et al [8] compare three different well-known .NET parsers, including HAP 1 , AngleSharp 2 and Microsoft (MS) HTMLDocument 3 to extract data from web pages. They use XPath patterns for this task.…”

Section: Introduction (mentioning)

confidence: 99%

A regular expression generator based on CSS selectors for efficient extraction from HTML pages

2020

Turk J Elec Eng & Comp Sci

Cascading Style Sheets (CSS) selectors are patterns used to select HTML elements. They are often preferred in web data extraction because they are easy to prepare and have short expressions. In order to be able to extract data from web pages by using these patterns, a Document Object Model (DOM) tree is constructed by an HTML parser for a web page. The construction process of this tree and the extraction process using this tree increase time and memory costs depending on the number of HTML elements and their hierarchies. For reducing these costs, regular expressions can be considered as a solution. However, preparing regular expression patterns is a laborious task. In this study, a heuristic approach, namely REGEXN, that automatically generates these patterns through CSS selectors is introduced and the performance gains are analyzed on a web crawler. The analysis shows that regular expression patterns generated by this approach can significantly reduce the average extraction time results from 743.31 ms to 1.03 ms when compared with the extraction process from a DOM tree. Similarly, the average memory usage drops from 1054.01 bytes to 1.59 bytes. Moreover, REGEXN can be easily adapted to the existing frameworks and tools in this task.
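
The general idea of replacing DOM construction with a regular expression derived from a CSS selector can be sketched as follows. This is not REGEXN, whose heuristics are not described in the abstract; selector_to_regex and extract are hypothetical helpers that handle only `tag` and `tag.class` selectors, assume double-quoted class attributes, and do not cope with elements of the same tag nested inside one another.

```python
import re


def selector_to_regex(selector):
    """Translate a simple 'tag' or 'tag.class' selector into a regex.

    Illustrative assumption: only these two selector shapes are
    supported, and class attributes are double-quoted.
    """
    tag, _, cls = selector.partition(".")
    if cls:
        # Require the class name as a whole word inside class="...".
        attrs = r'[^>]*\bclass="[^"]*\b' + re.escape(cls) + r'\b[^"]*"[^>]*'
    else:
        attrs = r"[^>]*"
    pattern = (
        r"<" + re.escape(tag) + r"\b" + attrs + r">"
        r"(.*?)"
        r"</" + re.escape(tag) + r">"
    )
    return re.compile(pattern, re.DOTALL | re.IGNORECASE)


def extract(html, selector):
    """Return the inner content of every element matching the selector."""
    return [m.strip() for m in selector_to_regex(selector).findall(html)]


# Hypothetical page fragment used only to exercise the sketch.
html = (
    '<div class="news title">Hello</div>'
    '<div class="body">World</div>'
    "<p>x</p>"
)
print(extract(html, "div.title"))
```

The pattern scans the raw HTML string directly, so no tree is built; the trade-off is that a regular expression cannot follow arbitrary nesting the way a DOM parser does, which is why the sketch restricts itself to flat, non-nested matches.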
