2011
DOI: 10.1117/12.876708

Title identification of web article pages using HTML and visual features

Abstract: Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not an easy problem, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground truth …
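To make the feature list in the abstract concrete, here is a minimal sketch of how such features might be combined into a single score per candidate DOM node. It is not the paper's method: the abstract is truncated and gives no weights or combination rule. The `Candidate` type, the `TAG_WEIGHT` table, the 0.4/0.25/0.25/0.1 weights, and the treatment of horizontal alignment as a simple centered flag are all assumptions made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical tag weights; the abstract does not state actual values.
TAG_WEIGHT = {"h1": 1.0, "h2": 0.8, "h3": 0.6, "strong": 0.4, "b": 0.4}


@dataclass
class Candidate:
    """A text-bearing DOM node considered as a possible article title."""
    text: str
    tag: str          # lowercase HTML tag name of the node
    font_size: float  # rendered font size in pixels
    centered: bool    # assumed stand-in for "horizontal alignment"


def score(c: Candidate, html_title: str, max_font: float) -> float:
    """Combine the four features into one score (weights are assumptions)."""
    # Similarity between the node text and the page's <title> field.
    sim = SequenceMatcher(None, c.text.lower(), html_title.lower()).ratio()
    tag = TAG_WEIGHT.get(c.tag, 0.0)
    # Font size normalised by the largest candidate on the page.
    size = c.font_size / max_font if max_font else 0.0
    align = 1.0 if c.centered else 0.0
    return 0.4 * sim + 0.25 * tag + 0.25 * size + 0.1 * align


def pick_title(candidates: list[Candidate], html_title: str) -> Candidate:
    """Return the highest-scoring candidate."""
    max_font = max(c.font_size for c in candidates)
    return max(candidates, key=lambda c: score(c, html_title, max_font))
```

A caller would build one `Candidate` per visible text node, using values from a rendered DOM, and pass the page's `<title>` text as `html_title`. A linear combination is only one option; a trained classifier over the same features would be an equally plausible reading of the abstract.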

Citations: cited by 7 publications (6 citation statements)
References: 2 publications
“…[18] gives a survey of web information extraction systems. [1,2,3] have studied main article extraction on the Web; [4] focuses on the identification of titles in web pages, [5] investigates authors extraction of web document. Earlier, [6] has proposed a rule learning algorithm LP2, and implemented an automatic structural extraction module used by some semantic structural extraction systems [9].…”
Section: Related Work (mentioning)
confidence: 99%
“…Extracting structural informative content from web pages has attracted many research efforts in recent years [1,2,3,4,5]. The identification and extraction of the main article [1,2,3], paper title [4] and authors [5] is studied in the literature.…”
Section: Introduction (mentioning)
confidence: 99%
“…Tagging and extracting informative content from web pages has attracted many research efforts in recent years [1][2][3][4][5]. The identification of titles [4], authors [5], and the main article [1][2][3] is studied in the literature.…”
Section: Introduction (mentioning)
confidence: 99%
“…The identification of titles [4], authors [5], and the main article [1][2][3] is studied in the literature. In this paper, we focus on identifying general semantic groups from data-intensive web pages with a hybrid approach which combines user guidance and unsupervised clustering techniques.…”
Section: Introduction (mentioning)
confidence: 99%