Title identification of web article pages using HTML and visual features

Fan, Jian; Luo, Ping; Joshi, Parag

doi:10.1117/12.876708

Cited by 7 publications

(6 citation statements)

References 2 publications

“…[18] gives a survey of web information extraction systems. [1,2,3] have studied main article extraction on the Web; [4] focuses on the identification of titles in web pages, [5] investigates authors extraction of web document. Earlier, [6] has proposed a rule learning algorithm LP2, and implemented an automatic structural extraction module used by some semantic structural extraction systems [9].…”

Section: Related Work (mentioning)

confidence: 99%

“…Extracting structural informative content from web pages has attracted many research efforts in recent years [1,2,3,4,5]. The identification and extraction of the main article [1,2,3], paper title [4] and authors [5] is studied in the literature.…”

Section: Introduction (mentioning)

confidence: 99%

“…The identification and extraction of the main article [1,2,3], paper title [4] and authors [5] is studied in the literature. In this paper, we focus on extracting general structural groups from data-intensive web pages with PQL (Page Query Language), a SQL like query language.…”

Section: Introduction (mentioning)

confidence: 99%

Page query language generation for structural extraction

2014

2014 11th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD)

Information on the Web is usually presented to be understood by human users rather than machines, so it is not easy to catalogue and extract automatically with a software agent alone. Based on these observations, we present an approach that uses human-guided operations to automatically generate a query in PQL, an SQL-like query language for Web pages, to extract the information fragments of interest. The PQL query uses XPath expressions to locate the target HTML nodes. We develop a K-Medoid clustering algorithm to process PQL queries and generate the structural extractions. The extracted information is structured as a relational table (in CSV format), which can be manipulated with spreadsheet software or a relational DBMS.
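
The abstract above mentions locating HTML nodes with XPath expressions and emitting a relational table in CSV format. The following is only a minimal sketch of that general idea, not the PQL system itself: the page fragment, the function names `extract_rows` and `to_csv`, and the XPath strings are all hypothetical, and the standard-library parser used here assumes well-formed markup (real HTML would typically need a dedicated HTML parser).

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html><body>
  <div class="item"><span class="name">Widget</span><span class="price">9.99</span></div>
  <div class="item"><span class="name">Gadget</span><span class="price">19.50</span></div>
</body></html>
"""


def extract_rows(markup, row_xpath, column_xpaths):
    """Locate one node per record, then one child node per column."""
    root = ET.fromstring(markup)
    rows = []
    for node in root.findall(row_xpath):
        row = []
        for xpath in column_xpaths:
            cell = node.find(xpath)
            # A missing node or empty text becomes an empty cell.
            row.append(cell.text.strip() if cell is not None and cell.text else "")
        rows.append(row)
    return rows


def to_csv(header, rows):
    """Serialize the extracted rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


rows = extract_rows(
    PAGE,
    ".//div[@class='item']",
    ["span[@class='name']", "span[@class='price']"],
)
csv_text = to_csv(["name", "price"], rows)
print(csv_text)
```

Each record-level XPath match becomes one table row and each column-level XPath becomes one field, which is one plausible way to read "structured as a relational table".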

“…Tagging and extracting informative content from web pages has attracted many research efforts in recent years [1][2][3][4][5]. The identification of titles [4], authors [5], and the main article [1][2][3] is studied in the literature.…”

Section: Introduction (mentioning)

confidence: 99%

“…The identification of titles [4], authors [5], and the main article [1][2][3] is studied in the literature. In this paper, we focus on identifying general semantic groups from data-intensive web pages with a hybrid approach which combines user guidance and unsupervised clustering techniques.…”

Section: Introduction (mentioning)

confidence: 99%

Semi-Automatic Online Tagging with K-Medoid Clustering

2014

Int. J. Soft. Eng. Knowl. Eng.

Online tagging is crucial for the acquisition and organization of web knowledge. We present TYG (Tag-as-You-Go) in this paper, a web browser extension for online tagging of personal knowledge on standard web pages. We investigate an approach to combine a K-Medoid-style clustering algorithm with the user input to achieve semi-automatic web page annotation. The annotation process supports user-defined tagging schema and comprises an automatic mechanism that is built upon clustering techniques, which can automatically group similar HTML DOM nodes into clusters corresponding to the user specification. TYG is a prototype system illustrating the proposed approach. Experiments with TYG show that our approach can achieve both efficiency and effectiveness in real world annotation scenarios.
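
The abstract above describes grouping similar HTML DOM nodes with a K-Medoid-style algorithm combined with user input. The sketch below illustrates a generic k-medoids loop under stated assumptions and is not the TYG implementation: DOM nodes are represented as hypothetical tag paths, `path_distance` is an illustrative distance invented for this example, and the user input is modeled as seed indices picked by the user.

```python
def path_distance(a, b):
    """Illustrative distance between two tag paths: the longer length
    minus the number of positions where the tags agree."""
    common = sum(1 for x, y in zip(a, b) if x == y)
    return max(len(a), len(b)) - common


def k_medoids(items, seeds, max_iter=20):
    """Alternate assignment and medoid update until the medoids stop changing.

    `seeds` are indices into `items`; here they stand in for nodes the
    user selected as examples (an assumption of this sketch).
    Returns a dict mapping each medoid index to its member indices.
    """
    medoids = list(seeds)
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, item in enumerate(items):
            nearest = min(medoids, key=lambda m: path_distance(item, items[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)  # keep the old medoid for an empty cluster
                continue
            best = min(
                members,
                key=lambda c: sum(path_distance(items[c], items[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters


# Hypothetical tag paths for six DOM nodes.
nodes = [
    ("html", "body", "ul", "li", "a"),
    ("html", "body", "ul", "li", "a"),
    ("html", "body", "ul", "li", "span"),
    ("html", "body", "div", "p"),
    ("html", "body", "div", "p"),
    ("html", "body", "div", "h2"),
]
clusters = k_medoids(nodes, seeds=[0, 3])
print(clusters)
```

Unlike k-means, each cluster center here is an actual item, which suits DOM nodes because no meaningful "average node" exists.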

TYG: A Tag-as-You-Go Online Annotation Tool for Web Browsing and Navigation

2013

Knowledge Science, Engineering and Management
