2003
DOI: 10.1007/978-3-540-39718-2_34

Automatic Annotation of Content-Rich HTML Documents: Structural and Semantic Analysis

Abstract: Although RDF/XML has been widely recognized as the standard vehicle for representing semantic information on the Web, an enormous amount of semantic data is still being encoded in HTML documents that are designed primarily for human consumption and not directly amenable to machine processing. This paper seeks to bridge this semantic gap by addressing the fundamental problem of automatically annotating HTML documents with semantic labels. Exploiting a key observation that semantically related items ex…

Cited by 47 publications (29 citation statements)
References 32 publications
“…We use the observation that semantically related content elements in a web page exhibit spatial locality [64,65] and often share the same alignment (matching X or Y coordinate) on a web page. Since a frame tree represents the layout of a web page, we infer that geometrical alignment of frames may imply semantic relationship between their respective content.…”
Section: Geometric Segmentation
Citation type: mentioning
confidence: 99%
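
The alignment observation quoted above can be illustrated with a minimal sketch. It is not the citing authors' implementation: the `Frame` type, its fields, and the `group_by_alignment` helper are hypothetical names introduced here, and exact coordinate equality is an assumed simplification (a real layout would likely need a tolerance).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Hypothetical layout box: a name plus its top-left corner in pixels."""
    name: str
    x: int
    y: int


def group_by_alignment(frames, axis="x"):
    """Group frame names that share the same coordinate on the given axis.

    Assumption: frames with a matching X (or Y) coordinate are treated as
    candidates for being semantically related. Only groups with at least
    two members are returned.
    """
    if axis not in ("x", "y"):
        raise ValueError("axis must be 'x' or 'y'")
    groups = defaultdict(list)
    for frame in frames:
        groups[getattr(frame, axis)].append(frame.name)
    return [names for names in groups.values() if len(names) > 1]
```

Calling `group_by_alignment(frames, "x")` on a list of such boxes returns the sets of names that are left-aligned with each other; passing `"y"` does the same for top alignment.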
“…These techniques are either domain [36] or site [83] specific or depend on fixed sets of HTML markup [91]. Semantic partitioning of Web pages has been described in [63][64][65]. These systems require semantic information (e.g.…”
Section: Semantic Analysis Of Web Content
Citation type: mentioning
confidence: 99%
“…We use the five-class scheme and adopt the layout features studied by Xiao et al [16] Web page segmentation is the preparing step for classifying block functions. It is a challenging task and has been widely studied [1,2,5,6,8,10,12,14,15]. Early researches relied on cues from HTML DOM trees, contents and links.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…The method column shows the classification that is defined in section 2. The most common SAP techniques are manually-created rules [16], pattern matching [18], automatic discovery of patterns [10], and wrapper induction, either linguistic [12] or structural based [17]. While the machine learning methods, such as those used by Amilcare [12], usually perform better [21], the rule-based MUSE system using conditional processing has shown that rule-based systems can equal the performance of machine learning-based systems [16].…”
Section: Platform Summary
Citation type: mentioning
confidence: 99%