2013
DOI: 10.1016/j.knosys.2012.10.009

TEX: An efficient and effective unsupervised Web information extractor

Cited by 46 publications (29 citation statements)
References 59 publications
“…On the contrary, more different rules are encouraged to use when facing different tasks. In addition, two third-party tools can function together: HTML tidy [3] and HTML Parser [7]. The former is a proposal that is intended to preprocess web documents by fixing their HTML code and converting it into XHTML.…”
Section: Discussion (mentioning)
confidence: 99%
“…The existing proposals work on one or more input web document and search for repetitive structures that hopefully identify the regions where the relevant information insides [3]. But the structures of documents varies enormously in a real-world application.…”
Section: Introduction (mentioning)
confidence: 99%
“…Reis et al [41] proposed a tree edit distance method to derive a template underlying sample pages and used the derived template for data extraction. Recently, Sleiman and Corchuelo proposed an efficient simple multi-string alignment algorithm for recognizing a template and its variable contents [49]. The above approaches [6,14,16,41,49] do not require manually labeled data, which greatly reduces the manual effort in the data extraction process.…”
Section: Related Work (mentioning)
confidence: 99%
“…Recently, Sleiman and Corchuelo proposed an efficient simple multi-string alignment algorithm for recognizing a template and its variable contents [49]. The above approaches [6,14,16,41,49] do not require manually labeled data, which greatly reduces the manual effort in the data extraction process. However, they require that Web pages being analyzed must follow the same template.…”
Section: Related Work (mentioning)
confidence: 99%
“…Kushmerick et al [11] pioneered this field with a proposal that learns token patterns that characterise the context of the information to extract; Hsu and Dung [8] devised a proposal that first learns an automaton that models the information to extract and then regular expressions to model transitions; Hogue and Karger [7] presented a proposal that is based on tree similarity; Álvarez et al [1] devised a proposal that relies on clustering, tree matching, string matching, and string alignment; Crescenzi and Merialdo [4] presented a proposal to infer a regular expression that models the differences amongst a number of documents, which are typically the information of interest; Kayed and Chang [9] devised a technique to learn rules that are context-free grammars; and Sleiman and Corchuelo [14,15] presented two proposals that are based on multi-string alignment.…”
Section: Related Work (mentioning)
confidence: 99%