TEX: An efficient and effective unsupervised Web information extractor

Sleiman, Hassan A.; Corchuelo, Rafael

doi:10.1016/j.knosys.2012.10.009

Cited by 46 publications

(29 citation statements)

References 59 publications

“…On the contrary, more different rules are encouraged to use when facing different tasks. In addition, two third-party tools can function together: HTML tidy [3] and HTML Parser [7]. The former is a proposal that is intended to preprocess web documents by fixing their HTML code and converting it into XHTML.…”

Section: Discussion (mentioning)

confidence: 99%

An Analysis of Characters and Structures of Web Pages Based on Regular Expressions

2014

Proceedings of the 3rd International Conference on Computer Science and Service System

“…The existing proposals work on one or more input web document and search for repetitive structures that hopefully identify the regions where the relevant information insides [3]. But the structures of documents varies enormously in a real-world application.…”

Section: Introduction (mentioning)

confidence: 99%

An Analysis of Characters and Structures of Web Pages Based on Regular Expressions

2014

Proceedings of the 3rd International Conference on Computer Science and Service System

“…Reis et al [41] proposed a tree edit distance method to derive a template underlying sample pages and used the derived template for data extraction. Recently, Sleiman and Corchuelo proposed an efficient simple multi-string alignment algorithm for recognizing a template and its variable contents [49]. The above approaches [6,14,16,41,49] do not require manually labeled data, which greatly reduces the manual effort in the data extraction process.…”

Section: Related Work (mentioning)

confidence: 99%

“…Recently, Sleiman and Corchuelo proposed an efficient simple multi-string alignment algorithm for recognizing a template and its variable contents [49]. The above approaches [6,14,16,41,49] do not require manually labeled data, which greatly reduces the manual effort in the data extraction process. However, they require that Web pages being analyzed must follow the same template.…”

Section: Related Work (mentioning)

confidence: 99%

Specification and discovery of web patterns: a graph grammar approach

Roudaki, Kong, Zhang

2016

Information Sciences

“…Kushmerick et al [11] pioneered this field with a proposal that learns token patterns that characterise the context of the information to extract; Hsu and Dung [8] devised a proposal that first learns an automaton that models the information to extract and then regular expressions to model transitions; Hogue and Karger [7] presented a proposal that is based on tree similarity; Álvarez et al [1] devised a proposal that relies on clustering, tree matching, string matching, and string alignment; Crescenzi and Merialdo [4] presented a proposal to infer a regular expression that models the differences amongst a number of documents, which are typically the information of interest; Kayed and Chang [9] devised a technique to learn rules that are context-free grammars; and Sleiman and Corchuelo [14,15] presented two proposals that are based on multi-string alignment.…”

Section: Related Work (mentioning)

confidence: 99%