Lossless Separation of Web Pages into Layout Code and Data

Omari, Adi; Kimelfeld, Benny; Yahav, Eran; Shoham, Sharon

doi:10.1145/2939672.2939858

Cited by 8 publications

(8 citation statements)

References 36 publications

“…Researches on data extraction from the Deep web have been conducted by [13][14][15][16][17][18][19][20]. They are differentiated based on the number of web page inputs.…”

Section: Literature Review (mentioning)

confidence: 99%

“…They are differentiated based on the number of web page inputs. Researches on data extraction using one web page input were conducted by [13][14][15], in general they used a repeating structure of HTML tags, such as tables (<table>, <tr>, <th>, and <td>) and lists (<ul> and <li>). For example, consider a conference schedule in Fig.…”

Section: Literature Review (mentioning)

confidence: 99%

“…Web pages, webs for short, will be the best potential data source for BI. Many web data extraction researches have been conducted before the Industrial 4.0 era [12][13][14][15][16][17][18][19][20].…”

Section: Introduction (mentioning)

confidence: 99%

“…It is about 10% of the entire webs. In other words, the rest of the webs, around 90%, are Deep/Dark webs [13][14][15][16][17][18][19][20]. Google could not index the Deep/Dark webs information, so Google cannot search for the Deep webs.…”

Section: Introduction (mentioning)

confidence: 99%

Transformation from Web Pages to Optimal Normal Form Database Schema Using a Conceptual Schema Approach

Yuliana

Chittayasothorn

2021

2021 7th International Conference on Engineering, Applied Sciences and Technology (ICEAST)

Web pages and their embedded documents are a good source of information. However, issuing complex queries directly on web pages using direct pattern-matching techniques is a challenging task that requires lengthy procedural programming. Also, programmers working directly on web pages are in charge of the search path and routing, which depend on individual document structures and affect the correctness, completeness, and performance of the results. In contrast, relational databases are well-structured and backed by mathematical principles. Well-designed relational database structures are known to be anomaly-free. The standard relational database language, SQL, is a non-procedural language that defines the required results precisely. Query results are both correct and complete. Performance issues are handled by the intelligent query optimizers employed by modern-day Database Management Systems (DBMSs). This paper suggests an approach that transforms documents embedded in web pages in HTML format into corresponding relational database structures and populations. Functional dependencies (FDs) and multi-valued dependencies (MVDs) obtained from documents on the web pages are used to construct conceptual schema diagrams, which are further transformed into Optimal Normal Form (ONF) relational database structures. In this research project, the Object Role Model (ORM) conceptual schema model is employed. The paper discusses the ORM and the rationale behind its use, the detection of FDs and MVDs from web page documents, and the technical properties of the ONF relational database structures. Illustrated examples are also provided.

“…Relying on the HTML DOM tree structure makes it difficult to train a machine learning based model for publication extraction because: (i) Text in a publication string may be separated in many different DOM tree nodes. (ii) The DOM tree structure, which previous web data record extraction systems (Liu et al., 2003; Furche et al., 2014; Omari et al., 2016) rely on, may vary given the same webpage content.…”

Section: Related Work (mentioning)

confidence: 99%

PubSE: A Hierarchical Model for Publication Extraction from Academic Homepages

Zhang

Qi

Zhang et al.

2018

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Publication information in a researcher's academic homepage provides insights about the researcher's expertise, research interests, and collaboration networks. We aim to extract all the publication strings from a given academic homepage. This is a challenging task because the publication strings in different academic homepages may be located at different positions with different structures. To capture the positional and structural diversity, we propose an end-to-end hierarchical model named PubSE based on Bi-LSTM-CRF. We further propose an alternating training method for training the model. Experiments on real data show that PubSE outperforms the state-of-the-art models by up to 11.8% in F1-score.

Web Page Template and Data Separation for Better Maintainability

Zhao

Zhang

2018

Web Information Systems Engineering – WISE 2018
