TEXT: Automatic Template Extraction from Heterogeneous Web Pages

Kim, Chulyun; Shim, Kyuseok

doi:10.1109/tkde.2010.140

Cited by 44 publications

(24 citation statements)

References 22 publications

“…So, different values are assigned to each tag and the total value is calculated which is used to decide whether a block is noise [3]. C. Kim and K. Shim propose a novel algorithm to extract template from numerous documents constructed by heterogeneous template [4].…”

Section: Content Extraction (mentioning)

confidence: 99%

Authoring of Personalized Web Page from Heterogeneous Web Pages by Content Extraction and Integration

Weigang, Sun, Wang

2017

Proceedings of the International Conference on Computer Networks and Communication Technology (CNCT 2016)

Abstract. Authoring personalized Web pages by integrating heterogeneous Web page elements from different sites is a challenging task in Web 2.0 applications. This paper proposes an approach that extracts various partitions or elements (basic HTML elements, CSS definitions, JavaScript source code, etc.) from different Web sites and uses them to author new pages from heterogeneous Web pages. A novel DOM tree model, the CS-DOM tree, is introduced to retrieve the CSS definitions. To keep the new Web pages synchronized with the source pages, a method based on the DOM structure and the context of elements is then presented for relocating elements that were retrieved earlier. Finally, a similarity calculation algorithm is developed to judge whether the relocated elements and the previously retrieved elements come from the same position. The proposed method has been applied to develop a personalized portal.

“…Now, assume a web application's page w ∈ W, we categorize it into some of likely template (L) and recover various output parameters (O) consequently. Procedure for drawing out patterns of web page or template embedded in web page has been obtainable in current state-of-techniques [13,14]. Although, we utilize some of the techniques from TEXT [13] that highlights the DOM tree of a page reflecting the required paths.…”

Section: PHP Web Page Representation (mentioning)

confidence: 99%

“…Procedure for drawing out patterns of web page or template embedded in web page has been obtainable in current state-of-techniques [13,14]. Although, we utilize some of the techniques from TEXT [13] that highlights the DOM tree of a page reflecting the required paths. Our web page pattern mining technique comprises subsequent four stages.…”

Section: PHP Web Page Representation (mentioning)

confidence: 99%

PHP-sensor

Gupta

2015

Proceedings of the 12th ACM International Conference on Computing Frontiers

As the use of web applications for security-sensitive services has grown, the number and sophistication of web-based attacks against them have grown as well. Several annual cyber security reports reveal that modern web applications suffer from two main categories of attacks: workflow violation attacks and cross-site scripting (XSS) attacks. Compared with XSS attacks, very little work has addressed the discovery of workflow violation attacks, because web application logic errors are specific to the expected functionality of a particular web application. This paper presents PHP-Sensor, a novel defensive model that discovers vulnerabilities to both workflow violation attacks and XSS attacks in real-world PHP web applications. For workflow violation attacks, we extract a set of axioms by monitoring sequences of HTTP requests/responses and their corresponding session variables in offline mode. The axioms are then used to evaluate HTTP requests/responses in online mode; any request/response that bypasses the corresponding axiom is recognized as a workflow violation attack. For XSS attacks, PHP-Sensor discovers the self-propagating features of XSS worms by monitoring outgoing HTTP requests together with the scripts injected into the current HTTP response page. We implement a prototype of the model on a web proxy and on the client side to recognize workflow violation attacks and XSS attacks, respectively. We evaluate the detection capability of PHP-Sensor on open-source, real-world PHP web applications; the simulation results indicate that the model is efficient and feasible at discovering workflow violation attacks and XSS attacks, and that it incurs tolerable performance overhead.

“…Note in the previous model this search space is equal to all other clusters in set C. Now we can start to find the first best merging pair in GetInitBestPair method. To calculate the MDL cost we have to calculate the number of 1's in M_T and of -1's and +1's in M_Δ according to Lemma 3 [8]. And finally we can merge the two clusters whose MDL cost is the minimum one.…”

Section: International Journal of Computer Applications (0975-8887) (mentioning)

confidence: 99%

“…To simplify the clustering calculation the web document is represented in DOM tree format [5]. After that essential paths of this web document is extracted and represented into the ME matrix [8]. Each column & row in ME represents the document and the essential paths of that document respectively.…”

Section: Example (mentioning)

confidence: 99%
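
The representation described in the statement above (documents as DOM trees, paths collected into a matrix with one row per path and one column per document) can be illustrated with a minimal sketch. This is not the cited authors' implementation: the names `PathCollector`, `doc_paths`, and `build_matrix` are hypothetical, only tag paths are collected (text nodes are ignored), and the fixed `min_support` cutoff is an assumption standing in for whatever rule the paper uses to decide which paths are essential.

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Collect the root-to-node tag paths of one HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, root first
        self.paths = set()  # e.g. {"html", "html/body", "html/body/h1"}

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def doc_paths(html):
    """Return the set of tag paths found in one HTML string."""
    collector = PathCollector()
    collector.feed(html)
    return collector.paths


def build_matrix(docs, min_support=2):
    """Build a 0/1 path-by-document matrix (rows: paths, columns: documents).

    Assumption: a path is kept only if it occurs in at least `min_support`
    documents. This cutoff is illustrative, not the paper's definition.
    """
    per_doc = [doc_paths(d) for d in docs]
    support = {}
    for paths in per_doc:
        for p in paths:
            support[p] = support.get(p, 0) + 1
    rows = sorted(p for p, s in support.items() if s >= min_support)
    matrix = [[1 if p in paths else 0 for paths in per_doc] for p in rows]
    return rows, matrix
```

Calling `build_matrix` on a list of HTML strings returns the kept paths and the matrix; each column can then be treated as the vector for one document.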

Template Extraction from Heterogeneous Web Pages with Cosine Similarity

Kulkarni, Patil

2014

IJCA

Nowadays, the detection of templates from large numbers of web pages receives a lot of attention. Template detection improves the performance of clustering, classification, and search engines. In this work we propose a novel template extraction algorithm based on cosine similarity. We use cosine similarity to cluster the web documents and, using the underlying structure of the documents, find the template for each cluster. Our experimental evaluation shows that the approach is effective in terms of computing time and clustering cost.
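
As a rough illustration of clustering documents by cosine similarity, the sketch below compares documents represented as 0/1 path vectors and groups them in a single pass. The abstract does not describe the clustering procedure, so the function names (`cosine`, `cluster`), the single-pass grouping strategy, and the `threshold` value are all assumptions, not the authors' algorithm.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.8):
    """Group vector indices in one pass (illustrative strategy only).

    Each vector joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Documents that share most of their structural paths score close to 1 and land in the same cluster, while documents with little structural overlap start new clusters. The result depends on input order and on the threshold, which is one reason to treat this as a sketch rather than a tuned method.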
