A Graph Theoretical Preprocessing Step for Text Compression

Phukon, Kaushik K.; Baruah, Hemanta K.

doi:10.14257/ijmue.2015.10.5.24

Cited by 1 publication

(2 citation statements)

References 13 publications

“…It seems to be obvious from the above definition that every binary relation on a finite set can be represented by a digraph without parallel edges. The composite graph model (CGM) (22) which was modeled by this author in the year 2012 which represents a web document as a directed and completely labeled graph. The CGM was developed with help of the Tag Sensitive Graph Model (TSGM) (22) and Context-Sensitive Graph model (CSGM) (22) .…”

Section: Web Document Content Mining Process (mentioning)

confidence: 99%

“…The composite graph model (CGM) (22) which was modeled by this author in the year 2012 which represents a web document as a directed and completely labeled graph. The CGM was developed with help of the Tag Sensitive Graph Model (TSGM) (22) and Context-Sensitive Graph model (CSGM) (22) . In the composite graph representation, we are using the TSGM to represent three sections of a general web page namely head, link and address.…”

Section: Web Document Content Mining Process (mentioning)

confidence: 99%

Incorporation of contextual information through Graph Modeling in Web content mining

Phukon

2020

IJST

Objectives: This research article addresses the problem of web document clustering by modeling web documents as directed, completely labeled graphs that incorporate contextual information in the computation to the extent required. The computational complexity of the MCS (maximum common subgraph) algorithm based on this graph model is O(n²), n being the number of nodes. Because computing graph similarity via MCS is NP-complete in general, this is an important result: it allows us to forgo sub-optimal approximation approaches and find the exact solution in polynomial time. Method: The first step in this approach to web document clustering is to represent each web document as a directed, completely labeled graph that retains the contextual information of the document under consideration. After the document has been modeled as a graph, the next step is to calculate the similarity between the graphical objects. For this purpose, a customized algorithm, the Algorithm for Maximum Common Subgraph Isomorphism (AMCSI) (1), based on a backtracking search scheme, is used. The proposed AMCSI algorithm solves the maximum common subgraph isomorphism problem in polynomial time. After obtaining the similarity values between the graphical objects, we use a customized fuzzy c-means algorithm to produce clusters from the target set of web documents. We use multidimensional scaling to express the distance values between the web pages (graphs) in two coordinates (x, y), and deterministic sampling to calculate the graph median during fuzzy c-means clustering. Findings: We present an alternative method for web document clustering that represents web documents as directed, completely labeled graphs, for which the computational complexity of the MCS algorithm is O(n²) (1). A new distance measure is also developed based on the directed, completely labeled graph representation; it gives 16.9% better results than the prevailing methods (2). For clustering, we chose the fuzzy c-means algorithm and customized the original algorithm to work with graphical objects. This approach enables us to model web documents as graphs without discarding contextual information and then cluster these graphical objects with a well-established clustering algorithm.
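To illustrate why unique node labels make the maximum common subgraph cheap to compute, the following is a minimal sketch, not the AMCSI algorithm or the new distance measure from the abstract. It assumes each graph is given as a pair (set of node labels, set of directed edges), and it uses the well-known Bunke-Shearer MCS distance, 1 - |mcs(G1, G2)| / max(|G1|, |G2|), with graph size measured in nodes. The function names and the two example pages are hypothetical.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]
Graph = Tuple[Set[str], Set[Edge]]  # (node labels, directed edges)


def max_common_subgraph(g1: Graph, g2: Graph) -> Graph:
    """With unique node labels, the MCS is just the shared nodes and edges."""
    nodes = g1[0] & g2[0]
    # Keep only edges present in both graphs whose endpoints are shared nodes.
    edges = {(u, v) for (u, v) in g1[1] & g2[1] if u in nodes and v in nodes}
    return nodes, edges


def mcs_distance(g1: Graph, g2: Graph) -> float:
    """Bunke-Shearer distance, with graph size taken as the node count."""
    largest = max(len(g1[0]), len(g2[0]))
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    common_nodes, _ = max_common_subgraph(g1, g2)
    return 1.0 - len(common_nodes) / largest


# Hypothetical example pages, with terms as node labels.
page_a: Graph = (
    {"web", "graph", "model", "cluster"},
    {("web", "graph"), ("graph", "model"), ("model", "cluster")},
)
page_b: Graph = (
    {"web", "graph", "mining"},
    {("web", "graph"), ("graph", "mining")},
)

print(mcs_distance(page_a, page_b))  # 2 shared nodes out of max(4, 3)
```

Because the set intersections touch each node and edge once, and a digraph without parallel edges has at most n² edges, this sketch stays within the O(n²) bound quoted in the abstract.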

Section: Web Document Content Mining Process (mentioning)

confidence: 99%