The world is fighting an unprecedented coronavirus pandemic, and no country was prepared for it. With no cure available, understanding the nature of the disease is vital for accurate clinical diagnosis and for drug discovery. Because the available literature is vast, it is important to represent the disease domain as completely as possible. A system that extracts useful information from this literature should capture its morphology, semantics, syntax, and pragmatics. In addition, building a classifier for a particular domain suffers from the zero-frequency problem. To address this, latent topics are extracted and represented semantically in an ontology to build a text classifier for coronavirus literature. The classifier has two components: an ontology and a machine learning data model. The ontology models the morphological, semantic, and pragmatic aspects of the text data through Latent Dirichlet Allocation (LDA). It also preserves the contextual information in the document space, providing a holistic feature representation. To solve the zero-frequency problem and to extract actionable insights, a machine learning algorithm, the Multi-class Support Vector Machine (M-SVM), is combined with the ontology. It encodes features and yields a classifier with highly discriminated classes. Further, to preserve the contextual information space and to enable data model formulation, the ontology is generated as a knowledge graph with its respective predefined classes. The resulting dataset can be used for clinical diagnosis and further research on the disease. Experimental results show that the proposed classifier outperforms existing systems, with better domain representation.
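As a rough illustration of the classification stage described above, the following minimal sketch trains a multi-class linear SVM on per-document topic proportions. It is not the authors' implementation. The topic names, the toy topic-proportion vectors (hand-made stand-ins for LDA output), the one-vs-rest strategy, and all function names and hyperparameters are assumptions made for the example; the ontology and knowledge-graph stages are not modelled.

```python
import random

# Hypothetical topic and class names, invented for this sketch.
TOPICS = ["transmission", "treatment", "diagnosis"]

# Each entry pairs a made-up topic-proportion vector (a stand-in for
# what an LDA model would produce for one document) with a class label.
DOCS = [
    ([0.80, 0.10, 0.10], "transmission"),
    ([0.70, 0.20, 0.10], "transmission"),
    ([0.75, 0.05, 0.20], "transmission"),
    ([0.10, 0.80, 0.10], "treatment"),
    ([0.20, 0.70, 0.10], "treatment"),
    ([0.10, 0.75, 0.15], "treatment"),
    ([0.10, 0.10, 0.80], "diagnosis"),
    ([0.15, 0.15, 0.70], "diagnosis"),
    ([0.05, 0.20, 0.75], "diagnosis"),
]


def score(w, b, x):
    """Signed decision value of a linear model: w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_binary_svm(X, y, epochs=200, lr=0.1, lam=0.01, seed=0):
    """Linear SVM trained by stochastic subgradient descent on the
    L2-regularised hinge loss. Labels in y must be +1 or -1."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            violated = y[i] * score(w, b, X[i]) < 1
            # Regularisation step: shrink the weights slightly.
            w = [wj * (1 - lr * lam) for wj in w]
            if violated:
                # Hinge-loss step: move towards the misplaced sample.
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def train_one_vs_rest(X, labels):
    """Train one binary SVM per class (that class vs. all others)."""
    models = {}
    for c in sorted(set(labels)):
        y = [1 if lab == c else -1 for lab in labels]
        models[c] = train_binary_svm(X, y)
    return models


def predict(models, x):
    """Return the class whose binary SVM gives the highest score."""
    return max(models, key=lambda c: score(*models[c], x))


X = [x for x, _ in DOCS]
labels = [lab for _, lab in DOCS]
models = train_one_vs_rest(X, labels)
print(predict(models, [0.90, 0.05, 0.05]))
```

In practice the feature vectors would come from a fitted topic model over the corpus, and a library SVM would normally replace the hand-written trainer; the sketch only shows how topic proportions can serve as features for a multi-class margin classifier.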
HIGHLIGHTS
Because the available literature is vast, it is important to represent the disease domain as completely as possible. A system that extracts useful information from this literature should capture its morphology, semantics, syntax, and pragmatics.
The classifier has two components: an ontology and a machine learning data model. The ontology models the morphological, semantic, and pragmatic aspects of the text data through Latent Dirichlet Allocation (LDA). It also preserves the contextual information in the document space, providing a holistic feature representation.
To preserve the contextual information space and to enable data model formulation, the ontology is generated as a knowledge graph with its respective predefined classes. The resulting dataset can be used for clinical diagnosis and further research on the disease.
GRAPHICAL ABSTRACT