The field of Natural Language Processing (NLP) focuses on developing computational techniques to analyze and extract information from human language. With the exponential growth of unstructured textual data, NLP-based techniques have become essential for extracting valuable insights from this data. However, existing information extraction systems struggle to extract valuable information without predefined relations or an ontology, and to store the extracted knowledge effectively. This Ph.D. thesis aims to enhance open information extraction methods so that unstructured textual data can be represented efficiently and effectively.
The first part of the research focuses on Open Information Extraction (OIE) systems and their challenges. Existing OIE methods, spanning pattern-based, machine learning-based, and neural approaches, are analyzed to understand their limitations. Chapter 3 proposes a Bidirectional Gated Recurrent Unit (Bi-GRU) OIE model that uses contextualized word embeddings to extract relevant triples from unstructured text. Experimental results demonstrate the effectiveness of this model in generating high-quality relation triples.
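To make the architecture concrete, the following is a minimal sketch of such a model, assuming OIE is cast as token-level sequence labeling over pre-computed contextualized embeddings; the layer sizes, label count, and class name `BiGRUTagger` are illustrative, not the thesis implementation.

```python
# Minimal sketch: a Bi-GRU tagger over contextualized word embeddings.
import torch
import torch.nn as nn

class BiGRUTagger(nn.Module):
    """Assigns each token an OIE role label (e.g. B-ARG1, B-REL, O)
    given pre-computed contextualized embeddings for the sentence."""

    def __init__(self, emb_dim=768, hidden_dim=256, num_labels=7):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden_dim,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, embeddings):        # (batch, seq_len, emb_dim)
        hidden, _ = self.gru(embeddings)  # (batch, seq_len, 2*hidden_dim)
        return self.classifier(hidden)    # per-token label logits

# Example: two sentences of ten tokens each, represented by 768-d embeddings
# (e.g. from a contextualized encoder such as BERT; the encoder is not shown).
logits = BiGRUTagger()(torch.randn(2, 10, 768))
print(logits.shape)  # torch.Size([2, 10, 7])
```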
Chapter 4 addresses the lack of labeled data, a common problem in NLP tasks. The research extends the OIE model from Chapter 3 by reusing its learned features to generate relation triples, and explores how well these features transfer across OIE domains and to the related task of Relation Extraction (RE). The results are comparable to those of traditional supervised training, indicating that features learned for OIE can deliver competitive performance on NLP tasks without task-specific labeled data.
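One common way to realize such feature transfer is to freeze the encoder trained on the source task and train only a new head on the target task. The sketch below illustrates this pattern, reusing the hypothetical `BiGRUTagger` from the previous sketch; it is an assumed setup, not the thesis's exact procedure.

```python
# Sketch of feature transfer: freeze the Bi-GRU encoder learned for OIE and
# attach a fresh classification head for a target task such as sentence-level RE.
import torch
import torch.nn as nn

class TransferredRelationClassifier(nn.Module):
    def __init__(self, pretrained_tagger, num_relation_types=10):
        super().__init__()
        self.encoder = pretrained_tagger.gru       # features learned for OIE
        for param in self.encoder.parameters():    # keep them fixed
            param.requires_grad = False
        self.head = nn.Linear(2 * self.encoder.hidden_size, num_relation_types)

    def forward(self, embeddings):                 # (batch, seq_len, emb_dim)
        hidden, _ = self.encoder(embeddings)
        sentence_repr = hidden.mean(dim=1)         # simple pooling over tokens
        return self.head(sentence_repr)            # one relation label per sentence

clf = TransferredRelationClassifier(BiGRUTagger())
print(clf(torch.randn(2, 10, 768)).shape)  # torch.Size([2, 10])
```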
In Chapter 5, the focus shifts to enhancing pre-trained language models for taxonomy classification. Pre-trained language models often struggle with unseen patterns during inference, and the limited size of annotated data poses a challenge. A two-stage fine-tuning procedure, incorporating data augmentation techniques, is proposed to improve the generalizability of pre-trained models. Experimental results demonstrate strong generalizability on unseen data, with an F1 score of 91.25%.
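The outline below sketches the two-stage idea under simple assumptions: the model first sees augmented copies of the annotated data to broaden coverage, then is fine-tuned on the clean annotations. The word-dropout augmentation and the `fine_tune_step` callback are hypothetical placeholders, not the specific techniques used in the thesis.

```python
# Hypothetical sketch of two-stage fine-tuning with data augmentation.
import random

def augment(sentence, drop_prob=0.1):
    """Create a noisy variant of a labelled example by randomly dropping words."""
    tokens = sentence.split()
    kept = [t for t in tokens if random.random() > drop_prob] or tokens
    return " ".join(kept)

def two_stage_fine_tune(model, train_set, fine_tune_step):
    # Stage 1: adapt the pre-trained model on original plus augmented examples.
    augmented = [(augment(text), label) for text, label in train_set]
    fine_tune_step(model, train_set + augmented)
    # Stage 2: specialise on the original, clean annotations only.
    fine_tune_step(model, train_set)
    return model
```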
Chapter 6 explores the use of OIE for constructing a knowledge graph, specifically in the context of cyber threat intelligence. Open-CyKG, an open cyber threat intelligence knowledge graph framework, is designed using an attention-based neural OIE model and a Named Entity Recognition (NER) model. Refinement and canonicalization techniques are employed to overcome ambiguity and data redundancy during knowledge graph construction. The results show that the constructed knowledge graph can be queried efficiently, highlighting how OIE supports knowledge graph development. The proposed components surpass state-of-the-art results in OIE, NER, and knowledge graph canonicalization.
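As a rough illustration of the canonicalization step, the sketch below groups entity mentions extracted by OIE that likely refer to the same real-world entity. It uses plain string similarity and invented example mentions purely for demonstration; the actual framework may rely on embeddings and richer clustering.

```python
# Illustrative canonicalization: merge near-duplicate entity mentions into clusters.
from difflib import SequenceMatcher

def canonicalize(mentions, threshold=0.85):
    """Group similar mentions under a canonical form chosen as the first variant seen."""
    clusters = {}  # canonical mention -> list of matched variants
    for mention in mentions:
        match = next(
            (c for c in clusters
             if SequenceMatcher(None, mention.lower(), c.lower()).ratio() >= threshold),
            None)
        clusters.setdefault(match or mention, []).append(mention)
    return clusters

# Hypothetical OIE output from cyber threat intelligence text.
triples = [("Hydraq trojan", "targets", "Internet Explorer"),
           ("the Hydraq Trojan", "exploits", "an IE vulnerability")]
print(canonicalize([subj for subj, _, _ in triples]))
# {'Hydraq trojan': ['Hydraq trojan', 'the Hydraq Trojan']}
```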
The research presented in the previous chapters demonstrates significant improvements in the efficiency and effectiveness of open information extraction methods for representing unstructured textual data. These advancements leverage techniques such as data augmentation, multi-stage fine-tuning, and pre-trained language models. The construction of knowledge graphs, enabled by OIE, has the potential to mimic human intelligence and benefit various complex applications, including recommender systems, search engines, and dialog systems.