Open-Source INTelligence is intelligence based on publicly available sources such as news sites, blogs, forums, etc. The Web is the primary source of information, but once data are crawled, they need to be interpreted and structured. Ontologies may play a crucial role in this process, but because of the vast amount of documents available, automatic mechanisms for populating them from the crawled text are needed. This paper presents an approach for the automatic population of predefined ontologies with data extracted from text and discusses the design and realization of a pipeline based on the General Architecture for Text Engineering system, which is interesting for both researchers and practitioners in the field. Experimental results, encouraging in terms of correctly extracted ontology instances, are also reported. Furthermore, the paper describes an alternative approach, with additional experiments, for one of the phases of our pipeline, which requires predefined dictionaries of relevant entities. This variant reduced the manual workload required in that phase while still obtaining promising results.
KEYWORDS
general architecture for text engineering (GATE), information extraction, internet as a data source, ontology population

in an automatic way was not here taken into consideration. Domain ontologies usually have a very complex intensional structure, designed to satisfy all the needs of the specific application domain at hand, and typically their design cannot be completely automated. 3,5 In this work, we have instead pursued an approach to information extraction that assumes a domain ontology is already available and that text should be mined to extract instances of its predicates. To solve this task, Named Entity Recognition (NER) 6 can be initially adopted. However, NER only focuses on recognizing specific types of concepts (eg, persons, places, or organizations), without considering the relationships among them. Relational Information Extraction additionally attempts to identify these relationships, as supported by, eg, SystemT, * an IBM-developed commercial tool used to derive rich information from documents and emails. Our approach therefore combined NER and Relational Extraction techniques, using open-source technologies. Among the various existing open-source tools for information extraction (eg, LingPipe † or OpenNLP ‡ ), we chose General Architecture for Text Engineering (GATE), § given the flexibility it offers to customize its underlying architecture and to incorporate external components developed by third parties. The extraction activity in GATE is typically carried out in several stages, each depending on the contingent needs of the user, who can, for instance, adopt existing dictionaries (aka Gazetteers) for NER or create new ones, and/or specify tailored extraction rule sets using the Java Annotation Pattern Language (JAPE). 7 In addition, GATE...
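To make the two steps concrete, the following is a minimal sketch (in Python, not actual GATE/JAPE code) of how a gazetteer-based NER pass and a simple JAPE-like pattern rule fit together: a dictionary lookup first annotates entity mentions, and a second rule links a Person annotation to an Organization annotation when a connecting phrase appears between them. The gazetteer entries, entity names, and the `worksFor` relation are all hypothetical examples.

```python
import re

# Hypothetical gazetteers: entity type -> known surface forms
GAZETTEERS = {
    "Person": {"Alice Rossi", "Bob Smith"},
    "Organization": {"Acme Corp", "IBM"},
}

def annotate(text):
    """Gazetteer-based NER: return (start, end, type, surface) annotations."""
    annotations = []
    for etype, entries in GAZETTEERS.items():
        for entry in entries:
            for m in re.finditer(re.escape(entry), text):
                annotations.append((m.start(), m.end(), etype, entry))
    return sorted(annotations)

def extract_works_for(text):
    """JAPE-like rule: Person + 'works for/at' + Organization -> relation instance."""
    relations = []
    anns = annotate(text)
    for s1, e1, t1, v1 in anns:
        for s2, e2, t2, v2 in anns:
            if t1 == "Person" and t2 == "Organization" and e1 < s2:
                gap = text[e1:s2].strip()
                if gap in {"works for", "works at"}:
                    relations.append((v1, "worksFor", v2))
    return relations

print(extract_works_for("Alice Rossi works for Acme Corp."))
```

A real GATE pipeline expresses the second step declaratively as a JAPE rule over annotation types rather than as nested loops, but the division of labor is the same: lexical lookup produces typed annotations, and pattern rules over those annotations yield ontology predicate instances.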