Nattapong TONGTEP†a), Student Member and Thanaruk THEERAMUNKONG†b), Member
SUMMARY Extracting named entities (NEs) and their relations is more difficult in Thai than in other languages due to several Thai-specific characteristics, including no explicit boundaries for words, phrases and sentences; few case markers and modifier clues; high ambiguity in compound words and serial verbs; and flexible word order. Unlike most previous works, which focused on NE relations of specific actions such as work for, live in, located in, and kill, this paper proposes a more general type of NE relation, called predicate-oriented relation (PoR), where an extracted action part (verb) is used as a core component to associate related named entities extracted from Thai texts. Lacking a practical parser for the Thai language, we present three types of surface features, i.e., punctuation marks (such as token spaces), entity types and the number of entities, and then apply five commonly used learning schemes to investigate their performance on predicate-oriented relation extraction. The experimental results show that our approach achieves F-measures of 97.76%, 99.19%, 95.00% and 93.50% on four different types of predicate-oriented relation (action-location, location-action, action-person and person-action), respectively, in crime-related news documents using a data set of 1,736 entity pairs. The effects of NE extraction techniques, feature sets and class imbalance on the performance of relation extraction are explored.
key words: relation extraction, named entity, surface feature, information extraction
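As a rough illustration of the three kinds of surface features named in the summary (token spaces, entity types and the number of entities), the following minimal sketch computes them for one candidate action-entity pair. It is not the authors' implementation: the `Token` class, the tag set (`ACT`, `PER`, `LOC`, `SPACE`, `O`), the function name `surface_features` and the choice to count only the material lying between the two candidates are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Token:
    text: str
    kind: str  # hypothetical tag set: "ACT", "PER", "LOC", "SPACE" or "O"


# Assumption: only person and location tags count as named entities here.
NE_KINDS = {"PER", "LOC"}


def surface_features(tokens: List[Token], i: int, j: int) -> Dict[str, Union[str, int]]:
    """Return surface features for the candidate pair (tokens[i], tokens[j])."""
    if not 0 <= i < j < len(tokens):
        raise ValueError("expected 0 <= i < j < len(tokens)")
    between = tokens[i + 1 : j]  # everything strictly between the two candidates
    return {
        "left_type": tokens[i].kind,
        "right_type": tokens[j].kind,
        "spaces_between": sum(1 for t in between if t.kind == "SPACE"),
        "entities_between": sum(1 for t in between if t.kind in NE_KINDS),
    }


# Placeholder tokens standing in for a segmented, NE-tagged Thai sentence.
sentence = [
    Token("person_1", "PER"),
    Token(" ", "SPACE"),
    Token("action_1", "ACT"),
    Token(" ", "SPACE"),
    Token("location_1", "LOC"),
    Token(" ", "SPACE"),
    Token("person_2", "PER"),
]

near = surface_features(sentence, 2, 4)  # action_1 paired with location_1
far = surface_features(sentence, 2, 6)   # action_1 paired with person_2
```

Feature dictionaries of this shape could then be passed to any standard classifier that decides whether the pair is related; which learners and which exact feature encoding the paper uses is not specified in this excerpt.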
Introduction
Recently several information extraction (IE) approaches have been proposed to transform unstructured text into a knowledge base, such as those in [1 [25] presented a so-called CORDER system to find relations among entities in an organization's documents on a social network. The mined knowledge was in the form of who works with whom, on which projects and with which customers, using a strength measured for each co-occurring NE based on its co-occurrences and distances with the target. CORDER comprised the steps of data selection, named entity recognition and ranking by relation strengths.
As an integrated community project, tasks of entity and relation extraction from English, Chinese and Arabic texts were conducted in the Automatic Content Extraction (ACE) program*, including three sets of annotation tasks: Entity Detection and Tracking (EDT), Relation Detection and Characterization (RDC), and Event Detection and Characterization (EDC) [26]. Three main EDT tasks were the detection of entities mentioned in a document, the tracking of entities
* http://projects.ldc.upenn.edu/ace/