Abstract: We propose a machine learning method to automatically classify n-grams extracted from a corpus into terms and non-terms. We use 10 statistics common in the previous term extraction literature as training features. The proposed method, applicable to term recognition across multiple domains and languages, can help 1) avoid laborious post-processing work (e.g. subjective threshold setting); 2) handle class skewness and demonstrate noticeable resilience to domain shift in the training data. Experiments a…
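As a rough sketch of the pipeline this abstract describes, each candidate n-gram can be mapped to a vector of corpus statistics before classification. The three statistics below (raw frequency, document frequency, and a TF-IDF-style score) are illustrative stand-ins, not the paper's actual 10 features:

```python
from collections import Counter
from math import log

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_features(docs, n=2):
    """Map each n-gram candidate to a small statistical feature vector:
    (raw frequency, document frequency, TF-IDF-style score). These are
    illustrative stand-ins for the paper's 10 statistics."""
    tf = Counter()
    df = Counter()
    for doc in docs:
        grams = ngrams(doc, n)
        tf.update(grams)
        df.update(set(grams))  # count each gram once per document
    n_docs = len(docs)
    return {g: (tf[g], df[g], tf[g] * log(n_docs / df[g] + 1.0)) for g in tf}

docs = [
    "automatic term extraction from a domain corpus".split(),
    "term extraction uses statistical features".split(),
]
feats = candidate_features(docs)
print(feats[("term", "extraction")])
```

Feature vectors like these can then be fed to any off-the-shelf classifier; the abstract's point is that the classifier, rather than a hand-tuned threshold, separates terms from non-terms.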
“…4. In the table, [5] achieves a satisfying result with a Random Forest method in their paper, but the feature preparation is complex and time-consuming. The training data is also re-balanced across the positive and negative instances.…”
Section: Results and Analysis (mentioning)
confidence: 97%
“…2. Yuan et al [5] is a feature-based machine learning method using n-grams as term candidates, with 10 kinds of features pre-computed for each candidate. Our best models for testing are chosen by the loss on the development dataset. Note that decreasing the term ratio α will increase the precision but degrade the recall (Fig.…”
Section: Results and Analysis (mentioning)
confidence: 99%
“…Machine-learning based ATE [5,6,7,8] designs and learns different features from the raw text or from syntactic information, and then integrates these features into a machine learning method (such as a conditional random field or a support vector classifier). However, different domains, and especially different languages, exhibit different feature patterns, making such a method specific to one language or domain.…”
Section: Related Work (mentioning)
confidence: 99%
“…2 for more details about term span. [1,3], [1,4], [1,5], [2,2], [2,3], [2,4], [2,5], [3,3], [3,4], [3,5], [4,4], [4,5], [5,5]…”
In this paper, we propose a deep learning-based end-to-end method for domain-specific automatic term extraction (ATE): it considers all possible term spans within a fixed length in a sentence and predicts whether each span is a conceptual term. Compared with current ATE methods, the model supports nested term extraction and does not crucially need extra (extracted) features. Results show that it achieves high recall and comparable precision on the term extraction task with segmented raw text as input.
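The candidate set described above, every token span up to a fixed maximum length, can be sketched in a few lines. This is a minimal illustration; 1-based [start, end] indexing is assumed to match the span listing quoted earlier:

```python
def enumerate_spans(n_tokens, max_len):
    """All 1-based [start, end] token spans of length <= max_len.
    This is the candidate set a span-based term extractor would
    score, including nested spans such as [2,3] inside [1,4]."""
    return [(i, j)
            for i in range(1, n_tokens + 1)
            for j in range(i, min(i + max_len - 1, n_tokens) + 1)]

# a 5-token sentence with max span length 5 yields 15 candidates
spans = enumerate_spans(5, 5)
print(spans)
```

Because overlapping spans are scored independently, a prediction over this set naturally supports nested terms, which token-level BIO tagging cannot express.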
“…Given training data, machine learning based methods [Astrakhantsev 2014;Conrado et al 2013;Fedorenko et al 2014;Maldonado and Lewis 2016] typically transform training instances into a feature space and train a classifier that can be later used for prediction. The features can be linguistic (e.g., PoS pattern, presence of special characters, etc), or statistical or a combination of both, which often utilise scores calculated by statistical ATE metrics [Maldonado and Lewis 2016;Yuan et al 2017]. However, one of the major problems in applying machine learning to ATE is the availability of reliable training data.…”
Section: Classic Unithood and Termhood Based Methods (mentioning)
Automatic Term Extraction deals with the extraction of terminology from a domain-specific corpus, and has long been an established research area in data and knowledge acquisition. ATE remains a challenging task, as no existing ATE method can consistently outperform the others in every domain. This work adopts a refreshed perspective on this problem: instead of searching for a 'one-size-fits-all' solution that may never exist, we propose to develop generic methods to 'enhance' existing ATE methods. We introduce SemRe-Rank, the first method based on this principle, to incorporate semantic relatedness, an often overlooked avenue, into an existing ATE method to further improve its performance. SemRe-Rank incorporates word embeddings into a personalised PageRank process to compute 'semantic importance' scores for candidate terms from a graph of semantically related words (nodes), which are then used to revise the scores of candidate terms computed by a base ATE algorithm. Extensively evaluated with 13 state-of-the-art base ATE methods on four datasets of diverse nature, it is shown to achieve widespread improvement over all base methods and across all datasets, with gains of up to 15 percentage points when measured by Precision in the top-ranked K candidate terms (averaged over a set of K's), or up to 28 percentage points in F1 measured at a K equal to the expected number of real terms among the candidates (F1 in short). Compared to an alternative approach built on the well-known TextRank algorithm, SemRe-Rank can outperform it by up to 8 points in Precision at top K, or up to 17 points in F1.
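The personalised PageRank step at the core of SemRe-Rank can be illustrated with a plain power iteration on a toy word graph. The graph, seed words, and parameter values below are assumptions for illustration only; the paper builds its graph from word-embedding similarities:

```python
def personalised_pagerank(graph, seeds, damping=0.85, iters=50):
    """Power iteration for personalised PageRank on a word graph.
    `graph` maps node -> list of neighbours; `seeds` restrict the
    teleport distribution, biasing scores toward words related to
    the seeds. A minimal sketch, not SemRe-Rank's full pipeline."""
    nodes = list(graph)
    teleport = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        rank = {
            v: (1 - damping) * teleport[v]
               + damping * sum(rank[u] / len(graph[u])
                               for u in graph if v in graph[u])
            for v in nodes
        }
    return rank

# toy graph of "semantically related" words (edges are made up)
g = {
    "neural": ["network", "learning"],
    "network": ["neural", "learning"],
    "learning": ["neural", "network", "banana"],
    "banana": ["learning"],
}
scores = personalised_pagerank(g, seeds={"neural", "network"})
```

Words close to the seed set accumulate more score than unrelated ones; SemRe-Rank then uses such scores to revise the rankings produced by a base ATE algorithm.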