A common approach to text analysis and text-based classification is to use term frequencies or patterns of terms as features. However, such features alone may not differentiate fake from authentic job advertisements. In this work, we therefore propose a method for detecting fake job recruitment posts using a novel set of features designed to reflect the behavior of fraudsters who present fake information: missing information, exaggeration, and credibility. Each feature is represented as a category together with an automatically computable readability score. Data from the EMSCAD dataset were transformed according to the designed features and used to train a fake job detection model. The experimental results show that models built from the designed features outperformed those based on the term-frequency approach under every machine learning technique applied. The proposed method yielded 97.64% accuracy, 0.97 precision, and 0.99 recall for its best model when classifying fake job advertisements.
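The three behavioral features above could be computed roughly as follows. This is a minimal illustrative sketch, not the paper's actual feature definitions: the field names, hype-word list, and thresholds are all assumptions.

```python
# Sketch: deriving the three behavioral features (missing information,
# exaggeration, credibility) from a job posting. Field names, the
# hype-word list, and the logo heuristic are illustrative assumptions.

def missing_info_score(posting, required=("company", "salary", "requirements", "contact")):
    """Fraction of expected fields left empty (higher = more missing)."""
    missing = sum(1 for f in required if not posting.get(f))
    return missing / len(required)

def exaggeration_score(text, hype_words=("unlimited", "guaranteed", "no experience", "earn big")):
    """Count of hype phrases, a crude proxy for exaggerated claims."""
    t = text.lower()
    return sum(t.count(w) for w in hype_words)

def credibility_category(posting):
    """Categorical credibility: does the posting name a company and show a logo?"""
    has_company = bool(posting.get("company"))
    has_logo = bool(posting.get("has_company_logo"))
    return "high" if (has_company and has_logo) else "low"

posting = {
    "company": "",
    "salary": "",
    "requirements": "No experience needed!",
    "contact": "",
    "has_company_logo": False,
    "description": "Unlimited income guaranteed, earn big from home.",
}
features = (
    missing_info_score(posting),
    exaggeration_score(posting["description"]),
    credibility_category(posting),
)
```

Vectors like `features` would then be fed, alongside the readability score, to any standard classifier.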
Character-based word segmentation models have been extensively applied to Asian languages, including Thai, owing to their promising performance. These models estimate word boundaries from a character sequence; however, a Thai character unit in a sequence has no inherent meaning, in contrast with word, subword, and character cluster units, which carry more meaningful linguistic information. In this paper, we propose a Thai word segmentation model that draws on various types of information, including words, subwords, and character clusters, from a character sequence. Our model applies multiple attention mechanisms to refine segmentation inferences by estimating the significant relationships among characters and the various unit types. We evaluated our model on three Thai datasets, and the experimental results show that it outperforms other Thai word segmentation models, demonstrating the validity of using character clusters over subword units. A case study on sample Thai text supported these results. According to our analysis, particularly the case study, our model segments Thai text accurately where existing models yield incorrect results that violate the Thai writing system.
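To illustrate why character clusters are more meaningful than raw characters, the sketch below groups a Thai character sequence so that combining vowels and tone marks (which cannot stand alone, per the Thai writing system) attach to their base consonant. This is a deliberate simplification of full Thai character-cluster rules, not the paper's method.

```python
# Minimal sketch of grouping a Thai character sequence into character
# clusters: combining above/below vowels and tone marks are attached to
# the preceding base character. A simplification of full TCC rules.

# Thai combining marks (above/below vowels, tone marks, etc.).
COMBINING = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39\u0e3a"
                "\u0e47\u0e48\u0e49\u0e4a\u0e4b\u0e4c\u0e4d\u0e4e")

def char_clusters(text):
    """Group characters so no cluster starts with a combining mark."""
    clusters = []
    for ch in text:
        if clusters and ch in COMBINING:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "ที่นี่" ("here"): six characters, but only two valid clusters,
# since a boundary inside a cluster would violate the writing system.
print(char_clusters("\u0e17\u0e35\u0e48\u0e19\u0e35\u0e48"))
```

A segmentation model that only considers cluster boundaries as candidate word boundaries can never split inside such a unit.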
Although a number of studies have addressed the Khmer language in the field of natural language processing, along with some resources for word segmentation and POS tagging, high-level syntactic resources such as treebanks and grammars are still lacking. This paper presents a semi-automatic framework for constructing a Khmer treebank and extracting Khmer grammar rules from a set of sentences taken from Khmer grammar books. The sentences are first manually annotated; once the treebank is obtained, it is processed to generate grammar rules with their probabilities. In our experiments, the annotated trees and the extracted grammar rules are analyzed both quantitatively and qualitatively. Finally, the results are evaluated under three evaluation processes, namely self-consistency, 5-fold cross-validation, and leave-one-out cross-validation, using three metrics: precision, recall, and F1-measure. Across the three validations, self-consistency showed the best result at more than 92%, followed by leave-one-out cross-validation and 5-fold cross-validation with averages of 88% and 75%, respectively. On the other hand, the crossing-bracket data show that leave-one-out cross-validation holds the highest average at 96%, while the other two reach 85% and 89%, respectively.
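The rule-extraction step described above, deriving probabilistic grammar rules from annotated trees, can be sketched as follows. Trees are encoded as nested tuples, and rule probabilities are estimated by relative frequency; the Khmer words and labels in the example are illustrative assumptions, not data from the treebank.

```python
from collections import Counter, defaultdict

# Sketch: extracting probabilistic grammar rules from annotated trees.
# A tree is a nested tuple (label, child, ...); leaves are strings.
# The example sentences and labels are illustrative, not real data.

def rules(tree):
    """Yield (lhs, rhs) productions from a constituency tree."""
    if isinstance(tree, tuple):
        label, *children = tree
        rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
        yield (label, rhs)
        for c in children:
            yield from rules(c)

def pcfg(trees):
    """Estimate P(lhs -> rhs) by relative frequency over the treebank."""
    counts = Counter(r for t in trees for r in rules(t))
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {r: n / lhs_totals[r[0]] for r, n in counts.items()}

trees = [
    ("S", ("NP", "khnhom"), ("VP", ("V", "tov"), ("NP", "phsar"))),
    ("S", ("NP", "koat"), ("VP", ("V", "nham"))),
]
probs = pcfg(trees)
```

Counting every production in every tree and normalizing per left-hand side is the standard maximum-likelihood estimate for a PCFG, which the extracted rules and probabilities described above appear to correspond to.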