This paper presents the submission of our team (NORMAS) to the SemEval 2016 semantic textual similarity (STS) shared task. We submitted three system runs, each using a set of 36 features extracted from the training set. The runs explore the use of the following three machine learning algorithms: Support Vector Regression, Elastic Net and Random Forest. Each run was trained using sentence pairs from the STS 2012 training data. Features extracted include lexical, syntactic and semantic features. This paper describes the features we designed for assessing the semantic similarity between sentence pairs, the models we build using these features and the performance obtained by the resulting systems on the 2016 evaluation data.
Job seekers find themselves increasingly duped and misled by fraudulent job advertisements, posing a threat to their privacy, security and well-being. There is a clear need for solutions that can protect innocent job seekers. Existing approaches to detecting fraudulent jobs do not scale well, function like a black-box, and lack interpretability, which is essential to guide applicants’ decision-making. Moreover, commonly used lexical features may be insufficient as the representation does not capture contextual semantics of the underlying document. Hence, this paper explores to what extent different categorizations of fraudulent jobs can be classified. In addition, this paper seeks to find what type of features are most relevant in classifying the type of fraudulent job. In this paper, we develop and validate a machine learning system for identifying identity theft, corporate identity theft and multi-level marketing amongst fraudulent job advertisements. We utilized four classes of features: empirical rule set-based features, bag-of-word models, most recent state-of-the-art word embeddings and transformer models for various machine learning classifiers. The machine learning models were validated by evaluating them on a publicly available job description dataset. Our results indicate that the word embeddings and transformer-based features consistently outperformed the handcrafted rule-set based features class. Ultimately, a Gradient Boosting classifier with a combination of empirical rule-set based features, parts-of-speech tags and bag-of-words vectors achieved the best performance with an F1-score of 0.88.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.