Sentence representation at the semantic level is a challenging task for Natural Language Processing and Artificial Intelligence. Despite advances in word embeddings (i.e. word vector representations), capturing sentence meaning remains an open question owing to the complexity of semantic interactions among words. In this paper, we present an unsupervised embedding method that learns sentence representations from unlabeled text by modeling a sentence as a weighted series of word embeddings. The weights of the word embeddings are fitted using Shannon's word entropies provided by the Term Frequency-Inverse Document Frequency (TF-IDF) transform. The hyperparameters of the model can be selected according to the properties of the data (e.g. sentence length and textual genre). Hyperparameter selection involves word embedding methods and dimensionalities, as well as weighting schemata. Our method offers advantages over existing methods: identifiable modules, short training time, online inference of (unseen) sentence representations, as well as independence from domain, external knowledge and language resources. Results showed that our model outperformed the state of the art in well-known Semantic Textual Similarity (STS) benchmarks. Moreover, our model reached state-of-the-art performance when compared to supervised and knowledge-based STS systems.
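As a rough illustration of modeling a sentence as a weighted series of word embeddings, the following minimal sketch averages word vectors using TF-IDF weights. It rests on several assumptions that do not come from the paper: the smoothed IDF formula, the toy three-dimensional embeddings, the example corpus, and the helper names `idf_weights`, `sentence_vector` and `cosine` are all hypothetical. The entropy-based weighting described above is not reproduced here; a plain TF-IDF weight stands in for it.

```python
import math
from collections import Counter


def idf_weights(corpus):
    """Smoothed inverse document frequency for each word in a tokenised corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(word for sentence in corpus for word in set(sentence))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def sentence_vector(tokens, embeddings, idf, dim):
    """TF-IDF weighted average of the embeddings of the known words in `tokens`."""
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in Counter(tokens).items():
        if word not in embeddings or word not in idf:
            continue  # skip words without an embedding or an IDF weight
        weight = count * idf[word]
        for i, value in enumerate(embeddings[word]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # all-zero vector when no word is known
    return [value / weight_sum for value in total]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy corpus and hypothetical 3-dimensional embeddings (assumptions for illustration).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
embeddings = {
    "the": [0.1, 0.1, 0.1],
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "sat": [0.0, 0.9, 0.1],
    "ran": [0.0, 0.1, 0.9],
}
idf = idf_weights(corpus)
vectors = [sentence_vector(s, embeddings, idf, dim=3) for s in corpus]
sim_cat_dog = cosine(vectors[0], vectors[1])  # "the cat sat" vs "the dog sat"
sim_sat_ran = cosine(vectors[0], vectors[2])  # "the cat sat" vs "the cat ran"
```

In this toy setting the frequent word "the" receives the lowest weight, so the rarer content words dominate each sentence vector. A sentence similarity score of the kind used in STS evaluation can then be taken as the cosine between two such vectors.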
Abstract. The exponential growth of the Internet has allowed the development of a market of online job search sites. This paper presents the E-Gen system (Automatic Job Offer Processing system for Human Resources). E-Gen will implement two complex tasks: (1) an analysis and categorisation of job postings, which are unstructured text documents (e-mails of job listings, possibly with an attached document); and (2) an analysis and relevance ranking of the candidate answers (cover letter and curriculum vitae). This paper presents a strategy to resolve the first task: after a process of filtering and lemmatisation, we use a vectorial representation before generating a classification with Support Vector Machines. This first classification is afterwards transmitted to a "corrective" post-process which improves the quality of the solution.
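The first task (filtering, lemmatisation, vectorial representation, then classification with a Support Vector Machine) can be sketched as below. This is only an illustration under stated assumptions, not the E-Gen implementation: the stopword list, the plural-stripping normaliser (a crude stand-in for real lemmatisation), the bag-of-words vectors, the toy postings with their labels, and the hand-written sub-gradient trainer for a linear SVM are all hypothetical. The corrective post-process is not covered.

```python
import re

STOPWORDS = {"a", "an", "and", "for", "of", "the", "wanted", "needed"}  # hypothetical list


def normalise(text):
    """Lowercase, drop stopwords, strip a plural 's' (crude stand-in for lemmatisation)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in kept]


def vectorise(tokens, vocabulary):
    """Bag-of-words count vector over a fixed vocabulary; unseen words are ignored."""
    vector = [0.0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            vector[vocabulary[token]] += 1.0
    return vector


def train_linear_svm(X, y, epochs=50, lr=0.1, lam=0.001):
    """Sub-gradient descent on the L2-regularised hinge loss (labels are +1 / -1)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1.0 - lr * lam) for wi in w]  # regularisation shrinkage
            if margin < 1.0:  # point violates the margin: move the boundary
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b


def predict(text, vocabulary, w, b):
    """Return +1 or -1 for a raw posting, using the trained linear model."""
    x = vectorise(normalise(text), vocabulary)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy postings: +1 = IT job, -1 = finance job (made-up data for illustration).
postings = [
    ("Developers wanted for software projects", 1),
    ("Software engineer developing web applications", 1),
    ("Accountant needed for auditing accounts", -1),
    ("Financial auditor managing accounts", -1),
]
terms = sorted({t for text, _ in postings for t in normalise(text)})
vocabulary = {term: index for index, term in enumerate(terms)}
X = [vectorise(normalise(text), vocabulary) for text, _ in postings]
y = [label for _, label in postings]
w, b = train_linear_svm(X, y)
predictions = [predict(text, vocabulary, w, b) for text, _ in postings]
```

A production system would more plausibly rely on a proper lemmatiser and an established SVM library; the hand-written trainer is used here only to keep the sketch self-contained.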
Introduction

The exponential growth of the Internet has allowed the development of an online job-search sites market [1][2][3]. The mass of information obtained through candidate responses is difficult for companies to manage [4][5][6]. It is therefore indispensable to process this information in an automatic or assisted way. The Laboratoire Informatique d'Avignon (LIA) and Aktor Interactive have developed the E-Gen system in order to resolve this problem. It will be composed of two main modules:

1. A module to extract information from a corpus of e-mails containing job descriptions.
2. A module to analyse and compute a relevance ranking of the candidate answers (cover letter and curriculum vitae).

In order to extract useful information, the system analyses the contents of the e-mails containing job descriptions. In this step, there are many difficulties and …