The present article explores two novel methods that integrate distributed representations with terminology extraction. Both methods assess the specificity of a word (unigram) to the target corpus by leveraging its distributed representation in the target domain as well as in the general domain. The first approach adopts this distributed specificity as a filter, and the second directly applies it to the corpus. The filter can be mounted on any other Automatic Terminology Extraction (ATE) method, allows merging any number of other ATE methods, and achieves remarkable results with minimal training. The direct approach does not perform as high as the filtering approach, but it reemphasizes that using distributed specificity as the words’ representation, very little data is required to train an ATE classifier. This encourages more minimally supervised ATE algorithms in the future.
Synthetic data, artificially generated by computer programs, has become more widely used in the financial domain to mitigate privacy concerns. Variational Autoencoder (VAE) is one of the most popular deep-learning models for generating synthetic data. However, VAE is often considered a “black box” due to its opaqueness. Although some studies have been conducted to provide explanatory insights into VAE, research focusing on explaining how the input data could influence VAE to create synthetic data, especially for tabular data, is still lacking. However, in the financial industry, most data are stored in a tabular format. This paper proposes a sensitivity-based method to assess the impact of inputted tabular data on how VAE synthesizes data. This sensitivity-based method can provide both global and local interpretations efficiently and intuitively. To test this method, a simulated dataset and three Kaggle banking tabular datasets were employed. The results confirmed the applicability of this proposed method.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.