Network representation learning (NRL) aims to learn low-dimensional vectors for vertices in a network. Most existing NRL methods focus on learning representations from the local context of vertices (such as their neighbors). Nevertheless, vertices in many complex networks also exhibit significant global patterns, widely known as communities. Intuitively, vertices in the same community tend to connect densely and share common attributes. These patterns are expected to improve NRL and benefit relevant evaluation tasks, such as link prediction and vertex classification. Inspired by the analogy between network representation learning and text modeling, we propose a unified NRL framework that incorporates community information of vertices, named Community-enhanced Network Representation Learning (CNRL). CNRL simultaneously detects the community distribution of each vertex and learns embeddings of both vertices and communities. Moreover, the proposed community enhancement mechanism can be applied to various existing NRL models. In experiments, we evaluate our model on vertex classification, link prediction, and community detection using several real-world datasets. The results demonstrate that CNRL significantly and consistently outperforms state-of-the-art methods while verifying our assumptions on the correlations between vertices and communities.
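To make the text-modeling analogy concrete, below is a minimal sketch of what a community-enhanced skip-gram could look like: random-walk sequences play the role of sentences and communities the role of topics, and each context prediction uses the vertex embedding plus a sampled community embedding. The toy graph, the empirical-count community sampler, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph as an adjacency list: two loosely connected triangles.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
V, K, D = len(adj), 2, 8            # vertices, communities, embedding size

def random_walk(start, length=6):
    walk = [start]
    for _ in range(length - 1):
        walk.append(int(rng.choice(adj[walk[-1]])))
    return walk

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vert = rng.normal(0, 0.1, (V, D))   # vertex embeddings
comm = rng.normal(0, 0.1, (K, D))   # community embeddings
ctx  = rng.normal(0, 0.1, (V, D))   # context ("output") embeddings
counts = np.ones((V, K))            # vertex-community assignment counts

lr = 0.05
for _ in range(2000):
    walk = random_walk(int(rng.integers(V)))
    for i, v in enumerate(walk):
        # Sample a community for v from its empirical community distribution
        # (a cheap stand-in for a topic-model-style assignment step).
        z = int(rng.choice(K, p=counts[v] / counts[v].sum()))
        counts[v, z] += 1
        for j in range(max(0, i - 2), min(len(walk), i + 3)):
            if j == i:
                continue
            # Predict each context vertex from vertex + community embedding,
            # with one random negative sample (skip-gram with neg. sampling).
            for target, label in ((walk[j], 1.0), (int(rng.integers(V)), 0.0)):
                h = vert[v] + comm[z]
                c = ctx[target].copy()
                g = lr * (label - sigmoid(h @ c))
                ctx[target] += g * h
                vert[v] += g * c
                comm[z] += g * c

print(np.round(vert, 2))            # learned vertex embeddings
```

After training, vertices within the same triangle end up with similar embeddings, and the two community vectors separate the two dense regions; this is the effect the abstract attributes to jointly learning vertex and community representations.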
While neural machine translation (NMT) achieves remarkable performance on clean, in-domain text, performance is known to degrade drastically on text full of typos, grammatical errors, and other varieties of noise. In this work, we propose a multitask learning algorithm for transformer-based MT systems that is more resilient to this noise. We describe our submission to the WMT 2019 Robustness shared task (Li et al., 2019) based on this method. Our model achieves a BLEU score of 32.8 on the shared task French-to-English dataset, which is 7.1 BLEU points higher than the baseline vanilla transformer trained on clean text.
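A common recipe for this kind of noise resilience is to train on synthetically noised source text alongside clean text and combine the two losses. The sketch below illustrates that general idea with a toy character-level noiser and a weighted two-term loss; the noise operations and the weight `alpha` are assumptions for illustration, and the submission's exact multitask objective may differ.

```python
import random

random.seed(0)

def add_noise(sentence, p=0.1):
    """Randomly swap, drop, or duplicate characters with probability p."""
    chars, out, i = list(sentence), [], 0
    while i < len(chars):
        r = random.random()
        if r < p / 3 and i + 1 < len(chars):   # swap with the next char
            out.extend([chars[i + 1], chars[i]])
            i += 2
        elif r < 2 * p / 3:                    # drop this char
            i += 1
        elif r < p:                            # duplicate this char
            out.extend([chars[i], chars[i]])
            i += 1
        else:                                  # keep this char unchanged
            out.append(chars[i])
            i += 1
    return "".join(out)

def multitask_loss(loss_clean, loss_noisy, alpha=0.5):
    """Weighted combination of the clean-input and noised-input losses."""
    return alpha * loss_clean + (1 - alpha) * loss_noisy

src = "the quick brown fox jumps over the lazy dog"
print(add_noise(src))
# During training, each batch would contribute both terms, e.g.:
# total = multitask_loss(ce(model(src), tgt), ce(model(add_noise(src)), tgt))
```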
Active learning (AL) for machine translation (MT) has been well studied for the phrase-based MT paradigm, and several AL algorithms for data sampling have been proposed over the years. However, given the rapid advancement of neural methods, these algorithms have not been thoroughly investigated in the context of neural MT (NMT). In this work, we address this missing aspect by conducting a systematic comparison of different AL methods in a simulated AL framework. Our experimental setup uses: i) a state-of-the-art NMT architecture to achieve realistic results; and ii) the same dataset (WMT'13 English-Spanish) to ensure a fair comparison across methods. We then demonstrate how recent advancements in unsupervised pre-training and paraphrastic embeddings can be used to improve existing AL methods. Finally, we propose a neural extension of an AL sampling method used in the context of phrase-based MT: Round Trip Translation Likelihood (RTTL). RTTL uses a bidirectional translation model to estimate the loss of information during translation, and it outperforms previous methods.
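The RTTL sampling step can be pictured as follows: translate each unlabeled source sentence forward, score how well the reverse model reconstructs the original, and send the worst-reconstructed sentences for annotation. In this hedged sketch, `forward_translate` and `reverse_logprob` are stand-ins for real NMT models, and the length normalization is a common convention rather than a detail taken from the paper.

```python
import random

random.seed(0)

def forward_translate(src):
    # Stand-in for the forward (e.g. English->Spanish) NMT decode.
    return "hyp: " + src

def reverse_logprob(src, hyp):
    # Stand-in for the reverse model's log P(src | hyp); a real system
    # would force-decode src under the reverse NMT model.
    return -random.uniform(0.5, 5.0)

def rttl_select(pool, budget):
    """Pick the sentences whose round-trip likelihood is lowest."""
    scored = []
    for src in pool:
        hyp = forward_translate(src)
        score = reverse_logprob(src, hyp) / max(1, len(src.split()))
        scored.append((score, src))
    scored.sort()            # lowest likelihood = most information lost
    return [s for _, s in scored[:budget]]

pool = ["short sentence", "a rather longer and rarer sentence", "hello world"]
print(rttl_select(pool, budget=2))
```

The intuition is that a sentence the current model pair cannot round-trip faithfully is one the models handle poorly, so labeling it should be most informative.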
Linguistic Inquiry and Word Count (LIWC) is a word-counting software tool that has been used for quantitative text analysis in many fields. Due to its success and popularity, the core lexicon has been translated into Chinese and many other languages. However, the lexicon contains only several thousand words, far fewer than the number of common words in Chinese. Current approaches often require expanding the lexicon manually, which is time-consuming and requires linguistic experts. To address this issue, we propose to expand the LIWC lexicon automatically. Specifically, we cast it as a hierarchical classification problem and utilize a Sequence-to-Sequence model to classify words in the lexicon. Moreover, we use sememe information with an attention mechanism to capture the exact meanings of a word, so that we can build a more precise and comprehensive lexicon. The experimental results show that our model has a better understanding of word meanings with the help of sememes and achieves significant and consistent improvements over state-of-the-art methods. The source code of this paper can be obtained from https://github.com/thunlp/Auto_CLIWC.
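Below is a toy rendering of the core idea: the LIWC category hierarchy is decoded as a sequence (root to leaf), re-attending over the word's sememe embeddings at each step to disambiguate its meaning. The tiny category tree, dimensions, and untrained projection are illustrative assumptions; the released repository above contains the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                                   # embedding size

# Toy hierarchy: each category lists its children; decoding walks this tree.
children = {"ROOT": ["affect", "social"],
            "affect": ["posemo", "negemo"],
            "social": ["family", "friend"]}
cats = ["ROOT", "affect", "social", "posemo", "negemo", "family", "friend"]
cat_emb = {c: rng.normal(0, 0.1, D) for c in cats}

def attend(state, sememe_vecs):
    """Dot-product attention over the word's sememe embeddings."""
    scores = sememe_vecs @ state
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ sememe_vecs

def decode_path(word_vec, sememe_vecs, W):
    """Greedily decode a category path, re-attending at every step."""
    state, node, path = word_vec, "ROOT", []
    while node in children:
        context = attend(state, sememe_vecs)
        h = np.tanh(W @ np.concatenate([state, context]))
        # Score each child category and descend to the best one.
        node = max(children[node], key=lambda c: cat_emb[c] @ h)
        path.append(node)
        state = h
    return path

W = rng.normal(0, 0.1, (D, 2 * D))       # untrained projection (demo only)
word = rng.normal(0, 0.1, D)             # embedding of the word to classify
sememes = rng.normal(0, 0.1, (3, D))     # its sememe embeddings (e.g. HowNet)
print(decode_path(word, sememes, W))
```

In a trained model, `W`, the category embeddings, and the decoder state would all be learned end to end; here they are random, so the printed path only demonstrates the decoding mechanics.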