Neural Networks Classifier for Data Selection in Statistical Machine Translation

Peris, Álvaro; Chinea-Rios, Mara; Casacuberta, Francisco

doi:10.1515/pralin-2017-0027

Cited by 9 publications

(6 citation statements)

References 12 publications


“…Other related works on domain adaptation include Dou et al (2019a) that adapts multi-domain NMT models with domain-aware feature embeddings, which are learned via an auxiliary language modeling task. Peris et al (2017) proposed neural-network based classifiers for data selection in SMT. For more related work on data selection and domain adaptation in the context of MT, see the surveys by Eetemadi et al (2015) for SMT and more recently Chu and Wang (2018) for NMT.…”

Section: Related Work (mentioning)

confidence: 99%

Unsupervised Domain Clusters in Pretrained Language Models

Aharoni

Goldberg

2020

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics


The notion of "in-domain data" in NLP is often over-simplistic and vague, as textual data varies in many nuanced linguistic aspects such as topic, style or level of formality. In addition, domain labels are many times unavailable, making it challenging to build domain-specific systems. We show that massive pretrained language models implicitly learn sentence representations that cluster by domains without supervision, suggesting a simple data-driven definition of domains in textual data. We harness this property and propose domain data selection methods based on such models, which require only a small set of in-domain monolingual data. We evaluate our data selection methods for neural machine translation across five diverse domains, where they outperform an established approach as measured by both BLEU and by precision and recall of sentence selection with respect to an oracle.


“…The main distinction is that they used neural language models for selection rather than n-gram models. , Peris et al (2017), and selected based on convolutional and bidirectional long short-term memory neural networks.…”

Section: Related Work (mentioning)

confidence: 99%

Accelerating Text Communication via Abbreviated Sentence Input

Adhikary

Berger

Vertanen

2021

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer


Typing every character in a text message may require more time or effort than strictly necessary. Skipping spaces or other characters may be able to speed input and reduce a user's physical input effort. This can be particularly important for people with motor impairments. In a large crowdsourced study, we found workers frequently abbreviated text by omitting mid-word vowels. We designed a recognizer optimized for expanding noisy abbreviated input where users often omit spaces and mid-word vowels. We show using neural language models for selecting conversational-style training text and for rescoring the recognizer's n-best sentences improved accuracy. On noisy touchscreen data collected from hundreds of users, we found accurate abbreviated input was possible even if a third of characters was omitted. Finally, in a study where users had to dwell for a second on each key, sentence abbreviated input was competitive with a conventional keyboard with word predictions.


“…They also use n-gram (n=4) based language models for representing the sentences. Peris et al [25] used neural network-based classifiers to select data for the machine translation task. Recently, Gururangan et al [14] used unsupervised data selection for increasing training data in a low-resource scenario.…”

Section: Related Work (mentioning)

confidence: 99%

UDON: Unsupervised Data SelectiON for Biomedical Entity Recognition

Akdemir

Shibuya

2021

2021 4th International Conference on Computing and Big Data


High-quality training datasets are critical for building successful Machine Learning (ML) based NLP systems. However, these datasets are not always available in low-resource contexts such as the biomedical domain. Here, selecting relevant training data is as important as the choice of the ML model. In this study we propose UDON: Unsupervised Data selectiON for biomedical entity recognition using domain-specific pretrained Language Models (LMs). We first show that pretrained LMs succeed at implicitly learning the differences between datasets without any supervision, and then use these models to select relevant data instances. Next, we evaluate the proposed methods for entity recognition on seven biomedical datasets and one news domain dataset using four LMs and three selection methods. Our results show that using pretrained domain-specific LMs for data selection outperforms all other approaches. Finally, we use domain classification as an auxiliary task for pretraining the neural network on the in-domain dataset and show this yields further improvements.
