With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: at least 15 corpora have no usable text, and in a significant fraction fewer than 50% of sentences are of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
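The audit statistics mentioned above (corpora with no usable text, corpora where fewer than 50% of sentences are acceptable) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the label names, the `summarize_audit` helper, and the toy data are hypothetical and are not taken from the paper's annotation scheme or results.

```python
from collections import Counter


def summarize_audit(labels_by_corpus, threshold=0.5):
    """Summarize hypothetical per-sentence audit labels.

    labels_by_corpus maps a corpus name to a list of labels, one per
    sampled sentence. The label "ok" marks an acceptable sentence; all
    other label names are illustrative placeholders.
    """
    fractions = {}
    for corpus, labels in labels_by_corpus.items():
        counts = Counter(labels)
        # Fraction of sampled sentences judged acceptable.
        fractions[corpus] = counts["ok"] / len(labels) if labels else 0.0
    # Corpora where no sampled sentence was acceptable.
    no_usable = sorted(c for c, f in fractions.items() if f == 0.0)
    # Corpora with some acceptable text, but below the threshold.
    below = sorted(c for c, f in fractions.items() if 0.0 < f < threshold)
    return fractions, no_usable, below


# Toy data, invented for this example only.
sample = {
    "corpus_a": ["ok", "ok", "wrong_language", "ok"],
    "corpus_b": ["non_linguistic", "wrong_language"],
    "corpus_c": ["ok", "non_linguistic", "wrong_language"],
}
fractions, no_usable, below = summarize_audit(sample)
print(fractions)
print("no usable text:", no_usable)
print("below threshold:", below)
```

The 0.5 default mirrors the 50% figure quoted in the abstract; a real audit would also need a sampling procedure and annotator guidelines, which this sketch does not cover.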
Among all animals, mosquitoes are responsible for the most deaths worldwide. Interestingly, not all types of mosquitoes spread diseases; only a select few are competent to do so. In any disease outbreak, an important first step is surveillance of vectors (i.e., those mosquitoes capable of spreading diseases). To do this today, public health workers lay several mosquito traps in the area of interest, and hundreds of mosquitoes get trapped. Among these hundreds, taxonomists have to identify only the vectors to gauge their density. This process today is manual, requires complex expertise/training, and is based on visual inspection of each trapped specimen under a microscope. It is long, stressful and self-limiting. This paper presents an innovative solution to this problem. Our technique assumes the presence of an embedded camera (similar to those in smart-phones) that can take pictures of trapped mosquitoes. The techniques proposed here then process these images to automatically classify the genus and species type. Our CNN model based on Inception-ResNet V2 and Transfer Learning yielded an overall accuracy of 80% in classifying mosquitoes when trained on 25,867 images of 250 trapped mosquito vector specimens captured via many smart-phone cameras. In particular, the accuracy of our model in classifying Aedes aegypti and Anopheles stephensi mosquitoes (both of which are deadly vectors) is amongst the highest. We present important lessons learned and the practical impact of our techniques towards the end of the paper.
Transformer-based language models (LMs) continue to advance state-of-the-art performance on NLP benchmark tasks, including tasks designed to mimic human-inspired "commonsense" competencies. To better understand the degree to which LMs can be said to have certain linguistic reasoning skills, researchers are beginning to adapt the tools and concepts of the field of psychometrics. But to what extent can the benefits flow in the other direction? That is, can LMs be of use in predicting what the psychometric properties of test items will be when those items are given to human participants? We gather responses from numerous human participants and LMs (transformer- and non-transformer-based) on a broad diagnostic test of linguistic competencies. We then use the responses to calculate standard psychometric properties of the items in the diagnostic test, using the human responses and the LM responses separately. We then determine how well these two sets of predictions match. We find cases in which transformer-based LMs predict psychometric properties consistently well in certain categories but consistently poorly in others, thus providing new insights into fundamental similarities and differences between human and LM reasoning.
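As a rough illustration of the comparison described above, the sketch below computes one classical psychometric property, item difficulty (the proportion of respondents answering an item correctly), separately for a human population and an LM population, and then measures agreement with a Pearson correlation. The abstract does not state which properties or which agreement measure the authors use, so both choices are assumptions, as are the function names and the toy response matrices.

```python
import math


def item_difficulty(responses):
    """Proportion of respondents answering each item correctly.

    responses: list of rows, one per respondent; each row holds a 0/1
    score per item. (Classical test theory "difficulty"; an assumed
    choice of property, not confirmed by the abstract.)
    """
    n_respondents = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_respondents
            for i in range(n_items)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Toy 0/1 response matrices (4 respondents x 3 items), invented here.
human = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 0, 0]]
lm = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [0, 1, 0]]

human_difficulty = item_difficulty(human)
lm_difficulty = item_difficulty(lm)
agreement = pearson(human_difficulty, lm_difficulty)
print(human_difficulty, lm_difficulty, agreement)
```

Grouping items by category before computing the correlation would give a per-category view of agreement, which is one way to read the "consistently well in certain categories" finding.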
Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted for building competitive systems. However, many languages have yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including: i) a large parallel corpus covering 22 Turkic languages consisting of common public datasets in combination with new datasets of approximately 2 million parallel sentences, ii) bilingual baselines for 26 language pairs, iii) novel high-quality test sets in three different translation domains and iv) human evaluation scores. All of our data, software and models are publicly available.