This paper is a position statement on reproducible research in linguistics, including data citation and attribution, that represents the collective views of some 41 colleagues. Reproducibility can play a key role in increasing verification and accountability in linguistic research, and is a hallmark of social science research that is currently under-represented in our field. We believe that we need to take time as a discipline to clearly articulate our expectations for how linguistic data are managed, cited, and maintained for long-term access.
Daniel Macdonald, a Presbyterian Church of Victoria missionary to the New Hebrides from 1872 to 1905, developed a particularly strong interest in language. A prodigious author, he published widely and at length on the languages of Efate, and especially those of the Havannah Harbour area where he was stationed. But if his work is recalled today, it is as something of a curio, both for his insistence—archaic even for the times—on a link between ancient Semitic and Efate, and for his vigorous promotion of the use by the mission and its converts of a single, hybrid Efate language. This paper addresses and seeks to analyze what Macdonald himself called this “compromise literary dialect.” By identifying distinctive features of the three main varieties of Efate languages known today (Nguna or Nakanamanga, South Efate, and Lelepa), we aim to move beyond the lexical comparisons that have been the sole means of gauging relationships among these languages thus far. This enables us to begin the process of investigating the claim of Captain Rason, British Deputy Commissioner for the New Hebrides during Macdonald’s last years on Efate, that the “compromise literary dialect” was in fact a spoken dialect particular to the area of Havannah Harbour. We hope to reconsider and perhaps recuperate some of Macdonald’s writing as a rare if often distorted window on indigenous life and language at a pivotal moment in the transformation of Efate communities.
Machine learning has revolutionised speech technologies for major world languages, but these technologies have generally not been available for the roughly 4,000 languages with populations of fewer than 10,000 speakers. This paper describes the development of Elpis, a pipeline that language documentation workers with minimal computational experience can use to build their own speech recognition models, and that has been used to build models for 16 languages from the Asia-Pacific region. Elpis puts machine learning speech technologies within reach of people working with languages with scarce data, in a scalable way. This matters because it helps language communities cross the digital divide and speeds up language documentation. Complete automation of the process is not feasible for languages with small quantities of data and potentially large vocabularies. Our goal is therefore not full automation, but a practical and effective workflow that integrates machine learning technologies.