The purpose of this paper is to describe a new version of the Spoken English Corpus which will be of interest to phoneticians and other speech scientists. The Spoken English Corpus is a well-known collection of spoken-language texts that was collected and transcribed in the 1980s in a joint project involving IBM UK and the University of Lancaster (Alderson and Knowles forthcoming; Knowles and Taylor 1988). One valuable aspect of it is that the recorded material on which it is based is fairly freely available and the recording quality is generally good. At the time when the recordings were made, storing all the recorded material in digital form suitable for computer processing was of limited practicality. Although storage on digital tape was certainly feasible, it did not provide rapid computer access. The arrival of optical disk technology, with the possibility of storing very large amounts of digital data on a compact disk at relatively low cost, has brought about a revolution in ideas on database construction and use. It seemed to us that the recordings of the Spoken English Corpus (hereafter SEC) should now be converted into a form which would enable the user to gain access to the acoustic signal without the laborious business of winding through large amounts of tape. Once this was done, we should be able not only to listen to the recordings in a very convenient way, but also to carry out many automatic analyses of the material by computer.
Abstract: This paper presents the methods and results of a project that collects and analyses public comments written in response to political posts on Facebook, using natural language processing and social psychological methods, in order to explore emotional attitudes and social behavior.
In our research we have created a text summarization software tool for Hungarian using multilingual and Hungarian BERT-based models. Two types of text summarization method exist: abstractive and extractive. Abstractive summarization is more similar to human-generated summarization: target summaries may include phrases that the original text does not necessarily contain. This method generates the summarized text by applying keywords that were extracted from the original text. The extractive method summarizes the text by using the most important phrases or sentences extracted from the original text. In our research we have built both abstractive and extractive models for Hungarian. For the abstractive models, we used a multilingual BERT model and Hungarian monolingual BERT models. For extractive summarization, in addition to the BERT models, we also experimented with ELECTRA models. We found that the Hungarian monolingual models outperformed the multilingual BERT model in all cases. Furthermore, the ELECTRA small models achieved higher results than some of the BERT models. This result is important because the ELECTRA small models have far fewer parameters and were trained on only a single GPU within a couple of days. Another important consideration is that the ELECTRA