The Armenian language is an independent branch of the Indo-European language family and the official language of the Republic of Armenia and the Republic of Artsakh. According to various sources, about 3 million people in Armenia and 10-12 million people in the Armenian diaspora speak Armenian as their native language. The largest communities outside of Armenia are in the United States of America, Canada, the Russian Federation, the Islamic Republic of Iran, the French Republic, the Syrian Arab Republic and the Lebanese Republic. This paper presents the ArmSpeech speech corpus, a collection of annotated Armenian speech intended for natural language processing (NLP) research and development. ArmSpeech is designed for speech-to-text and text-to-speech purposes but can also be used in other domains (e.g. language identification). The corpus contains 6206 high-quality audio samples: 11 hours, 46 minutes and 26 seconds (11.77 hours) of annotated native Armenian speech from multiple speakers of various ages, genders and accents. To the best of our knowledge, this is the most extensive publicly available Armenian speech corpus for speech recognition, speech synthesis and spoken language identification systems.
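The reported total of 11.77 hours is simply the h:m:s duration converted to fractional hours. A minimal sketch of that conversion (function name is illustrative, not from the paper):

```python
def to_hours(h, m, s):
    """Convert an h:m:s duration to fractional hours."""
    return (h * 3600 + m * 60 + s) / 3600

# ArmSpeech total duration: 11 h 46 min 26 s
print(round(to_hours(11, 46, 26), 2))  # → 11.77
```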
The first ArmSpeech paper presented an annotated native Armenian speech corpus, its data collection, preprocessing and annotation processes, and the corpus structure and statistics. The main motivation for creating ArmSpeech was to expand Armenian language research resources: our survey found no free or paid Armenian speech corpora suitable for speech-to-text, text-to-speech or language research. From an NLP perspective, Armenian is a low-resourced language, despite being an independent branch of the Indo-European language family and the native language of 12-15 million people. The first release of the corpus mainly contains audio clips extracted from free-to-use audiobooks: 6206 clips from multiple speakers of various ages, genders and accents, with a total duration of 11.77 hours. This paper presents the extended version of ArmSpeech, a continuation of the previous work that adds annotated Armenian speech recorded on the principle of voluntary voice donation. The paper also describes the data collection, pre-processing, recording and annotation stages, as well as the final results and statistics of the corpus. The material (text) needed for the recording was collected from articles on Armenian news websites about lifestyle, culture, sport and politics. The recordings were made by one female and three male volunteers whose native language is Armenian. The data in the second release totals approximately 4 hours; together with the first release, the ArmSpeech corpus grows to 15.7 hours.
Text chunking, part-of-speech (POS) tagging and named entity recognition (NER) are fundamental tasks in natural language processing (NLP). POS tagging involves assigning grammatical labels to the words in a sentence. Research shows that Armenian is a low-resourced language, and there are not enough materials for developing highly accurate POS tagging systems for it. This paper presents a new dataset for POS tagging in Armenian, available in two versions that follow the naming conventions of the Penn Treebank and Universal Dependencies tagsets respectively. The dataset consists of 6081 sentences that were automatically annotated and then manually verified. The data was sourced from Armenian news websites, focusing on topics such as culture, medicine and lifestyle, as well as 22 Armenian fairytales. Two versions of the tagset were produced to ensure compatibility with NLP tools and models that use either standard; standardizing the tagset also makes it easier to compare and evaluate different POS tagging models. The paper also describes the data collection, cleaning, preprocessing and processing steps. The ISMA translator was used to annotate the dataset: it not only performs machine translation but also conducts a syntactic and semantic analysis of the text and assigns a POS tag to each word in a sentence. The final corpus contains 13 groups of part-of-speech tags and a total of 57160 tagged tokens, including the distinction between singular and plural parts of speech.
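Keeping the two tagset versions in sync typically comes down to a per-tag mapping table. A minimal sketch, using a hypothetical subset of the Penn Treebank tags mapped to Universal Dependencies UPOS labels (the actual ISMA tag inventory may differ):

```python
# Hypothetical PTB -> UD mapping (illustrative subset, not the paper's table).
PTB_TO_UD = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "PROPN",
    "VB": "VERB", "VBD": "VERB",
    "JJ": "ADJ", "RB": "ADV", "PRP": "PRON",
    "IN": "ADP", "CC": "CCONJ", "CD": "NUM",
}

def ptb_to_ud(tagged_sentence):
    """Convert (token, PTB tag) pairs to (token, UD tag) pairs.

    Unknown tags fall back to the UD catch-all "X".
    """
    return [(tok, PTB_TO_UD.get(tag, "X")) for tok, tag in tagged_sentence]
```

With a table like this, a corpus annotated once can be exported in either convention without re-annotation.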
Eye-related research has shown that eye gaze data is important for many applications in human daily life, and it has been used in research and systems for eye movement analysis, eye tracking and gaze tracking. Eye pupil localization, labelling and tracking are challenging problems in computer science, and this article explores them. The YOLOv4 ("You only look once") object detection algorithm, an evolution of the YOLOv3 model, was evaluated on a small dataset of 103 eye images. The YOLOv4 algorithm was created by Alexey Bochkovskiy, Chien-Yao Wang and Hong-Yuan Mark Liao [3]; it is twice as fast as EfficientDet with comparable performance. The main purpose of this article is to test the YOLOv4 algorithm and determine its effectiveness at localizing and labelling eye pupils when trained on a small dataset with a relatively small number of iterations.
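Detector quality in experiments like this is conventionally scored by the intersection-over-union (IoU) between a predicted box and the ground-truth box. A minimal sketch of that metric (a standard definition, not the paper's evaluation code):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

A prediction is usually counted as a correct pupil detection when its IoU with the annotation exceeds a threshold such as 0.5.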
Nowadays automatic speech recognition (ASR) is an important task for machines. Applications such as speech translation, virtual assistants and voice bot systems use ASR to understand human speech, yet most research and available models target widely used languages such as English, German, French, Chinese and Spanish. This paper presents an Armenian speech recognition system. As a result of this research, acoustic and language models were developed for the Armenian language (modern ASR systems combine the two to achieve higher accuracy). Baidu's RNN-based Deep Speech neural network was used to train the acoustic model, and the KenLM toolkit was used to train the probabilistic language model. The acoustic model was trained and validated on the ArmSpeech native Armenian speech corpus using transfer learning and data augmentation, and tested on the Common Voice Armenian database. The language model was built from texts scraped from Armenian news websites. The final models are small and can perform real-time speech-to-text on IoT devices. On the Common Voice Armenian database, the model achieved a WER of 0.902565 and a CER of 0.305321 without the language model, and a WER of 0.552975 and a CER of 0.285904 with it. The paper describes the environment setup, data collection, and acoustic and language model training processes, as well as the final results and benchmarks.
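The WER and CER figures quoted above are both normalized edit distances, computed at the word and character level respectively. A minimal sketch of how such metrics are computed (a standard definition, not the project's evaluation code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # deletions
    for j in range(n + 1):
        dp[0][j] = j  # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Note that a WER above 0.9 without a language model can still correspond to a much lower CER, since many hypothesized words differ from the reference by only a character or two.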