This article describes a process for automatically building Arabic multi-dialect speech corpora over Voice over Internet Protocol (VoIP). The Asterisk framework serves as the connection between two virtual machines created for this purpose: a sender and a receiver. The sender places a VoIP call to the receiver through Asterisk and the receiver records the call automatically; this is repeated for every audio file in the corpora. In this work, more than 67,000 automatic calls were made between the sender and receiver machines, producing VoIP corpora for four Arabic dialects. The resulting corpora can be considered the first Arabic VoIP parallel speech corpora and will be made freely available to researchers in Arabic NLP and speech recognition.
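As an illustration of the sender side of such a loop, the following minimal sketch generates one Asterisk call file per audio file. It is not the authors' implementation: the abstract does not say how the calls were triggered, and call files are only one mechanism Asterisk offers for originating calls. The peer name `receiver`, the extension `1000`, the `SIP` channel technology, the corpus paths, and the function names are all hypothetical.

```python
from pathlib import PurePosixPath


def build_call_file(audio_path: str, peer: str = "receiver",
                    extension: str = "1000") -> str:
    """Return the text of an Asterisk call file that dials the receiver
    and plays one corpus recording into the call.

    The peer, extension, and SIP channel technology are assumptions.
    """
    # Asterisk's Playback application takes the sound file name
    # without its extension.
    sound = str(PurePosixPath(audio_path).with_suffix(""))
    lines = [
        f"Channel: SIP/{peer}/{extension}",  # whom to call (hypothetical)
        "MaxRetries: 0",                     # do not redial on failure
        "WaitTime: 30",                      # seconds to wait for an answer
        "Application: Playback",             # run once the call is answered
        f"Data: {sound}",                    # argument passed to Playback
    ]
    return "\n".join(lines) + "\n"


def build_all(audio_paths):
    """Map each audio file to the call file text that would transmit it."""
    return {path: build_call_file(path) for path in audio_paths}
```

In a real deployment each generated file would be moved into Asterisk's outgoing spool directory (by default `/var/spool/asterisk/outgoing`), and the receiver's dialplan would answer and record the incoming call. Both steps are left out here because they depend on the local configuration.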