In this work, we present a baseline end-to-end deep learning system for automatic speech recognition in Brazilian Portuguese. To build the model, we assemble a speech corpus containing 158 hours of annotated speech from four individual datasets, three of them publicly available, and a text corpus containing 10.2 million sentences. We train an acoustic model based on the DeepSpeech 2 network, with two convolutional and five bidirectional recurrent layers. By adding a newly trained character-level 15-gram language model, we achieve a character error rate of only 10.49% and a word error rate of 25.45%, on a par with works in other languages that use a similar amount of training data.
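The character and word error rates reported above are both edit-distance metrics. As a minimal illustration (not the authors' evaluation code), CER and WER can be computed with a standard Levenshtein distance over characters and over whitespace-split words, respectively:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, via dynamic programming."""
    n = len(hyp)
    prev = list(range(n + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n]

def cer(ref, hyp):
    """Character error rate: character edits over reference length."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word error rate: word edits over number of reference words."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())
```

For example, `wer("o gato preto", "o gato prato")` is 1/3: one substituted word out of three reference words.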
This work presents an open-source deep-learning-based facial recognition system. The system comprises five main steps: face segmentation, facial feature detection, face alignment, embedding, and classification. We use deep learning methods for fiducial point extraction and embedding, and a Support Vector Machine (SVM) for the classification task, since it is fast for both training and inference. The system achieves an error rate of 0.12103 for facial feature detection, close to state-of-the-art algorithms, and 0.05 for face recognition. Moreover, it is capable of running in real time.
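The final step of such a pipeline assigns an identity to a face embedding. The paper uses an SVM for this; as a simpler stand-in that illustrates the idea, the sketch below matches an embedding against a gallery of reference embeddings by cosine similarity, with a rejection threshold for unknown faces (the threshold value and the tiny 3-d vectors are illustrative assumptions, not from the paper):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify(embedding, gallery, threshold=0.5):
    """Return the best-matching identity, or None if no reference
    embedding is similar enough (open-set rejection)."""
    best_id, best_sim = None, -1.0
    for identity, ref in gallery.items():
        sim = cosine(embedding, ref)
        if sim > best_sim:
            best_id, best_sim = identity, sim
    return best_id if best_sim >= threshold else None
```

In practice an SVM trained on labeled embeddings learns these decision boundaries instead of relying on a fixed threshold, which is part of why it remains fast at inference time.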
Automated traffic monitoring is becoming increasingly important as the number of vehicles in circulation grows. Nevertheless, traffic control is still predominantly done manually using video cameras. In this work, we extensively analyze our previous collaborative and opportunistic traffic monitoring system, evaluating the proposal in scenarios with more than one vehicle. Based on the information carried by IEEE 802.11 beacon frames, vehicles report their location to a central entity, which handles and disseminates information about traffic conditions on urban roads, exploiting readily available network resources. Experiments conducted via ns-3 simulations demonstrate that it is possible to infer traffic conditions using a simple architecture while generating a small amount of network traffic.
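To make the inference step concrete, here is a minimal sketch of what a central entity could do with the beacon-derived reports: estimate each vehicle's speed from successive (timestamp, position) pairs and classify a road segment from the mean speed. The speed thresholds and report format are illustrative assumptions, not the paper's actual scheme:

```python
def average_speed(reports):
    """Estimate a vehicle's average speed (m/s) from successive
    (timestamp_s, position_m) reports along a road segment."""
    reports = sorted(reports)
    distance = reports[-1][1] - reports[0][1]
    elapsed = reports[-1][0] - reports[0][0]
    return distance / elapsed

def classify_segment(speeds, free_flow=14.0, jam=4.0):
    """Classify a segment's traffic condition from the mean speed
    of the vehicles observed on it (thresholds in m/s)."""
    mean = sum(speeds) / len(speeds)
    if mean >= free_flow:
        return "free-flow"
    if mean <= jam:
        return "congested"
    return "slow"
```

Each report fits in a beacon frame's payload, which is consistent with the small network overhead observed in the simulations.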
We propose a voice conversion system leveraging recent developments in both voice synthesis and image morphing: a CycleGAN converts mel-spectrograms, and neural vocoders synthesize the converted signals. To evaluate how different vocoders perform on this task, we synthesize converted mel-spectrograms using the WaveNet, WaveRNN, and MelGAN vocoders. We compare their performance via listening tests, finding that MelGAN and WaveRNN obtain comparable results, while WaveNet performs worse on converted speech.
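CycleGAN-based conversion works without parallel data because of its cycle-consistency term: converting a spectrogram to the target domain and back should reconstruct the original. A minimal sketch of that loss, with generators represented as plain functions over flattened spectrogram vectors (a simplification of the actual 2-D networks):

```python
def l1(a, b):
    """Mean absolute error between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(x, g_ab, g_ba):
    """L1 reconstruction error after the round trip A -> B -> A.
    CycleGAN minimizes this alongside the adversarial losses so that
    conversion preserves linguistic content."""
    return l1(g_ba(g_ab(x)), x)
```

A pair of generators that invert each other drives this term to zero, which is what keeps the converted spectrogram tied to the source utterance.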