Most successful CNN architectures are deep networks whose heavy memory and processing requirements make them difficult to deploy on microcontrollers and other real-time systems, where memory footprint and power consumption cannot be neglected. Consequently, many efforts have been made to adapt deep networks to such contexts, either through hardware specialization and optimization or through architectural modifications. This paper concentrates on the application of CNNs to voice recognition and verification, building on the VGGNet architecture proposed by Simonyan and Zisserman. Because this model is commonly trained on VoxCeleb, an audio-visual dataset, this paper further evaluates and improves its performance on a challenging Chinese-language speech dataset collected from various media with little preprocessing. Finally, quantization is applied to reduce the model's memory usage and make it suitable for edge computing applications.
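The quantization scheme is not detailed here. As an illustration only, the sketch below assumes symmetric per-tensor post-training int8 quantization of a layer's float32 weights; the helper names `quantize_int8` and `dequantize` are hypothetical. Storing each weight as an 8-bit integer plus a single shared float scale cuts weight storage roughly fourfold relative to float32, at the cost of a rounding error of at most half the scale per weight.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (illustrative assumption, not the paper's scheme).

    Maps floats in [-max|w|, max|w|] onto integers in [-127, 127] using one scale factor.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric int8 range.
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantize(q, scale):
    """Recover approximate float32 weights from int8 values and the shared scale."""
    return array("f", (v * scale for v in q))


# Hypothetical float32 weights of a small layer.
weights = array("f", [0.5, -1.27, 0.0, 0.3175, 1.0])
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Storage comparison: int8 bytes vs float32 bytes (the scale itself is ignored here).
print(q.itemsize * len(q), weights.itemsize * len(weights))  # 5 20
```

Per-channel scales or asymmetric (zero-point) quantization are common alternatives that usually reduce accuracy loss for CNN weights; the per-tensor symmetric variant is shown only because it is the simplest to follow.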