Neural networks are increasingly used in recognition problems involving static and moving images, sound, and other data. Unfortunately, selecting an optimal neural network architecture for a specific recognition problem is a difficult and largely experimental task. In this paper we present the use of evolutionary algorithms to obtain optimal architectures of neural networks for audio sample classification. We extend the Pytorch DNN Evolution tool, which implements co-evolutionary algorithms that produce groups of neural networks solving a given problem with a certain accuracy, adding support for problems whose training data consists of audio samples. We apply this co-evolutionary approach to a sample sound classification problem and describe how the sound data is prepared for processing using Mel Frequency Cepstral Coefficients (MFCC). Next, we present the results of experiments conducted on the AudioMNIST dataset. Finally, we discuss the obtained neural network architectures, whose classification accuracy is comparable to that attained by the AlexNet neural network, and their implications.
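As an illustration of the MFCC preprocessing step mentioned above, the following is a minimal sketch using torchaudio; the sample rate, number of coefficients, mel-filterbank settings, and file path are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torchaudio

# Assumed parameters for illustration; the paper's exact settings may differ.
SAMPLE_RATE = 48_000   # AudioMNIST recordings are 48 kHz mono WAV files
N_MFCC = 40            # number of cepstral coefficients per frame

mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE,
    n_mfcc=N_MFCC,
    melkwargs={"n_fft": 1024, "hop_length": 512, "n_mels": 64},
)

def audio_to_mfcc(path: str) -> torch.Tensor:
    """Load a WAV file and return its MFCC feature matrix (n_mfcc x frames)."""
    waveform, sr = torchaudio.load(path)  # shape: (channels, samples)
    if sr != SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
    return mfcc_transform(waveform).squeeze(0)  # drop the channel dimension

# Hypothetical usage with an AudioMNIST-style file name:
# features = audio_to_mfcc("data/01/0_01_0.wav")
```

The resulting fixed-size MFCC matrices can then be fed to the evolved networks in the same way as image tensors.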