Comparing artificial neural networks (ANNs) with outputs of brain imaging techniques has recently seen substantial advances in (computer) vision and text-based language models. Here, we propose a framework for comparing biological and artificial neural computations of spoken language representations and identify several new challenges for this paradigm. Using a technique proposed by Begus and Zhou (2021b), we can analyze the encoding of any acoustic property in the intermediate convolutional layers of an artificial neural network. This allows us to test similarities in speech encoding between the brain and artificial neural networks in a way that is more interpretable than most existing approaches, which focus on correlations and supervised models. We introduce fully unsupervised deep generative models trained on raw speech (the Generative Adversarial Network architecture) into the brain-and-ANN-comparison paradigm; these models enable testing of both production and perception principles in human speech. We present a framework that parallels electrophysiological experiments measuring the complex Auditory Brainstem Response (cABR) in the human brain with intermediate layers in deep convolutional networks. We compared peak latency in the cABR relative to the stimulus in the brain stem experiment with peak latency in intermediate convolutional layers relative to the input/output in deep convolutional networks. We also examined and compared the effect of prior language exposure on peak latency, both in the cABR and in intermediate convolutional layers, for a specific phonetic property. This property, a stop with a voice onset time (VOT) of 10 ms, is perceived differently by English and Spanish speakers: as voiced (e.g., [ba]) by the former and as voiceless (e.g., [pa]) by the latter. Critically, cABR peak latency to this VOT property differs between English and Spanish speakers, and peak latency in intermediate convolutional layers differs between English-trained and Spanish-trained computational models. Substantial similarities in peak latency encoding between the human brain and intermediate convolutional layers emerge from results on eight trained networks (including a replication experiment). The proposed technique can be used to compare encoding between the human brain and intermediate convolutional layers for any acoustic property.
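
As a concrete illustration of how peak latency in an intermediate convolutional layer can be measured relative to the network's output, the sketch below shows one possible implementation. It is a minimal example under assumed names and structure (a PyTorch generator that maps a latent vector to a raw waveform, a hypothetical `layer_sample_rate` for the chosen layer, and a 16 kHz output rate); it is not the exact pipeline used in the experiments reported here.

```python
# Hypothetical sketch (names and model structure are assumptions): measure the
# peak latency of an intermediate convolutional layer of a trained generator,
# alongside the peak latency of the generated waveform, both in milliseconds.
import numpy as np
import torch

SAMPLE_RATE = 16000  # assumed sampling rate of the generated audio


def peak_latency_ms(signal: np.ndarray, sample_rate: float) -> float:
    """Return the time (ms) of the maximum-amplitude sample in a 1-D signal."""
    return float(np.argmax(np.abs(signal))) / sample_rate * 1000.0


def layer_and_output_latency(generator: torch.nn.Module,
                             layer: torch.nn.Module,
                             z: torch.Tensor,
                             layer_sample_rate: float) -> tuple[float, float]:
    """Generate audio from latent z, capture the chosen intermediate layer's
    activations with a forward hook, and return (layer_peak_ms, output_peak_ms)."""
    captured = {}

    def hook(_module, _inputs, output):
        # Average over channels to obtain a single activation time series.
        captured["act"] = output.detach().mean(dim=1).squeeze().cpu().numpy()

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        audio = generator(z).squeeze().cpu().numpy()
    handle.remove()

    return (peak_latency_ms(captured["act"], layer_sample_rate),
            peak_latency_ms(audio, SAMPLE_RATE))
```

Under these assumptions, the difference between the layer peak and the output peak can then be compared across English-trained and Spanish-trained networks, paralleling the comparison of cABR peak latency across English- and Spanish-speaking listeners.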