Modern neural architectures for classification tasks are trained using the cross-entropy loss, which is widely believed to be empirically superior to the square loss. In this work we provide evidence indicating that this belief may not be well-founded. We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources. Indeed, we observe that the square loss produces better results in the large majority of NLP and ASR experiments. Cross-entropy appears to have a slight edge on computer vision tasks. We argue that there is little compelling empirical or theoretical evidence indicating a clear-cut advantage to the cross-entropy loss. Indeed, in our experiments, performance on nearly all non-vision tasks can be improved, sometimes significantly, by switching to the square loss. We posit that training with the square loss for classification needs to be a part of best practices of modern deep learning, on equal footing with cross-entropy.

1. We note that the WSJ and Librispeech datasets have two separate classification tasks in terms of the evaluation metrics, based on the same learned acoustic model. We choose to count them as separate tasks.
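In practice, the change amounts to replacing the cross-entropy criterion with a mean squared error between the network outputs and one-hot encoded labels. The sketch below is a minimal, hypothetical PyTorch illustration of that swap, not the paper's code; the model, batch, and hyper-parameters are placeholders.

```python
# Minimal sketch (assumed setup, not the paper's implementation): training a
# classifier with the square loss on one-hot targets instead of cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 10
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, num_classes))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def square_loss(logits, labels):
    # Treat the network outputs as regression targets for the one-hot label vector.
    targets = F.one_hot(labels, num_classes).float()
    return F.mse_loss(logits, targets)

# This replaces the usual criterion = nn.CrossEntropyLoss(); the rest of the
# training loop (data, optimizer, schedule) stays unchanged.
x = torch.randn(32, 784)                   # dummy feature batch
y = torch.randint(0, num_classes, (32,))   # dummy integer class labels
loss = square_loss(model(x), y)
loss.backward()
optimizer.step()
```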
We apply a fast kernel method to mask-based single-channel speech enhancement. Specifically, our method solves a kernel regression problem associated with a non-smooth kernel function (the exponential power kernel) using a highly efficient iterative method (EigenPro). Owing to the simplicity of this method, its hyper-parameters, such as the kernel bandwidth, can be selected automatically and efficiently via line search on subsamples of the training data. We observe an empirical correlation between the regression loss (mean square error) and standard speech enhancement metrics. This observation justifies our training target and motivates us to further reduce the regression loss by training a separate kernel model per frequency subband. We compare our method with state-of-the-art deep neural networks on mask-based speech enhancement using the HINT and TIMIT datasets. Experimental results show that our kernel method consistently outperforms deep neural networks while requiring less training time.

Index Terms: large-scale kernel machines, deep neural networks, speech enhancement, exponential power kernel, automatic hyper-parameter selection
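For intuition, the sketch below shows one common form of the exponential power kernel and a plain kernel ridge regression fit; it is an assumed illustration, not the paper's EigenPro solver. The kernel form, the `bandwidth`/`power` parameter names, the regularized direct solve, and the toy data are all placeholders (EigenPro replaces the direct solve with a preconditioned iterative method so the approach scales to large training sets, and the per-subband variant would fit one such model per frequency band).

```python
# Minimal sketch (assumed kernel form and direct solver; not the paper's
# EigenPro implementation): kernel regression for mask estimation.
import numpy as np

def exp_power_kernel(X, Z, bandwidth=1.0, power=1.0):
    """k(x, z) = exp(-(||x - z|| / bandwidth) ** power).
    power < 2 gives a non-smooth kernel; power = 1 is the Laplacian kernel."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    dists = np.sqrt(np.maximum(sq_dists, 0.0))
    return np.exp(-((dists / bandwidth) ** power))

def fit_kernel_regression(X_train, y_train, bandwidth, power, reg=1e-6):
    # Solve (K + reg * I) alpha = y directly; EigenPro would instead iterate,
    # which is what makes the method fast at scale.
    K = exp_power_kernel(X_train, X_train, bandwidth, power)
    return np.linalg.solve(K + reg * np.eye(len(X_train)), y_train)

def predict(X_test, X_train, alpha, bandwidth, power):
    return exp_power_kernel(X_test, X_train, bandwidth, power) @ alpha

# Toy usage: regress mask-like targets in [0, 1] from spectral features.
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(200, 64)), rng.uniform(size=(200, 64))
alpha = fit_kernel_regression(X_tr, y_tr, bandwidth=5.0, power=1.0)
masks = predict(rng.normal(size=(10, 64)), X_tr, alpha, bandwidth=5.0, power=1.0)
```

The bandwidth and power in this sketch are the kind of hyper-parameters the abstract describes selecting automatically by line search on subsamples of the training data.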