Speech separation is the task of separating target speech from background
interference. Traditionally, speech separation has been studied as a signal
processing problem. A more recent approach formulates speech separation as a
supervised
learning problem, where the discriminative patterns of speech, speakers, and
background noise are learned from training data. Over the past decade, many
supervised separation algorithms have been put forward. In particular, the
recent introduction of deep learning to supervised speech separation has
dramatically accelerated progress and boosted separation performance. This paper
provides a comprehensive overview of the research on deep-learning-based
supervised speech separation in the last several years. We first introduce the
background of speech separation and the formulation of supervised separation.
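To make the supervised formulation concrete, the sketch below computes an ideal ratio mask (IRM), one widely used training target in this literature, from parallel clean-speech and noise signals, and recovers a speech estimate by applying the mask to the mixture spectrogram. The random stand-in signals, STFT parameters, and masking form are illustrative assumptions, not a specification from the paper.

```python
# Minimal sketch of mask-based separation under assumed settings: the
# ideal ratio mask (IRM) serves as the training target that a learning
# machine would be trained to predict from features of the mixture.
import numpy as np
from scipy.signal import stft, istft

fs = 16000                                  # assumed sampling rate (Hz)
rng = np.random.default_rng(0)
speech = rng.standard_normal(fs)            # stand-in for a clean-speech signal
noise = 0.5 * rng.standard_normal(fs)       # stand-in for a noise signal
mixture = speech + noise                    # supervised data: parallel signals

# Time-frequency representations via the short-time Fourier transform.
_, _, S = stft(speech, fs=fs, nperseg=512)
_, _, N = stft(noise, fs=fs, nperseg=512)
_, _, Y = stft(mixture, fs=fs, nperseg=512)

# IRM: ratio of speech energy to total energy in each time-frequency unit.
irm = np.sqrt(np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12))

# Applying the (here, oracle) mask to the mixture and inverting the STFT
# yields the separated-speech estimate.
_, speech_est = istft(irm * Y, fs=fs, nperseg=512)
```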
Then, we discuss three main components of supervised separation: learning
machines, training targets, and acoustic features. Much of the overview is
devoted to separation algorithms, where we review monaural methods, including
speech enhancement (speech-nonspeech separation), speaker separation
(multitalker separation), and speech dereverberation, as well as
multimicrophone techniques.
The important issue of generalization, unique to supervised learning, is
discussed. This overview provides a historical perspective on how advances are
made. In addition, we discuss a number of conceptual issues, including what
constitutes the target source.
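Along the same lines, the following sketch shows the learning-machine component in its simplest form: a small feedforward network trained to map mixture magnitude features to a ratio mask. The architecture, feature choice, and optimizer settings are an assumed toy configuration, not the specific models reviewed in the overview.

```python
# Toy mask-estimation training loop (assumed configuration), written
# with PyTorch: the network predicts a [0, 1] mask per frequency bin.
import torch
import torch.nn as nn

n_freq = 257                                # bins for a 512-point STFT
model = nn.Sequential(                      # assumed small feedforward net
    nn.Linear(n_freq, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, n_freq), nn.Sigmoid(),  # ratio masks lie in [0, 1]
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in training pairs, one row per time frame; in practice these
# would be mixture magnitudes and IRM targets from STFTs as above.
mix_mag = torch.rand(64, n_freq)
target_mask = torch.rand(64, n_freq)

for step in range(200):
    optimizer.zero_grad()
    est_mask = model(mix_mag)               # forward pass
    loss = loss_fn(est_mask, target_mask)   # mask-approximation objective
    loss.backward()                         # backpropagation
    optimizer.step()
```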