Monaural speech separation is a fundamental problem in robust speech
processing. Recently, deep neural network (DNN)-based speech separation methods,
which predict either clean speech or an ideal time-frequency mask, have
demonstrated remarkable performance improvements. However, a single DNN with a
given window length does not leverage contextual information sufficiently, and
the differences between the two optimization objectives are not well understood.
In this paper, we propose a deep ensemble method, named multicontext networks,
to address monaural speech separation. The first multicontext network averages
the outputs of multiple DNNs whose inputs employ different window lengths.
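To make the averaging scheme concrete, the following is a minimal PyTorch sketch of such an ensemble; the layer sizes, context lengths, and names (splice_frames, ContextDNN, MulticontextAverage) are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the first multicontext network: average the mask
# estimates of DNNs that operate on different context window lengths.
# All sizes and names here are illustrative assumptions.
import torch
import torch.nn as nn


def splice_frames(feats, context):
    """Concatenate each frame with its +/-`context` neighbouring frames.

    feats: (num_frames, feat_dim) -> (num_frames, (2*context + 1)*feat_dim)
    """
    padded = torch.cat([feats[:1].repeat(context, 1),
                        feats,
                        feats[-1:].repeat(context, 1)], dim=0)
    windows = [padded[i:i + feats.shape[0]] for i in range(2 * context + 1)]
    return torch.cat(windows, dim=1)


class ContextDNN(nn.Module):
    """One DNN operating on a fixed context window; outputs a soft ratio mask."""

    def __init__(self, feat_dim, context, hidden=1024, out_dim=64):
        super().__init__()
        self.context = context
        in_dim = (2 * context + 1) * feat_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim), nn.Sigmoid(),  # mask values in [0, 1]
        )

    def forward(self, feats):
        return self.net(splice_frames(feats, self.context))


class MulticontextAverage(nn.Module):
    """Average the outputs of DNNs whose inputs use different window lengths."""

    def __init__(self, feat_dim, contexts=(1, 2, 3), out_dim=64):
        super().__init__()
        self.dnns = nn.ModuleList(
            ContextDNN(feat_dim, c, out_dim=out_dim) for c in contexts)

    def forward(self, feats):
        return torch.stack([dnn(feats) for dnn in self.dnns]).mean(dim=0)
```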
The second multicontext network is a stack of multiple DNNs. Each DNN in a
module of the stack takes the concatenation of the original acoustic features
and the expanded soft output of the lower module as its input, and predicts the
ratio mask of the target speaker; the DNNs in the same module employ different
contexts.
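The stacking scheme can be sketched in the same style, reusing the hypothetical splice_frames and ContextDNN helpers above; reading each module as an average over differently-contexted DNNs whose input concatenates the original features with the spliced (expanded) soft mask of the module below is our interpretation of the description, with the depth and dimensions again assumed.

```python
class StackedMulticontext(nn.Module):
    """Stack of modules; each module's DNNs see the original features
    concatenated with the expanded (spliced) soft output of the module below.
    Depth, contexts, and expansion width are illustrative assumptions.
    """

    def __init__(self, feat_dim, mask_dim=64, num_modules=2,
                 contexts=(1, 2, 3), expand_context=1):
        super().__init__()
        self.expand_context = expand_context
        self.modules_list = nn.ModuleList()
        in_dim = feat_dim
        for level in range(num_modules):
            # From the second module on, the per-frame input also carries
            # the spliced soft mask estimated by the previous module.
            if level > 0:
                in_dim = feat_dim + (2 * expand_context + 1) * mask_dim
            self.modules_list.append(nn.ModuleList(
                ContextDNN(in_dim, c, out_dim=mask_dim) for c in contexts))

    def forward(self, feats):
        module_input = feats
        mask = None
        for level, dnns in enumerate(self.modules_list):
            if level > 0:
                # Expand the lower module's soft output over neighbouring
                # frames and concatenate it with the original features.
                module_input = torch.cat(
                    [feats, splice_frames(mask, self.expand_context)], dim=1)
            # DNNs with different contexts in the same module are averaged.
            mask = torch.stack([dnn(module_input) for dnn in dnns]).mean(dim=0)
        return mask
```

A forward pass on a (num_frames, feat_dim) feature matrix then yields a (num_frames, mask_dim) ratio-mask estimate for the target speaker.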
We have conducted extensive experiments with three speech corpora. The results
demonstrate the effectiveness of the proposed method. We have also compared the
two optimization objectives systematically and found that predicting the ideal
time-frequency mask is more efficient in utilizing clean training speech, while
predicting clean speech is less sensitive to SNR variations.