Monaural speech separation is a very challenging problem. Recent studies utilize supervised learning methods to estimate the ideal binary mask (IBM) to solve the problem. In a supervised learning framework, the issue of generalization to conditions different from those used in training is paramount. This paper describes methods that require only a small training corpus but can generalize to unseen conditions. The system utilizes support vector machines to learn classification cues and then employs a rethresholding method to estimate the IBM. A distribution fitting method is used to address unseen signal-to-noise ratio conditions and an iterative voice activity detection is used to address unseen noise conditions. Systematic evaluations show that the proposed approach produces high quality IBM estimates under unseen conditions.
展开▼