Diffusion-based speech enhancement (SE) models have recently demonstrated superior performance compared to traditional single-step models. In this work, we revisit the advantages of diffusion models from a multi-source learning perspective, highlighting that their ability to jointly leverage the data likelihood and the conditional mapping makes them theoretically superior to deterministic models when controllability is ensured. From this standpoint, we identify a key limitation in DOSE, a recent diffusion-based SE model that enhances controllability by applying a fixed dropout ratio to non-conditional inputs, which causes unnecessary information loss at every timestep. To address this, we propose a timestep-aware dropout mechanism that dynamically adjusts the dropout intensity at each denoising step. Extensive experiments on matched and cross-dataset benchmarks show that our method consistently outperforms DOSE and other state-of-the-art diffusion-based SE methods, achieving superior speech enhancement with high efficiency. The code and audio samples are publicly available here.
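To make the idea concrete, below is a minimal sketch of timestep-aware dropout on the non-conditional input. The abstract does not specify the schedule, so the linear ramp p(t) = p_max · t/T, the function name `timestep_aware_dropout`, and the `p_max` parameter are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def timestep_aware_dropout(x: torch.Tensor, t: torch.Tensor, T: int,
                           p_max: float = 0.5) -> torch.Tensor:
    """Drop elements of the non-conditional input x with a probability
    that grows with the diffusion timestep t, so low-noise (early)
    steps lose less information than high-noise (late) ones.

    Assumed linear schedule p(t) = p_max * t / T; the paper's schedule
    may differ.
    """
    p = p_max * (t.float() / T)                      # per-sample ratio, shape (B,)
    p = p.view(-1, *([1] * (x.dim() - 1)))           # broadcast over feature dims
    mask = torch.rand_like(x) >= p                   # keep with prob 1 - p(t)
    return x * mask

# Example: a batch of 4 spectrogram-like inputs at random timesteps.
x = torch.randn(4, 1, 128)
t = torch.randint(0, 1000, (4,))
y = timestep_aware_dropout(x, t, T=1000)
```

This contrasts with DOSE's fixed ratio, which applies the same `p` at every timestep regardless of how much information the noisy input still carries.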
[Audio demo: eight example utterances, each presented in four conditions: Clean, Noisy, DOSE, and DOSE+.]