
Informed audio source separation with deep learning in limited data settings

Abstract: Audio source separation is the task of estimating the individual signals of several sound sources when only their mixture can be observed. State-of-the-art performance for musical mixtures is achieved by Deep Neural Networks (DNNs) trained in a supervised way. They require large and diverse datasets of mixtures along with the target source signals in isolation. However, such datasets are difficult and costly to obtain because music recordings are subject to copyright restrictions and isolated instrument recordings may not always exist.

In this dissertation, we explore the use of additional information for deep-learning-based source separation in order to overcome data limitations.

First, we focus on a supervised setting with only a small amount of training data available. We investigate to what extent singing voice separation can be improved when it is informed by lyrics transcripts. To this end, a novel deep learning model for informed source separation is proposed. It aligns text and audio during separation using a novel monotonic attention mechanism. Its lyrics alignment performance is competitive with state-of-the-art methods while using a smaller amount of training data. We find that exploiting aligned phonemes can improve singing voice separation, but precise alignments and accurate transcripts are required.

Finally, we consider a scenario where only mixtures, but no isolated source signals, are available for training. We propose a novel unsupervised deep learning approach to source separation that exploits information about the sources' fundamental frequencies (F0). The method integrates domain knowledge in the form of parametric source models into the DNN. Experimental evaluation shows that the proposed method outperforms F0-informed learning-free methods based on non-negative matrix factorization as well as an F0-informed supervised deep learning baseline. Moreover, the proposed method is extremely data-efficient: it makes powerful deep-learning-based source separation usable in domains where labeled training data is expensive or non-existent.
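To illustrate the separation task the abstract defines, the following is a minimal sketch (not the thesis method): two synthetic sources are mixed, and an oracle soft mask built from the source magnitude spectrograms recovers one source from the mixture. All signals and parameters here are hypothetical illustrations.

```python
import numpy as np
from scipy.signal import stft, istft

# Two synthetic "sources": pure tones at different frequencies.
fs = 8000
t = np.arange(fs) / fs
s1 = np.sin(2 * np.pi * 440 * t)   # source 1: 440 Hz tone
s2 = np.sin(2 * np.pi * 1200 * t)  # source 2: 1200 Hz tone
mix = s1 + s2                      # only the mixture is observed in practice

# Magnitude spectrograms of the mixture and of each source
# (in a real setting the source spectrograms must be estimated).
_, _, X = stft(mix, fs, nperseg=512)
_, _, S1 = stft(s1, fs, nperseg=512)
_, _, S2 = stft(s2, fs, nperseg=512)

# Ideal ratio (soft) mask for source 1, applied to the mixture STFT.
eps = 1e-10
mask1 = np.abs(S1) / (np.abs(S1) + np.abs(S2) + eps)
est1 = istft(mask1 * X, fs, nperseg=512)[1][:len(s1)]

# The estimate should correlate strongly with the true source.
corr = np.corrcoef(est1, s1)[0, 1]
print(round(corr, 2))
```

Real systems replace the oracle mask with one predicted by a DNN, or, as in the unsupervised approach above, with estimates constrained by parametric source models.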
Contributor: ABES STAR
Submitted on: Tuesday, January 4, 2022 - 5:12:08 PM
Last modification on: Wednesday, January 5, 2022 - 3:06:14 AM


Version validated by the jury (STAR)


  • HAL Id: tel-03511031, version 1



Kilian Schulze-Forster. Informed audio source separation with deep learning in limited data settings. Signal and Image Processing. Institut Polytechnique de Paris, 2021. English. ⟨NNT : 2021IPPAT032⟩. ⟨tel-03511031⟩


