Local audio source separation

Intro

Audio source separation is a longstanding problem, with a myriad of angles of attack. Lately, we’ve seen the rise of diffusion1 and LLM-based models applied to this problem, but since I want this to run locally and be interpretable, I’ll be sticking to more traditional architectures, such as variational autoencoders and deep neural networks.

Architecture

Based on these constraints, I settled on Demucs23 and MelBandRoFormer4. These are well-established approaches, each with its own strengths. There is no canonical way of running inference for either of these models, so I settled on creating a small Python module to export them to ONNX, which can then be used in conjunction with Ort to run local inference.

First off, let’s take a quick look at how these models are structured, the challenges their architectures present to ONNX export, and a quick and easy approach to solving those challenges.

Demucs

Demucs architecture

Demucs is a hybrid model that processes its input in both the time domain and the frequency domain. In short, the time-domain branch operates on the raw waveform samples (real numbers), while the time-frequency branch operates on the spectrogram of that waveform (complex numbers). The spectrogram is immediately converted to a magnitude spectrum, which the convolutional and transformer layers can process, and the result is then inverted back to the time domain. Finally, the two branches are fused to return the separated stems5.
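The frequency-branch preprocessing can be illustrated with a per-frame numpy sketch, assuming a single STFT frame: the complex spectrum is split into a magnitude (fed to the network) and a phase (kept aside so the processed magnitude can be inverted back into a waveform). This is the general magnitude/phase idea, not Demucs’ actual implementation.

```python
# Per-frame sketch of the frequency branch: complex spectrum -> magnitude
# for the network, phase kept aside for reconstruction. Illustrative only.
import numpy as np


def to_magnitude_and_phase(frame: np.ndarray):
    """Split one frame's complex spectrum into magnitude and phase."""
    spectrum = np.fft.rfft(frame)  # complex-valued
    return np.abs(spectrum), np.angle(spectrum)


def from_magnitude_and_phase(magnitude: np.ndarray, phase: np.ndarray, n: int):
    """Rebuild the time-domain frame from a (possibly modified) magnitude."""
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=n)
```

If the magnitude is left untouched, the round trip recovers the original frame exactly, which is a handy sanity check for the export pipeline.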

MelBandRoFormer

MelBandRoFormer architecture

MelBandRoFormer takes a completely different approach from Demucs, making use of a band-split mechanism in which transformer units assign different weights to different frequency bands of the input signal. The frequency bands are chosen to model human perception of sound, following the Mel scale.
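The perceptual spacing comes from the standard mel scale formulas. The sketch below computes band edges equally spaced in mels; the actual band layout in the MelBandRoFormer paper differs, this only shows why low frequencies get narrower bands than high ones.

```python
# Perceptually spaced band edges via the standard mel scale.
# Illustrative: the paper's exact band layout is different.
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(n_bands: int, f_min: float = 0.0, f_max: float = 22050.0):
    """Band edges equally spaced on the mel scale, returned in Hz."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_bands + 1)
    return mel_to_hz(mels)
```

Because the spacing is linear in mels but logarithmic-ish in Hz, the first bands cover only a few hundred Hz while the last span several kHz, matching how we hear.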

ONNX

ONNX6 is the de facto open-source format for all things AI and ML. Its major strengths are the interoperability and out-of-the-box optimizations it provides, which made it a no-brainer for Ripper.

However, there is a major hurdle in using this format for the models I’ve chosen. The use of complex-valued tensors in the previous architectures is completely at odds with the ONNX format, which does not cleanly support complex-valued tensors and operations. The suggested solution is to use a specific ONNX runtime and substitute these operations with matrix multiplications. This would require me to reimplement the whole frequency branch in Rust, which is something I really didn’t want to have to do.

To work around this, I follow the approach taken by Mixxx in their handling of Demucs, and express the Short-Time Fourier Transform (STFT) and Inverse Short-Time Fourier Transform (ISTFT) using pairs of real-valued tensors: one for the cosine (real) components and one for the sine (imaginary) components. This is a standard approach when dealing with complex numbers, and requires only a small amount of Python code.
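The core trick can be sketched in a few lines of numpy, assuming a single windowed frame: the DFT becomes two real-valued matrix products, one against a cosine basis and one against a sine basis, which are exactly the kinds of ops ONNX expresses natively. This is the general technique, not the module’s literal code.

```python
# One DFT frame as two real-valued matmuls (cosine and sine bases),
# instead of a complex FFT -- the form that plain ONNX ops can express.
# Sketch of the general technique, not the actual export code.
import numpy as np


def real_dft_matrices(n: int):
    """Cosine and sine basis matrices for an n-point real-input DFT."""
    k = np.arange(n // 2 + 1)[:, None]  # frequency bins
    t = np.arange(n)[None, :]           # time samples
    angle = 2.0 * np.pi * k * t / n
    return np.cos(angle), -np.sin(angle)


def real_stft_frame(frame: np.ndarray, cos_m: np.ndarray, sin_m: np.ndarray):
    """Real and imaginary parts of the spectrum via plain matmuls."""
    return cos_m @ frame, sin_m @ frame
```

At export time the basis matrices are baked into the graph as constants, so the runtime only ever sees real tensors and `MatMul` nodes.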


  1. The state of GPU programming in Rust is improving by the day, and the work being done by the fine folks at Rust-GPU is inspiring. It has also led me to many a session of banging my head against the wall while wrestling with Vulkan and SPIR-V. ↩︎

  2. Rouard, S., Massa, F., & Defossez, A. (2023). Hybrid Transformers for Music Source Separation. In ICASSP 23. ↩︎

  3. Défossez, A. (2021). Hybrid Spectrogram and Waveform Source Separation. In Proceedings of the ISMIR 2021 Workshop on Music Source Separation. ↩︎

  4. Wang, J.-C., Lu, W.-T., & Won, M. (2023). Mel-Band RoFormer for Music Source Separation. ↩︎

  5. Stems are audio-production shorthand for a collection of similar tracks. Taking drums as an example, a single drums stem might have upwards of 10 tracks (one for the bass drum, one for the snare, one for each cymbal, one for each tom, etc). ↩︎

  6. https://onnx.ai/ ↩︎