Intro
I’ve always been fascinated by music production, and by the skill it takes to hear a sound and almost immediately suss out how it was created and the effect chain it has undergone. It’s a skill I sorely lack, but stubbornly want to learn.
So, to help me out, I’m developing RIPPER (terrible name, I know). I had three non-negotiables when starting to write it up:
- It should run locally. Remote computing can get quite expensive, and since this is more of a pedagogical tool, the results don’t have to be perfect. This heavily influences the approaches I can take, since I have limited computing power. It is also a great opportunity to dive deeper into GPU optimization, a topic I’ve been exploring over the last year.1
- It should be interpretable, or at the very least, I should be able to edit and train my own models if quality is demonstrably lacking.
- It should have a clean and understandable interface, directed towards a learner. Even though I’m a CLI fiend, my audio production workflow is heavily reliant on visual cues and markings.
Design choices and limitations
The pipeline consists of two main stages:
- Audio source separation (AUSS)
- Sound matching and reconstruction
This should allow us to isolate specific sounds and approximate them using common audio building blocks (LFOs, simple oscillators, etc.). AUSS is commonly used to extract stems,2 which can be hard to untangle further, since different instruments are bundled into the same structure. Right now, this is handled by cascading source separation across several different models, but that is an inefficient approach.
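To make the cascading idea concrete, here is a minimal sketch in Rust. The `Separator` trait, the `Scale` stand-in model, and `cascade` are all illustrative names I made up for this post, not RIPPER’s actual API: each model extracts one stem from the running residual, and the residual is what the next model sees.

```rust
/// A source-separation model: given a mixture, return the stem it extracts.
trait Separator {
    fn extract(&self, mixture: &[f32]) -> Vec<f32>;
}

/// Toy stand-in model that "extracts" a fixed fraction of the mixture.
struct Scale(f32);

impl Separator for Scale {
    fn extract(&self, mixture: &[f32]) -> Vec<f32> {
        mixture.iter().map(|&s| s * self.0).collect()
    }
}

/// Run each model on the running residual, subtracting every extracted stem
/// so later models see a progressively simpler mixture.
fn cascade(models: &[Box<dyn Separator>], mixture: &[f32]) -> (Vec<Vec<f32>>, Vec<f32>) {
    let mut residual = mixture.to_vec();
    let mut stems = Vec::new();
    for model in models {
        let stem = model.extract(&residual);
        for (r, s) in residual.iter_mut().zip(&stem) {
            *r -= *s;
        }
        stems.push(stem);
    }
    (stems, residual)
}

fn main() {
    let models: Vec<Box<dyn Separator>> = vec![Box::new(Scale(0.5)), Box::new(Scale(0.5))];
    let (stems, residual) = cascade(&models, &[1.0, 1.0, 1.0]);
    println!("{} stems, residual starts at {}", stems.len(), residual[0]);
}
```

The inefficiency is visible in the structure: every model reprocesses the full-length residual, so the cost grows linearly with the number of stems you want to peel off.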
Audio source separation
In part 2 we’ll do a more technical dive into how we achieve local audio source separation in RIPPER.
Sound matching and reconstruction
I floated several methods for sound reconstruction, but quickly ran into a major issue. Emulating a sound from an electroacoustic instrument such as a guitar, bass, or keyboard requires knowing the instrument’s base timbre and sound profile before it is fed through the signal chain. RIPPER’s pedagogic value is not really affected by this, since it is always valuable to take sounds for which these are known and give the user an idea of the effects being applied. For electronic music this problem is minimized, but there will of course be false positives.
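To illustrate why the dry source matters: once the base sound is known, estimating an effect parameter can be posed as a search that minimizes the mismatch between the processed source and the target. The sketch below recovers a one-pole lowpass coefficient by brute-force grid search; the function names and the search strategy are illustrative stand-ins, not the matcher RIPPER actually uses.

```rust
/// One-pole lowpass: y[n] = a * x[n] + (1 - a) * y[n-1].
fn lowpass(x: &[f32], a: f32) -> Vec<f32> {
    let mut prev = 0.0_f32;
    x.iter()
        .map(|&s| {
            prev = a * s + (1.0 - a) * prev;
            prev
        })
        .collect()
}

/// Mean squared error between two equal-length signals.
fn mse(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>() / a.len() as f32
}

/// Grid-search the coefficient in (0, 1] that best explains `target`
/// given the known dry `source`.
fn match_lowpass(source: &[f32], target: &[f32]) -> f32 {
    (1..=100)
        .map(|i| i as f32 / 100.0)
        .min_by(|&p, &q| {
            mse(&lowpass(source, p), target)
                .partial_cmp(&mse(&lowpass(source, q), target))
                .unwrap()
        })
        .unwrap()
}

fn main() {
    // A bright, alternating "dry" source; the target is that same source
    // run through a lowpass with coefficient 0.3.
    let source: Vec<f32> = (0..64).map(|i| if i % 2 == 0 { 1.0 } else { -1.0 }).collect();
    let target = lowpass(&source, 0.3);
    println!("recovered coefficient: {}", match_lowpass(&source, &target));
}
```

Without the dry source, the same target admits many (timbre, effect) explanations, which is exactly the ambiguity described above.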
In part 3 we’ll do a more technical dive into how we achieve sound reconstruction using DDSP in RIPPER.
UI

1. The state of GPU programming in Rust is improving by the day. The work being done by the fine folks at Rust-GPU is tremendous, and has led me to many a session of banging my head against the wall while grappling with Vulkan and SPIR-V.
2. Stems are audio-production shorthand for a collection of similar tracks. Taking drums as an example, a single drums stem might have upwards of 10 tracks (one for the bass drum, one for the snare, one for each cymbal, one for each tom, etc.).