ALPHA Prototype · waveform decomposition · 11 recordings analysed

Sonic Decomposition.

An AI-assisted toolkit for pulling a recording apart using song-file data, web-based context, and a user's ear. A series of modules progressively extract named components of the waveform — spectra, partials, transients, formants, residual — so that a song or composition can be inspected, reasoned about, or resynthesised on its own terms.

What's new is the workflow, not the algorithms. The signal-processing steps are 1960s–2000s standards: short-time Fourier transform, McAulay-Quatieri sinusoidal partial tracking, linear-predictive coding for formants, harmonic-percussive source separation, chroma plus self-similarity matrix for structure, Krumhansl-Schmuckler key-finding. The contribution is how they're assembled — two engines that don't see each other's evidence, a recurring conference — Φ — that reconciles them and folds each result forward as a prior, a deliberate refusal to name anything until enough independent fixes line up, and a diagnostic residual (D1) that audits every commitment by trying to rebuild the recording from it.
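Of the standard steps listed above, Krumhansl-Schmuckler key-finding is compact enough to show whole. What follows is a minimal sketch, not the project's own code: it assumes a 12-bin chroma vector (index 0 = C) has already been computed elsewhere, the names `pearson` and `estimate_key` are illustrative, and the profile values are the commonly cited Krumhansl-Kessler probe-tone ratings.

```python
from math import sqrt

# Krumhansl-Kessler probe-tone profiles (index 0 = tonic).
MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def estimate_key(chroma):
    """Return (tonic, mode, r) for the best-correlating of the 24 keys."""
    best = None
    for tonic in range(12):
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            # Rotate the profile so its tonic sits at pitch class `tonic`.
            rotated = [profile[(pc - tonic) % 12] for pc in range(12)]
            r = pearson(chroma, rotated)
            if best is None or r > best[2]:
                best = (NOTES[tonic], mode, r)
    return best
```

The correlation value is returned alongside the key so that a weak best match can be held as provisional rather than named.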

Waveform decomposition · Sinusoidal partial tracking · Component-level resynthesis · Diagnostic residual · Two-engine architecture
01 Architecture

The Flow.

The run has a shape: a boot, one fixed orientation pass, a flat transcription stage, and a conference the whole thing spins around. At Boot 0 both engines and the Φ conference layer load and stay resident; they are the foundation every later step re-enters. Orientation then runs both engines once, shallow, to set the first priors. After that the middle is a flat, user-led stage: tones, hits, voice, audit and structure are run in whatever order the work wants (percussion tends to go early, voice late; a preference, not a rule), and each one is the same two engines re-entered deeper, not a new machine. Every round ends in Φ, the conference: binary, web and user reconcile, a name is committed, and the resolution folds forward as a prior for the next round.

[Figure: Simple run map. A boot, one orientation, then the transcription stage spinning around the Φ conference.]
  1. 0 · Boot 0 · resident foundation: both engines + the Φ conference layer load & stay resident
  2. Orientation: the one fixed pass
    Binary ‖ Web · User: both engines run once, shallow → first Φ
  3. Transcription stage · any order · your call
  4. Tone
    binary ‖ web · user: trajectory mode
  5. Hits
    binary ‖ web · user: percussion & onset
  6. Voice
    binary ‖ web · user: vocal & formants
  7. Audit
    binary ‖ web · user: D1, resynth & residual
  8. Structure
    binary ‖ web · user: structure & harmony
  9. Φ · the pinwheel: recurs after every module, folds findings forward as priors
Boot 0 loads the .mp3 / .wav plus artist & title; both engines + Φ layer are held resident (25 files).


Boot 0 loads first and stays resident; Orientation is the one fixed pass; Tone, Hits, Voice, Audit and Structure are a flat stage run in any order; every round ends in the Φ conference, which folds its findings forward as priors.
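The run shape can be sketched as a small loop. Everything named here (`run`, `conference`, the list-shaped priors, the stand-in engine callables) is hypothetical and only illustrates the ordering: orientation once, then the stages in the user's order, each closed by a Φ conference whose resolution becomes a prior for the next round. The user's part in the conference is left out of this sketch.

```python
STAGES = {"tone", "hits", "voice", "audit", "structure"}


def conference(binary_claim, web_claim, priors):
    """Stand-in for Φ: record both claims side by side as the resolution.

    A real conference would also take the user's input and decide what to
    name; this only marks where the resolution is produced.
    """
    return {"binary": binary_claim, "web": web_claim, "round": len(priors)}


def run(binary_engine, web_engine, order):
    """Orient once, then run the stages in the order the user chose."""
    unknown = [s for s in order if s not in STAGES]
    if unknown:
        raise ValueError(f"unknown stage(s): {unknown}")

    priors = []  # resolutions folded forward, oldest first

    # Orientation always comes first; the other stages follow `order`.
    for stage in ["orientation", *order]:
        # Each engine sees the priors, never the other engine's claim.
        binary_claim = binary_engine(stage, tuple(priors))
        web_claim = web_engine(stage, tuple(priors))
        resolution = conference(binary_claim, web_claim, priors)
        priors.append((stage, resolution))  # fold forward
    return priors
```

Passing the priors as a tuple is a small design choice: an engine can read what earlier rounds resolved but cannot rewrite it.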
02 Modules

What each module does.

The modules form the pipeline for pulling a raw sound file apart for musical analysis. A series of paired modules extracts components of the waveform (spectra, partials, transients, formants, residual) and interprets them alongside web- and user-derived context, so that each musical component can later be inspected, resynthesised, and reasoned about on its own terms.

The engine runs two modules in tandem: a binary module (the A side), which reads the audio file, and a web/user module (the B side). A moment of conference (the chat) then takes what was gleaned from the binary data, the cultural context, and any user input, and reconciles what is true across all domains before the next module begins (drums are here in the mix, the lead vocalist is a woman, etc.).

The engine holds attributes provisionally: recognised as repeated and regular, but not yet given a specific name. Each conference is where the user directs the engine to commit a name, based on the combined evidence from the binary side (what's measurable in the audio), the web side (what's credited, claimed, written about), and the user side (what the human somatically or experientially confirms or corrects). The conference recurs at the end of every module pass, connecting what was gleaned to what was discussed, and each round of conference and reconciliation hands its resolutions forward as priors for the next round. In this way it is a collaborative effort: the more effort a user brings to the space, the stronger the final analytical output will be.
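One way to picture "held provisionally, named only at conference" is the sketch below. It is an illustration under stated assumptions: the `Attribute` record, the `confer` function, and the two-source agreement threshold are all hypothetical, and the three sources are treated as equal votes, which the project itself does not claim.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Attribute:
    """A recurring pattern the engine has noticed but not yet named."""
    handle: str                       # neutral placeholder, e.g. "pattern-07"
    name: Optional[str] = None        # stays None until a conference commits
    claims: dict = field(default_factory=dict)  # source -> proposed name


def confer(attr, min_agreeing=2):
    """Commit a name only if enough sources propose the same one.

    Sources are "binary", "web" and "user". The threshold is an assumption
    of this sketch, not a documented project setting.
    """
    if not attr.claims:
        return attr
    name, votes = Counter(attr.claims.values()).most_common(1)[0]
    if votes >= min_agreeing:
        attr.name = name
    return attr
```

A usage example: with only the binary side proposing "snare", `confer` leaves `name` as `None`; once the web side proposes the same label, the name is committed. If the two sides disagree, nothing is committed and the disagreement stays visible in `claims`.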

03 Plates

Working figures — leftovers.

Two figures that didn’t slot into any module — preserved here as primary-source field notes.

Plate 10. Single-session vocal pass. Engines used: Phase 1: orientation → extraction → refinement.
One vocal session, full Phase-1 pipeline: every Binary Engine module’s read over the same recording.
Plate 37. Coherence snapshot (extended). Engines used: same fold detection · extended-window rendering.
The same snapshot, longer window. The agreement is brief, perhaps a second, but the conditions leading into and out of it are part of the reading.
04 What it’s for

Two recordings, compared by component.

Because every extraction is named and traceable, two recordings can be compared not at the level of an embedding similarity but at the level of specific components: do their formant trajectories overlap? do their percussion timings sit in the same ratio? does one’s residual look like the other’s?
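As one concrete reading of "do their percussion timings sit in the same ratio?", the sketch below compares two onset lists after dividing each inter-onset gap by the total span, so that a tempo difference alone does not count as a difference. The function names and the tolerance are hypothetical, and onset times (in seconds) are assumed to come from an earlier percussion pass.

```python
def interval_ratios(onsets):
    """Inter-onset gaps divided by their total span (tempo-invariant)."""
    if len(onsets) < 2:
        raise ValueError("need at least two onsets")
    gaps = [b - a for a, b in zip(onsets, onsets[1:])]
    span = sum(gaps)
    if span <= 0 or any(g < 0 for g in gaps):
        raise ValueError("onsets must be increasing")
    return [g / span for g in gaps]


def same_timing_ratio(onsets_a, onsets_b, tol=0.02):
    """True if both patterns have the same number of gaps and matching ratios."""
    ra, rb = interval_ratios(onsets_a), interval_ratios(onsets_b)
    return len(ra) == len(rb) and all(abs(x - y) <= tol for x, y in zip(ra, rb))
```

A pattern played at half speed compares as equal under this measure, while an evenly spaced pattern does not match a long-short-short one.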

The intended workload is structural diffing across recordings — pointing at the specific component that two otherwise-different recordings share, and folding whatever the engine learned back into the spectral roster so subsequent passes start one step further on.

05 Status

Where the prototype stands.

24 components: named, addressable parts the binary engine can extract and resynth (e.g. snare body, vocal F2, kick sub)
64 sonic fingerprints: recurring signal-shapes catalogued across analysed recordings (gated reverb, glue-comp pumping, etc.)
58 genres mapped: genres with at least metadata, credits, and reception-language entries in the web-engine library
20 genre baselines: genres with a full set of expected production traits the conference can run markedness checks against (LLM-summarised, not corpus-measured; see Step 4 in Web Engine)

The framework is in alpha. Eleven recordings have been analysed end-to-end. The Binary Engine modules — spectral roster, trajectory mode, percussion, vocal LPC, D1 resynthesis, structure — ship as text-spec plus runnable Python. The Web Engine ships the same way. The Conference’s three filters are implemented at every Φ conference.

Conversational ground truth ranks above engine values. The engine is allowed — encouraged — to surface a coherent hypothesis even when it can't yet prove the claim from its own parameters. The residual decides which hypotheses to keep.

Honest failure mode: the confirmation engine.
The rule above — felt response outranks engine values — is also the rule that lets this pipeline become a machine for agreeing with the user's ear. If the user is confident, the engine has a structural incentive to recover that confidence in the residual. We name this here rather than letting a reader catch it: the conference is bidirectional in principle, but a careless run will drift toward whatever the user heard first.

Three things push back against the drift in practice. (i) D1 resynthesis: every commitment has to survive a rebuild of the recording from the named components. If a singer is committed as male and D1 returns a clearly female resynth, the commit is forced back open even when the user is confident. (One such hold-out is recorded in the project's prediction-accuracy log against the Coltrane / Hartman session, where the engine refused a vibrato attribution the user had assumed.) (ii) The web/binary independence: when the two sides disagree before the conference, the user is shown both claims and the disagreement itself before being asked which to commit. (iii) The residual is a one-way audit — the user can't talk it down; either the rebuild accounts for the recording or it doesn't.
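The rebuild-and-audit idea in (i) and (iii) can be sketched numerically. The source does not say how D1 scores a rebuild, so everything below is an assumption made for illustration: components are modelled as plain sinusoids, the score is residual energy relative to the recording's energy in dB, and the names `synthesize` and `residual_db` are hypothetical.

```python
import math


def synthesize(components, n_samples, sample_rate):
    """Sum of sinusoids; each component is (freq_hz, amplitude, phase_rad)."""
    out = [0.0] * n_samples
    for freq, amp, phase in components:
        w = 2.0 * math.pi * freq / sample_rate
        for n in range(n_samples):
            out[n] += amp * math.sin(w * n + phase)
    return out


def residual_db(recording, rebuilt):
    """Residual energy relative to the recording's energy, in dB.

    Strongly negative means the rebuild accounts for most of the signal;
    near 0 dB means it explains almost nothing.
    """
    if len(recording) != len(rebuilt):
        raise ValueError("signals must have equal length")
    sig = sum(x * x for x in recording)
    res = sum((x - y) ** 2 for x, y in zip(recording, rebuilt))
    if sig == 0.0:
        raise ValueError("recording is silent")
    if res == 0.0:
        return float("-inf")
    return 10.0 * math.log10(res / sig)
```

In this toy form, a commitment that omits a real component leaves that component's energy in the residual, and no amount of user confidence changes the number.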

None of these eliminate the failure mode. They bound it.
06 Colophon

Working notes, not a product page.

Figures are reproduced from the sessions in which they were made. The point of writing them up here is to give the named components a place to be inspected rather than to advertise the pipeline.

Built with a computational collaborator. Technical specs and runnable code live in the project repository alongside liner-notes.md.

You don't have to like Dilla to cowork here but it helps.