I host and edit a podcast¹. When recording remotely, we each record our own audio locally (I on my end, my co-host on his). The service we use (Adobe Podcast, Zoom, Skype — RIP) also captures everyone together as a master track, but the quality doesn’t match what each person records locally with their own microphone. So we use that master as a reference point and stitch the individual local tracks together.

This is what the industry calls a “double-ender”. Add a guest and it becomes a “triple-ender”.

But this gets hairy during editing: each person hits record at a slightly different moment, so every track starts at a different time.

Before I can edit, I need to line everything up. Drop all the tracks into a DAW, play the master alongside each individual track, nudge by ear until the speech aligns. Add a guest and it gets tedious fast. 10–15 minutes of fiddly, ear-straining alignment before I’ve even started editing.

There’s also drift. Each machine’s audio clock runs at a slightly different rate, so two tracks that are perfectly aligned at minute one might be 200ms apart by minute sixty.
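To get a feel for the numbers, here’s a back-of-the-envelope sketch. The 56 ppm clock mismatch is an assumed figure for illustration, not a measurement from my setup:

```python
# Illustrative arithmetic: two machines both claim a 48 kHz sample rate,
# but real crystal oscillators deviate by a few tens of parts per million.
nominal_rate = 48_000   # samples per second (both machines' nominal rate)
ppm_error = 56          # assumed clock mismatch, in parts per million
seconds = 60 * 60       # one hour of recording

# After an hour, the faster clock has produced this many extra samples:
extra_samples = nominal_rate * seconds * ppm_error / 1_000_000
drift_ms = extra_samples / nominal_rate * 1000
print(f"{drift_ms:.0f} ms of drift after one hour")  # ~200 ms
```

A mismatch far too small to notice in a clock is very audible as a phase-y echo by the end of an episode.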

So I built Podsync².

I’ve wanted this since 2019 #

I first heard of a similar technique from Marco Arment — back in ATP episode 25. He had a new app for aligning double-ender tracks and was already thinking about whether something so niche was even worth releasing publicly. I don’t think he ever released it.

Being a Kotlin developer at the time, I figured I’d build my own. Java was mature. Surely there were audio processing libraries that could handle this.

There weren’t 😅. At least not in any clean, usable form. Getting the right signal processing pieces together in JVM-land was awkward enough that my interest fizzled, so I kept doing it by hand.

’tis the age of AI #

When I revamped Fragmented, I finally came back to this. I used Claude to help me build it — in Rust, no less.³

But before you chalk this up to another vibecoded project, hear me out. The interesting part here wasn’t just that AI made it easier. It was thinking through the actual algorithm:

- Voice activity detection (VAD) to find speech regions.
- MFCC features to fingerprint the audio.
- Cross-correlation to find where the tracks match.

Real signal processing techniques, not just prompt engineering. Now, could I have prompted my way to a solution? Probably. But I like to think that years of manually aligning tracks — and some sound-engineering intuition — helped me steer the AI towards a better solution.
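The matching step can be sketched in a few lines. This is a toy illustration, not the actual Rust implementation: real Podsync correlates MFCC frame vectors, while here the “features” are made-up one-dimensional values, just to show the sliding-correlation mechanics:

```python
import math

def normalized_correlation(a, b):
    """Cosine-style similarity between two equal-length sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def best_offset(master, chunk):
    """Slide `chunk` across `master`; return (best_lag, best_score)."""
    best = (0, -1.0)
    for lag in range(len(master) - len(chunk) + 1):
        window = master[lag:lag + len(chunk)]
        score = normalized_correlation(window, chunk)
        if score > best[1]:
            best = (lag, score)
    return best

# Toy data: pretend each number is one audio frame's feature value.
master = [0.1, 0.0, 0.2, 0.9, 0.4, 0.8, 0.1, 0.0]
chunk = [0.9, 0.4, 0.8]           # the same speech, recorded locally
lag, score = best_offset(master, chunk)
print(lag, round(score, 2))       # prints: 3 1.0 — the chunk lines up 3 frames in
```

The lag with the highest correlation is the time offset between the two recordings, in frames.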

Working on this felt refreshing.

In an era where half the conversation is about AI replacing engineering work, here’s a problem where the hard part is still the problem itself — understanding the domain, picking the right approach, knowing what “correct” sounds like.

It gives me confidence that solving real problems well still has its place. I like how Dax put it:

thdxr on Twitter: “I really don’t care about using AI to ship more stuff. It’s really hard to come up with stuff worth shipping.”

How it works #

The core idea: take a chunk of speech from a participant track, compare it against the master recording, find where they match best. That position is the time offset.

The trick is picking which chunk of speech to use. Rather than betting on a single region, Podsync finds a few strong candidates per track (longer contiguous speech blocks preferred) and tries each one against the master. For long candidates, it samples from the start, middle, and end. The highest-confidence match wins; if a second independent region agrees on the same offset, that corroboration factors in as a tie-breaker.
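The candidate strategy above might look something like this sketch. The names, tolerance, and tuple shape are all invented for illustration; only the idea (highest confidence wins, with agreement from a second region as corroboration) comes from how Podsync actually behaves:

```python
def choose_offset(candidates, tolerance=2):
    """Pick an offset from several speech-region matches.

    candidates: list of (offset, confidence) pairs, one per speech region
    tried against the master. Returns (offset, confidence, corroborated).
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_offset, best_conf = ranked[0]
    # Corroboration: does an independent region land on (nearly) the same offset?
    corroborated = any(
        abs(off - best_offset) <= tolerance for off, _ in ranked[1:]
    )
    return best_offset, best_conf, corroborated

print(choose_offset([(1203, 0.97), (1204, 0.94), (87, 0.61)]))
# → (1203, 0.97, True): two regions agree within tolerance
```

When two regions recorded minutes apart agree on the same offset, a spurious match (say, against similar-sounding background noise) becomes much less likely.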

After finding the offset, Podsync pads or trims each track to align with the master and match its length (and outputs some info on the offset).
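The pad-or-trim step is simple once the offset is known. A minimal sketch, assuming a positive offset means the participant hit record after the master started (function name and sign convention are mine, not Podsync’s):

```python
def align_to_master(track, offset, master_len, silence=0.0):
    """Shift a track by `offset` samples, then force it to `master_len`.

    Positive offset: the track started late, so pad the front with silence.
    Negative offset: the track started early, so trim the front.
    """
    if offset >= 0:
        aligned = [silence] * offset + track
    else:
        aligned = track[-offset:]
    # Match the master's length: trim the tail, or pad it with silence.
    if len(aligned) > master_len:
        aligned = aligned[:master_len]
    else:
        aligned = aligned + [silence] * (master_len - len(aligned))
    return aligned

track = [1, 2, 3, 4, 5]
print(align_to_master(track, 2, 8))   # [0.0, 0.0, 1, 2, 3, 4, 5, 0.0]
print(align_to_master(track, -1, 3))  # [2, 3, 4]
```

Because every output track has the master’s exact length, they can all be dropped into the DAW at 0:00 with no further nudging.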

What it looks like #

podsync \
  --master  "304-src-ap.mp3" \
  --tracks  "304-iury-src-cleaned.wav" \
  --tracks  "304-kg-src-logic-cleaned.wav"

Drop the output into my DAW at 0:00. Done.

I even wrote an agent skill you can just point your agent harness to and it will take care of all the steps for you:

[Screenshot: Podsync output showing both tracks synced at confidence 1.00 with zero drift]

What used to be 10–15 minutes of alignment per episode is now a single command.

Addendum #

Marco, if you ever read this: I’d still love to see your implementation!

His solution (as I understand it) is aimed more at correcting drift than at finding the initial offset. In practice, I haven’t found drift to be much of a problem: it exists but stays minor, and I’m typically editing every second of the podcast anyway, so it’s easy enough to handle by hand. I even had a branch that corrected drift by splicing at silence points, but it complicated things more than it helped.


  1. It’s a podcast on AI development, but we strive to make it high-signal. None of that masturbatory AI discourse.↩︎

  2. See also Phone-sync↩︎

  3. I chose Rust because it’s what interests me these days, and a CLI tool with no runtime dependencies is more pleasant to distribute.↩︎