Building a macOS App with Handwritten Code in 2026
Just kidding — this would never have been possible as a one-person project without AI coding agents.
MimicScribe is a macOS menu bar app for meeting transcription. On-device speech recognition, speaker diarization, echo cancellation, real-time AI assistant. One developer, 1,200 commits over eight months, roughly 70,000 lines of Swift. A couple years ago, this would have needed a small team and a year-plus timeline. That part has genuinely changed. What hasn’t changed is everything that makes it hard.
Most of what this app does is stitch together work from major research labs and make it accessible. NVIDIA’s Parakeet model for speech recognition. Google’s Gemini for text refinement and meeting intelligence — ASR models produce phonetically accurate text but miss punctuation, capitalization, and domain-specific terms, so an LLM cleans things up. A DTLN-aec model from university researchers for echo cancellation. None of this is my research. The value is in making these things work together reliably, on a real person’s Mac, during a real meeting.
Here’s how to build a transcription app this weekend
Seriously — you can do this in a weekend. The first version of this project was a Python script: Whisper for transcription, a handful of libraries for audio capture, an LLM call to extract action items. It worked. For a five-minute voice memo in a quiet room, it worked really well. You press record, you wait, you get a transcript. The model does the hard part.
The trouble starts when someone tries to use it in a real meeting. Two hours long. Multiple speakers. System audio from a video call mixed with a laptop microphone. Suddenly “feed buffers to the model” isn’t a single function call anymore.
Streaming ASR: the problem nobody warns you about
The speech recognition model processes audio in fixed windows — about 15 seconds at a time. For a short recording, you just run the whole thing when the user stops. For a 90-minute meeting, that’s not an option. You need to process overlapping windows in real time as audio comes in, then stitch them together without dropping words at the boundaries.
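The windowing itself is simple to sketch. Here is a minimal Python illustration (the app is Swift, and the 10-second hop and 16 kHz rate are assumptions for the example, not the app's actual parameters):

```python
def window_offsets(total_samples, rate=16_000, window_s=15.0, hop_s=10.0):
    """Yield (start, end) sample offsets for overlapping ASR windows.

    window_s is the model's fixed context. Because hop_s < window_s,
    consecutive windows share (window_s - hop_s) seconds of audio,
    which gives the merge step overlapping tokens to align on.
    """
    window = int(window_s * rate)
    hop = int(hop_s * rate)
    start = 0
    while start < total_samples:
        yield start, min(start + window, total_samples)
        if start + window >= total_samples:
            break  # this window already covers the end of the audio
        start += hop
```

Everything hard lives downstream of this loop: the windows are easy to produce, and very easy to merge wrongly.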
January 18 — the first working implementation. Overlapping windows with timestamp-based merging. It works on test audio. Ship it, start testing with real meetings.
January 21 — audio buffers filling from Core Audio on one thread while the model reads them on another. In most apps, a race condition crashes. In audio processing, it silently produces garbage output. You don’t notice until someone’s transcript is missing a paragraph.
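The fix is unglamorous: one lock around every access to the shared buffer. A minimal Python sketch of the producer/consumer shape, assuming a capture thread that appends and a model thread that drains (the real app sits on Core Audio callbacks and Swift concurrency, not Python threads):

```python
import threading

class SharedAudioBuffer:
    """Samples appended by the capture thread, drained by the ASR thread.

    Without the lock, a drain racing an append can hand the model a
    buffer whose tail is half-written: no crash, just silently
    corrupted audio and a transcript missing words.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._samples = []

    def append(self, chunk):
        # Called from the audio-capture callback.
        with self._lock:
            self._samples.extend(chunk)

    def drain(self):
        # Called from the model's read loop; returns and clears the buffer.
        with self._lock:
            out, self._samples = self._samples, []
            return out
```

The design choice worth noting: `drain` swaps the list out atomically instead of copying it element by element, so the lock is held for as little time as possible while the audio callback is waiting.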
January 29 — word loss at the 15-second window boundaries. The merge algorithm drops tokens that fall right at the seam between two windows. Someone says “the quarterly revenue report” and the transcript reads “the revenue report.” The fix isn’t obvious because the words aren’t missing — they’re present in one window but get discarded during the merge because their timestamps overlap with the next window’s context region.
February 12 — the third rewrite of the merge algorithm. Anchor tokens to known-good positions in each window, then merge outward. Handle silence gaps that throw off timestamp alignment — if someone pauses for ten seconds mid-sentence, the old algorithm lost its bearings entirely.
February 13 — the final-chunk merge. When the user stops recording, re-process the last segment with full context and run a suffix-prefix merge against the streaming output. This is where accuracy goes from “pretty good” to “I’d trust this.” The streaming windows are optimized for speed; the final pass is optimized for correctness.
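A suffix-prefix merge is easy to state: find the longest suffix of the streaming transcript that matches a prefix of the re-decoded final chunk, and let the final pass win over that overlap. A minimal sketch on token lists, assuming exact matching (the real merge presumably tolerates small disagreements between the two decodes):

```python
def suffix_prefix_merge(streaming, final, max_overlap=50):
    """Splice a re-decoded final chunk onto the streaming transcript.

    streaming and final are token lists. The longest suffix of
    streaming that equals a prefix of final is replaced by final's
    (higher-accuracy) version of those tokens.
    """
    for k in range(min(max_overlap, len(streaming), len(final)), 0, -1):
        if streaming[-k:] == final[:k]:
            return streaming[:-k] + final
    return streaming + final  # no overlap found: just append
```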
Five commits across a month. The weekend prototype doesn’t have any of this — and it doesn’t need to, until a real user records a real meeting and asks why half a sentence is missing on page three.
Echo cancellation: three tries
When you record a meeting on a laptop, the system audio — everyone else on the call — plays through the speakers and gets picked up by the microphone. Your speech model transcribes both the clean system audio and the degraded echo from the mic, producing duplicate text with garbled overlaps.
Attempt one was WebRTC AEC3, the industry standard for echo cancellation in communication apps. It technically worked, but required shipping a compiled C++ library, had platform-specific deployment issues, and the confidence-based echo suppression needed constant tuning. Five commits in a single day trying to get the thresholds right.
Attempt two was simpler: just mute the microphone signal when echo is detected. Too aggressive. It killed the user’s real speech whenever it overlapped with system audio, which in a conversation is most of the time.
Attempt three was DTLN-aec — a neural network trained specifically for echo cancellation, running as a CoreML model alongside the speech recognition model. No tuning knobs, no C++ dependency, no threshold adjustments. It just works. Shipped it about seven weeks after the first WebRTC attempt.
Sometimes the right answer is to use a better model. But I had to try the wrong answers first to know that. An LLM can generate the integration code for any of these three approaches in minutes — and it will confidently recommend whichever one you ask about. The same AI that helped me integrate WebRTC would have happily helped me skip it entirely if I’d known to ask for DTLN-aec from the start. It doesn’t know what works in production, because that depends on factors that don’t show up in documentation. You still have to run the meeting, listen to the output, and hear the problem yourself.
If you’re curious what this looks like in practice, there are demos of the echo cancellation and meeting assistant on YouTube.
What people are paying for
Every one of those edge cases — word loss at boundaries, echo duplication, silence gaps — becomes a regression test. Because the next time you refactor the merge algorithm, and there will be a next time, you need to know right away if you’ve brought back a bug that took weeks to find originally. AI-assisted refactors are fast, which also means they break things fast.
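One concrete shape such a regression test can take: a fixture transcript plus the phrases that once went missing at a window seam, asserted against the merged output. A simplified Python sketch, not the app's actual test suite:

```python
def check_no_boundary_loss(transcript_words, must_appear):
    """Regression check: every phrase that once vanished at a window
    seam must appear, in order, in the merged transcript."""
    for phrase in must_appear:
        words = phrase.split()
        n = len(words)
        found = any(transcript_words[i:i + n] == words
                    for i in range(len(transcript_words) - n + 1))
        assert found, phrase  # fail loudly with the missing phrase
```

Run against the January 29 case, a correct merge passes and the old bug fails:

```python
check_no_boundary_loss(
    "the quarterly revenue report is due".split(),
    ["the quarterly revenue report"])   # passes

check_no_boundary_loss(
    "the revenue report is due".split(),
    ["the quarterly revenue report"])   # raises: the seam ate "quarterly"
```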
Users don’t see the merge algorithm or the echo cancellation pipeline. They see: I pressed the button and it worked. That’s what they’re paying for.
This app was built across a turning point in the tools. Early on, I was reading and correcting every generated function — making sure it used the right concurrency patterns, catching mistakes before they piled up. Then midway through, a new generation of coding models came out — Anthropic’s Sonnet 4.5, OpenAI’s Codex 5.2 — and things changed. I stopped co-writing code and started describing what I wanted, then reviewing the output. What one person could produce went way up. But knowing what to build, what to test, when to ship — that didn’t change at all.