Technology
MimicScribe runs speech recognition on-device using Apple Silicon.
Audio Pipeline
Audio Capture
Microphone input is captured via Audio Queue Services (AudioToolbox). System audio for meeting recording uses Core Audio process taps (CATapDescription + AudioHardwareCreateProcessTap), which capture at the OS level without injecting into individual applications. Both paths produce 16 kHz mono Float32 PCM.
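A minimal sketch of the microphone path, assuming Audio Queue Services with the format described above (the buffer count and size here are illustrative, not MimicScribe's actual values):

```swift
import AudioToolbox

// 16 kHz mono Float32 PCM, matching the format both capture paths produce.
var format = AudioStreamBasicDescription(
    mSampleRate: 16_000,
    mFormatID: kAudioFormatLinearPCM,
    mFormatFlags: kAudioFormatFlagIsFloat | kAudioFormatFlagIsPacked,
    mBytesPerPacket: 4, mFramesPerPacket: 1, mBytesPerFrame: 4,
    mChannelsPerFrame: 1, mBitsPerChannel: 32, mReserved: 0)

let callback: AudioQueueInputCallback = { _, queue, buffer, _, _, _ in
    // buffer.pointee.mAudioData holds mAudioDataByteSize / 4 Float32 samples;
    // hand them to the downstream pipeline, then re-enqueue the buffer.
    AudioQueueEnqueueBuffer(queue, buffer, 0, nil)
}

var queue: AudioQueueRef?
AudioQueueNewInput(&format, callback, nil, nil, nil, 0, &queue)
for _ in 0..<3 {                              // a few in-flight buffers (count is illustrative)
    var buffer: AudioQueueBufferRef?
    AudioQueueAllocateBuffer(queue!, 4096, &buffer)
    AudioQueueEnqueueBuffer(queue!, buffer!, 0, nil)
}
AudioQueueStart(queue!, nil)
```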
Echo Cancellation
During meeting recording, the microphone picks up both the local speaker and remote participants' audio from the speakers. A DTLN (Dual-signal Transformation LSTM Network) model runs on CoreML to remove the system audio loopback from the microphone signal before it reaches the ASR model, using the system audio stream as a reference. No AGC or noise suppression is applied — these degrade ASR accuracy.
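A rough sketch of how the reference signal is used, with assumed block sizes and the model call reduced to a placeholder closure (the real DTLN pipeline works on windowed FFT frames with overlap-add and carries LSTM state inside the model):

```swift
// Block and hop sizes are assumptions, not dtln-aec-coreml's actual parameters.
let blockSize = 512
let hopSize = 128

func cancelEcho(mic: [Float], reference: [Float],
                process: (ArraySlice<Float>, ArraySlice<Float>) -> [Float]) -> [Float] {
    var cleaned: [Float] = []
    var start = 0
    while start + blockSize <= min(mic.count, reference.count) {
        let range = start ..< start + blockSize
        // The model uses the time-aligned system-audio block as a reference for what to
        // subtract from the mic block, leaving only the local speaker.
        cleaned.append(contentsOf: process(mic[range], reference[range]).prefix(hopSize))
        start += hopSize
    }
    return cleaned
}
```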
Watch: Real-Time Neural Echo Cancellation on macOS
Speech Recognition
Transcription uses NVIDIA's Parakeet TDT 0.6B model, a Token-and-Duration Transducer (TDT) with a FastConformer encoder, converted to CoreML format. The model supports 25+ languages, and all inference runs locally on Apple Silicon. During meeting recording, dual parallel ASR workers process the microphone and system audio streams independently.
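As a sketch of the dual-worker arrangement only (the `runASR` entry point is a stand-in, not the actual ASR API), the two streams can be transcribed concurrently with structured concurrency:

```swift
// Sketch only: `runASR` stands in for the real Parakeet/CoreML transcription entry point.
func runASR(_ samples: [Float]) async -> String {
    // ...feed samples through the CoreML/BNNS transcription stack...
    return ""
}

// Two independent workers, one per stream, running in parallel.
func transcribeMeeting(mic: [Float], system: [Float]) async -> (mic: String, system: String) {
    async let micText = runASR(mic)
    async let systemText = runASR(system)
    return (mic: await micText, system: await systemText)
}
```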
For streaming transcription, a custom harness processes audio in overlapping windows and merges results across window boundaries to maintain accuracy at chunk edges. The decoder, joint network, and preprocessor run through an accelerated BNNS backend, bypassing CoreML's per-inference overhead for lower latency.
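For a sense of the windowing, a minimal sketch with assumed window and overlap lengths (the real harness's parameters and merge rules aren't spelled out here):

```swift
let sampleRate = 16_000
let windowSamples = 15 * sampleRate          // assumed window length
let overlapSamples = 2 * sampleRate          // assumed overlap between adjacent windows
let hopSamples = windowSamples - overlapSamples

func windows(_ audio: [Float]) -> [ArraySlice<Float>] {
    stride(from: 0, to: audio.count, by: hopSamples).map { start in
        audio[start ..< min(start + windowSamples, audio.count)]
    }
}
// Each window is transcribed independently; tokens whose timestamps fall inside the
// shared overlap are reconciled (e.g. keep the copy farther from its window's edge),
// so words straddling a chunk boundary aren't dropped or duplicated.
```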
Speaker Diarization
After a meeting ends, an offline diarization pipeline runs in the background while the live transcript is already displayed. A pyannote-based segmentation model identifies speaker turn boundaries, then a WeSpeaker v2 embedding model extracts 256-dimensional voice embeddings for each segment. Clustering uses agglomerative hierarchical clustering (AHC) with centroid linkage, refined by a Variational Bayes HMM (VBx) pass over PLDA-whitened features.
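To make the clustering step concrete, here is a toy sketch of centroid-linkage AHC over cosine distance; the merge threshold is an assumption, and the VBx/PLDA refinement pass is omitted entirely:

```swift
import Foundation

func cosineDistance(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let na = sqrt(a.reduce(0) { $0 + $1 * $1 })
    let nb = sqrt(b.reduce(0) { $0 + $1 * $1 })
    return 1 - dot / (na * nb)
}

func centroid(_ members: [[Float]]) -> [Float] {
    var sum = [Float](repeating: 0, count: members[0].count)
    for m in members { for i in m.indices { sum[i] += m[i] } }
    return sum.map { $0 / Float(members.count) }
}

func clusterSpeakers(_ embeddings: [[Float]], threshold: Float = 0.35) -> [[Int]] {
    var clusters = embeddings.indices.map { [$0] }       // each segment starts as its own cluster
    while clusters.count > 1 {
        var best = (i: 0, j: 0, d: Float.infinity)
        for i in clusters.indices {
            for j in (i + 1)..<clusters.count {
                let d = cosineDistance(centroid(clusters[i].map { embeddings[$0] }),
                                       centroid(clusters[j].map { embeddings[$0] }))
                if d < best.d { best = (i: i, j: j, d: d) }
            }
        }
        guard best.d < threshold else { break }          // stop once remaining clusters are distinct speakers
        clusters[best.i] += clusters[best.j]             // merge the two closest clusters
        clusters.remove(at: best.j)
    }
    return clusters                                      // one cluster per inferred speaker
}
```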
Clustered embeddings are matched against stored speaker profiles using cosine similarity. When a confident match is found, the speaker is identified automatically. Ambiguous cases are resolved by an LLM disambiguation step using conversation context. Profiles accumulate samples across meetings, improving recognition over time.
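A minimal sketch of that matching step, assuming a flat similarity threshold (the 0.6 value is illustrative) and using Accelerate for the dot products:

```swift
import Accelerate

// A stored profile: display name plus a 256-dimensional WeSpeaker voice embedding.
struct SpeakerProfile {
    let name: String
    let embedding: [Float]
}

func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    vDSP.dot(a, b) / (vDSP.sumOfSquares(a).squareRoot() * vDSP.sumOfSquares(b).squareRoot())
}

// A cluster whose best profile match clears the threshold is labeled automatically;
// anything below it stays unnamed and falls through to LLM disambiguation.
func identify(cluster: [Float], against profiles: [SpeakerProfile], threshold: Float = 0.6) -> String? {
    let best = profiles
        .map { ($0.name, cosineSimilarity(cluster, $0.embedding)) }
        .max { $0.1 < $1.1 }
    guard let best, best.1 >= threshold else { return nil }
    return best.0
}
```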
LLM Post-Processing
After transcription, text is sent to Google's Gemini 3 Flash for refinement: grammar correction, filler word removal, and tone adjustment. Meeting transcripts are additionally processed for speaker attribution, summarization, and action item extraction. Static prompt content is passed as system instructions to take advantage of Gemini's implicit caching, reducing latency on repeated calls.
Only text is sent to the API — audio remains on-device. This step requires an internet connection; raw transcription works fully offline.
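For shape only, a call against the public Gemini REST API looks roughly like this; the model identifier, prompt text, and exact field names are placeholders, not MimicScribe's actual request:

```swift
import Foundation

func refineTranscript(_ transcript: String, apiKey: String) async throws -> Data {
    let model = "gemini-flash-placeholder"   // placeholder model id
    let url = URL(string: "https://generativelanguage.googleapis.com/v1beta/models/\(model):generateContent?key=\(apiKey)")!
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")

    let body: [String: Any] = [
        // Static instructions go in the system slot so repeated calls can hit implicit caching.
        "system_instruction": ["parts": [["text": "Fix grammar, remove filler words, keep meaning."]]],
        // Only transcript text goes over the wire; audio stays on-device.
        "contents": [["role": "user", "parts": [["text": transcript]]]]
    ]
    request.httpBody = try JSONSerialization.data(withJSONObject: body)

    let (data, _) = try await URLSession.shared.data(for: request)
    return data
}
```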
Local + Cloud: How It Works
Audio capture, echo cancellation, speech recognition, and speaker diarization all run on your Mac — audio never leaves the device. Text intelligence (grammar correction, speaker attribution, meeting summaries, action items, and real-time suggestions) is handled by a cloud LLM that only receives transcript text.
What I tested
I evaluated Apple Intelligence (on-device dictation and summarization), Qwen 3, and other local models for the text processing pipeline. The results fell short in three areas:
- Speaker attribution. Assigning dialogue to the correct speaker in a meeting transcript requires holding long context and reasoning about conversational flow — who's responding to whom, topic continuity, and turn-taking patterns. Local models produced unreliable speaker labels, especially in meetings with more than two participants. The cloud LLM achieves 97% speaker accuracy on real conference calls — see the benchmark.
- Action items and meeting intelligence. Extracting action items, generating summaries, and producing real-time suggestions during a meeting all require the model to reason over the full conversation context. These tasks demand larger models than what can run locally with acceptable latency.
- Technical jargon and proper nouns. Domain-specific terms, product names, and acronyms need contextual correction that smaller models hallucinate or skip entirely.
What this means for privacy
Only transcript text is sent to the cloud — never audio, never recordings. The LLM sees the same thing you'd see if you copied the transcript into a text file: words on a page, with no voice data, microphone input, or system audio attached. Raw transcription always works fully offline.
For meetings where transcript text must not leave your device, Local Mode disables all cloud processing. You get on-device transcription with speaker separation — no summaries, no speaker names, no action items. If you change your mind, you can process the meeting with AI afterward.
I continue to evaluate local models as they improve. The goal is to move more processing on-device as quality catches up — but not at the cost of accuracy.
Benchmarks
Speaker Diarization
Speaker attribution accuracy across 52 audio files from 4 public corpora — earnings calls, panel discussions, and oral arguments. Compares on-device diarization alone vs. the full pipeline with LLM speaker attribution.
Meeting Assistant
Real-time briefing quality across 96 scenarios — sales discovery, customer success, standups, interviews, and long meetings. Evaluates action items, hallucination resistance, question detection, and interpersonal awareness using dual LLM judges.
Context Retrieval
Reference document RAG pipeline across 15 document types — CRM records, scraped webpages, SEC filings, PDFs, tracked-changes contracts, strategic plans, competitive intel, and messy meeting notes. Tests whether relevant chunks are retrieved accurately during live meetings, even from noisy real-world sources.
Meeting Search
Hybrid semantic and full-text search across 26 meetings with 138 queries. Tests recall and ranking accuracy for finding past meetings by topic, action item, or conversational phrase.
Cost Comparison
Side-by-side pricing against pasting raw transcripts into Claude, GPT-5, Gemini, and DeepSeek — single-meeting summaries, multi-step Q&A, and cross-meeting search. Quality judged on the same rubric.
Open Source Dependencies
FluidAudio (MIT)
ASR inference and speaker diarization. Wraps NVIDIA Parakeet TDT models, pyannote segmentation, and WeSpeaker v2 embeddings compiled for CoreML.
dtln-aec-coreml (MIT)
DTLN echo cancellation network compiled for CoreML. Removes system audio loopback from microphone input.
swift-bnns-graph (MIT)
Swift wrapper for Apple's BNNS graph API (BNNSGraphCompileFromFile). Used to run decoder and joint models with direct memory access instead of CoreML's MLMultiArray overhead.
GRDB.swift (MIT)
SQLite toolkit. Stores transcriptions, meetings, speaker profiles, and vector embeddings locally.
Sparkle (MIT)
macOS software update framework. Handles delta updates and EdDSA signature verification.
swift-embeddings (Apache 2.0)
On-device text embedding using all-MiniLM-L6-v2. Powers semantic search over meeting transcripts and reference document retrieval for meeting context.
swift-collections (Apache 2.0)
Data structure extensions for the Swift standard library (OrderedDictionary, Deque, etc.).