Technology
MimicScribe runs speech recognition on-device using Apple Silicon.
Audio Pipeline
Audio Capture
Microphone input is captured via Audio Queue Services (AudioToolbox).
System audio for meeting recording uses Core Audio process taps (CATapDescription + AudioHardwareCreateProcessTap), which capture at the OS level without injecting into individual applications. Both paths produce 16 kHz mono Float32 PCM.
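Normalizing both capture paths to 16 kHz mono Float32 amounts to downmixing and resampling. A minimal, language-agnostic sketch (Python; function names are illustrative, and a real pipeline would use a properly filtered resampler rather than linear interpolation):

```python
def downmix_to_mono(frames):
    """Average interleaved stereo [L, R, L, R, ...] into mono.
    Assumes an even number of samples (complete frames)."""
    return [(frames[i] + frames[i + 1]) / 2.0 for i in range(0, len(frames), 2)]

def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler, e.g. 48 kHz -> 16 kHz.
    Production code would low-pass filter first to avoid aliasing."""
    if src_rate == dst_rate:
        return list(samples)
    ratio = src_rate / dst_rate
    out_len = int(len(samples) * dst_rate // src_rate)
    out = []
    for n in range(out_len):
        pos = n * ratio
        i = int(pos)
        frac = pos - i
        nxt = samples[i + 1] if i + 1 < len(samples) else samples[i]
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
    return out
```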
Echo Cancellation
During meeting recording, the microphone picks up both the local speaker and remote participants' audio from the speakers. A DTLN (Dual-signal Transformation LSTM Network) model runs on CoreML to remove the system audio loopback from the microphone signal before it reaches the ASR model, using the system audio stream as a reference. No AGC or noise suppression is applied — these degrade ASR accuracy.
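DTLN itself is a neural model, but the reference-based cancellation it performs can be illustrated with the classical baseline it improves on: an adaptive filter (here NLMS) that estimates how much of the system-audio reference leaks into the microphone and subtracts it. This is a conceptual sketch, not MimicScribe's implementation:

```python
def nlms_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Normalized LMS echo canceller: adaptively estimate the loopback of
    `ref` (system audio) present in `mic` and subtract it, sample by sample."""
    w = [0.0] * taps
    out = []
    for n in range(len(mic)):
        # Most recent `taps` reference samples, newest first.
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_est = sum(wi * xi for wi, xi in zip(w, x))
        e = mic[n] - echo_est  # residual = cleaned microphone sample
        norm = sum(xi * xi for xi in x) + eps
        # Gradient step, normalized by reference energy for stable adaptation.
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        out.append(e)
    return out
```

A neural canceller like DTLN handles nonlinearities (speaker distortion, clock drift) that a linear filter like this cannot.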
Speech Recognition
Transcription uses NVIDIA's Parakeet TDT 0.6B model, converted to CoreML format with a Token-and-Duration Transducer (TDT) architecture and FastConformer encoder. The model supports 25+ languages and all inference runs locally on Apple Silicon. During meeting recording, dual parallel ASR workers process the microphone and system audio streams independently.
For streaming transcription, a custom harness processes audio in overlapping windows and merges results across window boundaries to maintain accuracy at chunk edges. The decoder, joint network, and preprocessor run through an accelerated BNNS backend, bypassing CoreML's per-inference overhead for lower latency.
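The windowing-and-merge idea can be sketched as follows: slice audio into overlapping windows, then stitch the per-window token sequences by dropping the duplicated run at each boundary. All names and the overlap heuristic are illustrative, not the actual harness:

```python
def window_spans(total, window, hop):
    """Start/end sample indices of overlapping windows covering `total` samples
    (overlap = window - hop)."""
    spans, start = [], 0
    while start < total:
        spans.append((start, min(start + window, total)))
        if start + window >= total:
            break
        start += hop
    return spans

def merge_tokens(prev, nxt, max_overlap=10):
    """Merge two windows' token lists at the boundary: find the longest
    suffix of `prev` that equals a prefix of `nxt` and emit it only once."""
    for k in range(min(max_overlap, len(prev), len(nxt)), 0, -1):
        if prev[-k:] == nxt[:k]:
            return prev + nxt[k:]
    return prev + nxt  # no overlap found; concatenate as-is
```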
Speaker Diarization
After a meeting ends, an offline diarization pipeline runs in the background while the live transcript is already displayed. A pyannote-based segmentation model identifies speaker turn boundaries, then a WeSpeaker v2 embedding model extracts 256-dimensional voice embeddings for each segment. Clustering uses agglomerative hierarchical clustering (AHC) with centroid linkage, refined by a Variational Bayes HMM (VBx) pass over PLDA-whitened features.
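The clustering stage can be sketched as centroid-linkage AHC over cosine distance (the VBx refinement pass is omitted here; threshold and names are illustrative):

```python
import math

def cosine_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def ahc_centroid(embeddings, threshold=0.5):
    """Agglomerative hierarchical clustering with centroid linkage:
    repeatedly merge the two closest centroids until none are within
    `threshold`; each final cluster is one speaker."""
    clusters = [[i] for i in range(len(embeddings))]
    centroids = [list(e) for e in embeddings]
    while len(clusters) > 1:
        best, bi, bj = None, -1, -1
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                d = cosine_dist(centroids[i], centroids[j])
                if best is None or d < best:
                    best, bi, bj = d, i, j
        if best > threshold:
            break  # remaining clusters are distinct speakers
        clusters[bi] += clusters.pop(bj)
        merged = clusters[bi]
        dim = len(embeddings[0])
        # Centroid = mean of all member embeddings in the merged cluster.
        centroids[bi] = [sum(embeddings[m][k] for m in merged) / len(merged)
                         for k in range(dim)]
        centroids.pop(bj)
    return clusters
```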
Clustered embeddings are matched against stored speaker profiles using cosine similarity. When a confident match is found, the speaker is identified automatically. Ambiguous cases are resolved by an LLM disambiguation step using conversation context. Profiles accumulate samples across meetings, improving recognition over time.
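The matching step reduces to a nearest-profile search with two thresholds: confident matches are accepted, a middle band is deferred to the LLM disambiguator, and low scores mean an unknown speaker. Threshold values here are illustrative:

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def match_speaker(embedding, profiles, confident=0.7, floor=0.5):
    """Compare a cluster's embedding against stored speaker profiles.
    Returns (name, status): 'match' above `confident`, 'ambiguous' in the
    middle band (handed to LLM disambiguation), 'unknown' below `floor`."""
    best_name, best_sim = None, -1.0
    for name, centroid in profiles.items():
        s = cosine_sim(embedding, centroid)
        if s > best_sim:
            best_name, best_sim = name, s
    if best_sim >= confident:
        return best_name, "match"
    if best_sim >= floor:
        return best_name, "ambiguous"
    return None, "unknown"
```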
LLM Post-Processing
After transcription, text is sent to Google's Gemini 3 Flash for refinement: grammar correction, filler word removal, and tone adjustment. Meeting transcripts are additionally processed for speaker attribution, summarization, and action item extraction. Dictation uses Gemini 3.1 Flash Lite for lower latency. Static prompt content is passed as system instructions to take advantage of Gemini's implicit caching, reducing latency on repeated calls.
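The system-instruction split can be sketched as a request-body builder. Field names follow the shape of the Gemini REST generateContent body; this is an assumed sketch of the pattern, not MimicScribe's actual client code:

```python
def build_refinement_request(transcript, style_rules):
    """Assemble a generateContent-style request body. The static prompt goes
    in system_instruction, so the unchanging prefix can be served from the
    provider's implicit cache; only `contents` varies per call."""
    return {
        "system_instruction": {
            "parts": [{"text": style_rules}]  # static across all calls
        },
        "contents": [
            {"role": "user", "parts": [{"text": transcript}]}  # per-call text
        ],
    }
```

Keeping the variable transcript out of the system instruction maximizes the cached prefix length and thus the latency savings on repeated calls.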
Only text is sent to the API — audio remains on-device. This step requires an internet connection; raw transcription works fully offline.
Open Source Dependencies
FluidAudio (MIT)
ASR inference and speaker diarization. Wraps NVIDIA Parakeet TDT models, pyannote segmentation, and WeSpeaker v2 embeddings compiled for CoreML.

dtln-aec-coreml (MIT)
DTLN echo cancellation network compiled for CoreML. Removes system audio loopback from microphone input.

swift-bnns-graph (MIT)
Swift wrapper for Apple's BNNS graph API (BNNSGraphCompileFromFile). Used to run decoder and joint models with direct memory access instead of CoreML's MLMultiArray overhead.

GRDB.swift (MIT)
SQLite toolkit. Stores transcriptions, meetings, speaker profiles, and vector embeddings locally.

Sparkle (MIT)
macOS software update framework. Handles delta updates and EdDSA signature verification.

swift-collections (Apache 2.0)
Data structure extensions for the Swift standard library (OrderedDictionary, Deque, etc.).