Technology

MimicScribe runs speech recognition on-device using Apple Silicon.

Audio Pipeline

Audio Capture

Microphone input is captured via Audio Queue Services (AudioToolbox). System audio for meeting recording uses Core Audio process taps (CATapDescription + AudioHardwareCreateProcessTap), which capture at the OS level without injecting into individual applications. Both paths produce 16 kHz mono Float32 PCM.
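The shared output format can be sketched in pure Swift. This is an illustrative helper, not the app's actual capture code, which configures formats through AudioToolbox; the channel-averaging downmix and integer-factor decimation shown here are assumptions about how a 48 kHz stereo stream could be folded down to 16 kHz mono Float32.

```swift
import Foundation

// Hypothetical helper: fold interleaved stereo Float32 PCM to mono
// by averaging each left/right sample pair.
func downmixToMono(_ interleaved: [Float]) -> [Float] {
    var mono = [Float]()
    mono.reserveCapacity(interleaved.count / 2)
    var i = 0
    while i + 1 < interleaved.count {
        mono.append((interleaved[i] + interleaved[i + 1]) / 2)
        i += 2
    }
    return mono
}

// Naive integer-factor decimation, e.g. 48 kHz -> 16 kHz (factor 3).
// A production resampler would low-pass filter first to avoid aliasing.
func decimate(_ samples: [Float], factor: Int) -> [Float] {
    stride(from: 0, to: samples.count, by: factor).map { samples[$0] }
}
```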

Echo Cancellation

During meeting recording, the microphone picks up both the local speaker and remote participants' audio from the speakers. A DTLN (Dual-signal Transformation LSTM Network) model runs on CoreML to remove the system audio loopback from the microphone signal before it reaches the ASR model, using the system audio stream as a reference. No AGC or noise suppression is applied — these degrade ASR accuracy.
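An echo canceller needs the near-end (mic) and far-end (reference) signals lined up sample-for-sample before inference. The sketch below shows only that pairing step, assuming both streams carry sample-frame timestamps at 16 kHz; the types and function names are illustrative, and the actual DTLN model I/O is handled by the CoreML wrapper.

```swift
import Foundation

// Illustrative timestamped audio chunk (sample position + samples).
struct TimedChunk {
    let startSample: Int
    let samples: [Float]
}

// Trim the system-audio reference so it covers exactly the same sample
// range as the mic chunk, which the echo canceller needs in order to
// line up the loopback signal with the microphone signal.
func alignedReference(for mic: TimedChunk, in reference: TimedChunk) -> [Float]? {
    let offset = mic.startSample - reference.startSample
    guard offset >= 0, offset + mic.samples.count <= reference.samples.count else {
        return nil // reference does not fully cover the mic chunk
    }
    return Array(reference.samples[offset..<offset + mic.samples.count])
}
```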

Speech Recognition

Transcription uses NVIDIA's Parakeet TDT 0.6B model, a Token-and-Duration Transducer (TDT) with a FastConformer encoder, converted to CoreML format. The model supports 25+ languages, and all inference runs locally on Apple Silicon. During meeting recording, two parallel ASR workers process the microphone and system audio streams independently.

For streaming transcription, a custom harness processes audio in overlapping windows and merges results across window boundaries to maintain accuracy at chunk edges. The decoder, joint network, and preprocessor run through an accelerated BNNS backend, bypassing CoreML's per-inference overhead for lower latency.
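The boundary merge can be illustrated with a simplified token-level version: drop the longest suffix of the previous window's output that reappears as a prefix of the next window's. This is a sketch of the idea only; the real harness has more context to work with, such as token timestamps.

```swift
import Foundation

// Merge transcripts from two overlapping windows by removing the
// duplicated region at the boundary: the longest run of tokens that
// ends the previous window and begins the next one.
func mergeWindows(_ previous: [String], _ next: [String]) -> [String] {
    let maxOverlap = min(previous.count, next.count)
    for k in stride(from: maxOverlap, to: 0, by: -1) {
        if Array(previous.suffix(k)) == Array(next.prefix(k)) {
            return previous + next.dropFirst(k)
        }
    }
    return previous + next // no overlap found; concatenate as-is
}
```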

Speaker Diarization

After a meeting ends, an offline diarization pipeline runs in the background while the live transcript is already displayed. A pyannote-based segmentation model identifies speaker turn boundaries, then a WeSpeaker v2 embedding model extracts 256-dimensional voice embeddings for each segment. Clustering uses agglomerative hierarchical clustering (AHC) with centroid linkage, refined by a Variational Bayes HMM (VBx) pass over PLDA-whitened features.
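The AHC step can be sketched as repeated merging of the closest centroid pair until no pair falls under a cosine-distance threshold. This toy version omits the PLDA whitening and VBx refinement; the threshold and dimensionality here are arbitrary illustration values.

```swift
import Foundation

// Cosine distance between two embeddings (1 - cosine similarity).
func cosineDistance(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).map(*).reduce(0, +)
    let na = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let nb = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return 1 - dot / (na * nb)
}

// Agglomerative clustering with centroid linkage: merge the two closest
// clusters until the closest pair is farther apart than `threshold`.
// Returns groups of segment indices.
func cluster(_ embeddings: [[Double]], threshold: Double) -> [[Int]] {
    var clusters = embeddings.indices.map { [$0] }
    var centroids = embeddings
    while clusters.count > 1 {
        // Find the closest pair of centroids.
        var best = (i: 0, j: 1, d: Double.infinity)
        for i in 0..<centroids.count {
            for j in (i + 1)..<centroids.count {
                let d = cosineDistance(centroids[i], centroids[j])
                if d < best.d { best = (i, j, d) }
            }
        }
        guard best.d < threshold else { break } // no pair close enough
        // Merge j into i and recompute the centroid as the member mean.
        clusters[best.i] += clusters[best.j]
        clusters.remove(at: best.j)
        centroids.remove(at: best.j)
        let members = clusters[best.i].map { embeddings[$0] }
        centroids[best.i] = (0..<members[0].count).map { d in
            members.map { $0[d] }.reduce(0, +) / Double(members.count)
        }
    }
    return clusters
}
```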

Clustered embeddings are matched against stored speaker profiles using cosine similarity. When a confident match is found, the speaker is identified automatically. Ambiguous cases are resolved by an LLM disambiguation step using conversation context. Profiles accumulate samples across meetings, improving recognition over time.
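The profile-matching step amounts to a nearest-neighbor lookup with a confidence cutoff. In this sketch the types, the 0.8 threshold, and the `ambiguous` result (which in the app is handed to the LLM disambiguation step) are all illustrative assumptions.

```swift
import Foundation

// Illustrative stored speaker profile.
struct SpeakerProfile {
    let name: String
    let embedding: [Double]
}

enum MatchResult: Equatable {
    case identified(String)
    case ambiguous // below the confidence cutoff
}

func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).map(*).reduce(0, +)
    let na = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let nb = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return dot / (na * nb)
}

// Match a cluster embedding against stored profiles; anything below the
// threshold is treated as ambiguous rather than force-assigned.
func match(_ embedding: [Double], against profiles: [SpeakerProfile],
           threshold: Double = 0.8) -> MatchResult {
    let scored = profiles.map { ($0.name, cosineSimilarity(embedding, $0.embedding)) }
    guard let best = scored.max(by: { $0.1 < $1.1 }), best.1 >= threshold else {
        return .ambiguous
    }
    return .identified(best.0)
}
```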

LLM Post-Processing

After transcription, text is sent to Google's Gemini 3 Flash for refinement: grammar correction, filler word removal, and tone adjustment. Meeting transcripts are additionally processed for speaker attribution, summarization, and action item extraction. Dictation uses Gemini 3.1 Flash Lite for lower latency. Static prompt content is passed as system instructions to take advantage of Gemini's implicit caching, reducing latency on repeated calls.
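The request shape this implies can be sketched as follows, with the static prompt in the `system_instruction` field so the unchanging prefix is eligible for implicit caching. Field names follow Gemini's public `generateContent` REST API; the prompt text is a placeholder, not the app's actual prompt.

```swift
import Foundation

// Build a generateContent request body with the static refinement prompt
// as a system instruction (the cacheable, unchanging part) and the
// per-call transcript as user content.
func refinementRequestBody(transcript: String) -> [String: Any] {
    [
        "system_instruction": [
            "parts": [["text": "Fix grammar and remove filler words. Keep meaning intact."]]
        ],
        "contents": [
            ["role": "user", "parts": [["text": transcript]]]
        ]
    ]
}

let body = refinementRequestBody(transcript: "So, um, I think we should ship it.")
let json = try! JSONSerialization.data(withJSONObject: body)
```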

Only text is sent to the API — audio remains on-device. This step requires an internet connection; raw transcription works fully offline.

Open Source Dependencies

FluidAudio

MIT

ASR inference and speaker diarization. Wraps NVIDIA Parakeet TDT models, pyannote segmentation, and WeSpeaker v2 embeddings compiled for CoreML.

GitHub

dtln-aec-coreml

MIT

DTLN echo cancellation network compiled for CoreML. Removes system audio loopback from microphone input.

GitHub

swift-bnns-graph

MIT

Swift wrapper for Apple's BNNS graph API (BNNSGraphCompileFromFile). Used to run decoder and joint models with direct memory access instead of CoreML's MLMultiArray overhead.

GitHub

GRDB.swift

MIT

SQLite toolkit. Stores transcriptions, meetings, speaker profiles, and vector embeddings locally.

GitHub

Textual

MIT

Attributed string rendering for Markdown content in SwiftUI.

GitHub

Sparkle

MIT

macOS software update framework. Handles delta updates and EdDSA signature verification.

GitHub

swift-collections

Apache 2.0

Data structure extensions for the Swift standard library (OrderedDictionary, Deque, etc.).

GitHub