Automatically format numbers, dates, and phone numbers in transcriptions for cleaner output

Testing

#24 · by Marshall (developer) · Mar 8, 2026 · Upcoming v1.0.0-beta.6

Converts spoken-form ASR output into proper written form, for example: 'twenty three dollars' → '$23', 'march seventh' → 'March 7th', 'one eight hundred five five five twelve twelve' → '1-800-555-1212'. Runs as a lightweight inverse text normalization (ITN) pass after transcription, without relying on LLM refinement.
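To make the idea of a rule-based ITN pass concrete, here is a minimal sketch in Python. It is not the project's implementation and not the API of any library: the names `read_number` and `normalize_currency` are hypothetical, and it only handles cardinals from 1 to 99 that are immediately followed by "dollar" or "dollars". Requiring that currency word is an assumed design choice, used here to avoid converting number words in ambiguous contexts.

```python
# Hypothetical, deliberately tiny ITN pass: spoken cardinals (1-99) + "dollars" -> "$N".
ONES = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
        "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def read_number(tokens, i):
    """Try to read a cardinal 1-99 starting at tokens[i].

    Returns (value, next_index) on success, or None if tokens[i] is not
    a number word. Tokens are expected to be lowercase already.
    """
    word = tokens[i]
    if word in TENS:
        value = TENS[word]
        # "twenty three" -> 23: a tens word may be followed by a ones word.
        if i + 1 < len(tokens) and tokens[i + 1] in ONES:
            return value + ONES[tokens[i + 1]], i + 2
        return value, i + 1
    if word in TEENS:
        return TEENS[word], i + 1
    if word in ONES:
        return ONES[word], i + 1
    return None


def normalize_currency(text):
    """Rewrite '<number words> dollar(s)' as '$N'; leave everything else alone."""
    tokens = text.split()
    lowered = [t.lower() for t in tokens]
    out = []
    i = 0
    while i < len(tokens):
        parsed = read_number(lowered, i)
        if parsed is not None:
            value, j = parsed
            # Only convert when the number is followed by a currency word.
            if j < len(tokens) and lowered[j] in ("dollar", "dollars"):
                out.append(f"${value}")
                i = j + 1
                continue
        # Not a currency phrase: keep the original token untouched.
        out.append(tokens[i])
        i += 1
    return " ".join(out)
```

For example, `normalize_currency("twenty three dollars")` returns `"$23"`, while `normalize_currency("one of the things")` returns the input unchanged because no currency word follows the number. The sketch ignores punctuation attached to tokens (such as "dollars.") and larger numbers, which a real pass would need to handle.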

Comments (1)

Marshall (developer) · Mar 9, 2026, 7:59 PM

Investigated integrating the text-processing-rs Rust ITN library. Currency ($22.50, $12), dates (january 5 2026), ordinals (21st, 3rd), decimals (2.5, 3.11), and percentages (75%) all normalize correctly.

However, several failure cases make it unsafe for production:

  • Data corruption: "twenty one to seventeen" → "16:39" (score misread as time)
  • Data corruption: "four zero four" → "8" (digit-spelling collapsed to arithmetic)
  • Over-eager: "one of the things" → "1 of the things" (discourse "one" always converted)
  • Dropped words: "three people and twenty five chairs" → "3 people 25 chairs" (conjunction "and" deleted)
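The four cases above can be kept as a small regression table, so that any future normalizer (or a newer release of the upstream library) can be checked against them before it is re-enabled. The sketch below is an assumption about how such a check might look: `find_regressions` and `KNOWN_BAD_OUTPUTS` are hypothetical names, and `normalize` stands for any callable that maps a string to a string, not for a specific function in text-processing-rs. It only records the bad outputs observed above and does not assert what the correct output should be.

```python
from typing import Callable, Dict, List

# Inputs and the incorrect outputs reported in this comment.
KNOWN_BAD_OUTPUTS: Dict[str, str] = {
    "twenty one to seventeen": "16:39",                          # score misread as time
    "four zero four": "8",                                       # digits collapsed to arithmetic
    "one of the things": "1 of the things",                      # discourse "one" converted
    "three people and twenty five chairs": "3 people 25 chairs", # "and" deleted
}


def find_regressions(normalize: Callable[[str], str]) -> List[str]:
    """Return the inputs for which `normalize` still produces the known-bad output."""
    return [
        text
        for text, bad_output in KNOWN_BAD_OUTPUTS.items()
        if normalize(text) == bad_output
    ]
```

A normalizer that leaves text untouched passes trivially: `find_regressions(lambda s: s)` returns an empty list. A non-empty result names the inputs that still reproduce the failures listed here.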

Removing the integration until the upstream library addresses these cases. Partial normalization (some numbers converted, others not) also risks confusing downstream LLM refinement.
