Timestamp-correct, diarized meeting transcription. Point it at a video or audio file and get back a transcript where every speaker turn lands on the right millisecond — labelled with the speaker's name (when it can be inferred) and their tone — as SRT, Markdown, and JSON.
📝 Read the full write-up — "offmute-v2: GLM vs Opus" — coming soon at southbridge.ai/blog/offmute-v2-glm-vs-opus
It's the successor to offmute: the same great diarized, tone-aware transcripts — now with real timestamps, SRT output, a resumable pipeline, and a browser build. No single model is good at everything, so offmute-v2 uses each for what it's best at and fuses them.
export GEMINI_API_KEY=... # multimodal understanding (or GOOGLE_API_KEY)
export ASSEMBLYAI_API_KEY=... # word-level timing
npx offmute-v2 meeting.mp4 --instructions "Panel of founders; label them by name."
# → meeting.md (written right next to meeting.mp4)

That's it. `bunx offmute-v2 meeting.mp4` works too.
A clean, sub-second-aligned, speaker-labelled transcript with tone. By default offmute-v2 writes a
single meeting.md next to your input file — add --formats srt,md,json (and/or -o <dir>) for
SubRip subtitles and JSON. The same content, three ways:
meeting.srt (--formats srt — drop straight onto the video):
1
00:00:00,160 --> 00:00:00,720
Speaker D: GPU

2
00:00:01,199 --> 00:00:05,440
Presenter: And I'm inspired. I think I'm going to apply to NTU this fall. (confident, joking)

meeting.md (the default: skimmable, grouped by speaker, with talk-time):
_Duration: 1914s · Speakers: 5_
## Speakers
- **Presenter** (1461s)
- **Speaker B** (97s) · **Speaker C** (87s) · ...
## Transcript
[00:00] **Speaker D**: GPU
[00:01] **Presenter** _(confident, joking)_: And I'm inspired. I think I'm going to apply to NTU this fall.

meeting.json (`--formats json`): every segment with start/end, speaker, text,
tone[], timingSource, confidence, and word-level timings, for downstream tooling.
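To make the relationship between the JSON segments and the SRT cues concrete, here is a minimal sketch that renders segments as SRT. The `Segment` interface is a hypothetical subset of the fields listed above, and it assumes `start`/`end` are in seconds; the real JSON may use a different unit or carry more fields. `srtTime` and `toSrt` are illustrative names, not part of the package's API.

```typescript
// Hypothetical subset of a segment, inferred from the field list above.
// Assumption: start/end are seconds (the real output may differ).
interface Segment {
  start: number;
  end: number;
  speaker: string;
  text: string;
  tone: string[];
}

// Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n: number, width: number) => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Render segments as numbered SRT cues, with tone in trailing parentheses.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) => {
      const tone = seg.tone.length > 0 ? ` (${seg.tone.join(", ")})` : "";
      const timing = `${srtTime(seg.start)} --> ${srtTime(seg.end)}`;
      return `${i + 1}\n${timing}\n${seg.speaker}: ${seg.text}${tone}`;
    })
    .join("\n\n");
}
```

With this sketch, `srtTime(0.16)` yields `00:00:00,160`, matching the first cue in the SRT sample above.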
LLMs are brilliant at understanding speech (who's talking, through interruptions, in a crowd, with tone) but terrible at timestamps. ASR models are the opposite: sub-second timing, but mediocre diarization and no sense of tone. offmute-v2 runs both and marries them:
| Job | Tool | Why |
|---|---|---|
| WHO / HOW — speakers, names, tone, hard audio | multimodal LLM (Gemini) | infers names from context, hears tone, handles crowds & interruptions |
| WHEN — word-level timestamps | ASR (AssemblyAI / Groq Whisper) | sub-second accurate; LLM timestamps drift minutes over a long file |
| fuse | edit-distance alignment | transfers the ASR's clock onto the LLM's richer words |
flowchart LR
IN["🎬 audio / video"] --> PRE["preprocess<br/>16kHz mono + keyframes"]
PRE --> LLM["multimodal LLM<br/>diarize · transcribe · tone<br/><i>(per chunk)</i>"]
PRE --> ASR["ASR<br/>word-level timestamps<br/><i>(whole file)</i>"]
LLM --> AL["align<br/>edit-distance fuse"]
ASR --> AL
AL --> GF["gap-fill<br/>recover dropped speech"]
GF --> CON["consistency<br/>stable speakers"]
CON --> ID["identify names<br/><i>(optional, level 3)</i>"]
ID --> FIN["finalize →<br/>SRT · MD · JSON"]
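The gap-fill step in the flow above can be pictured with a small sketch: find ASR words that no LLM segment covers and group them into fallback segments. Everything here is an assumption made for illustration (the `AsrWord`, `Span`, and `Fallback` shapes, times in seconds, the midpoint coverage test, and the `maxPause` grouping threshold); the project's own stage may decide coverage differently.

```typescript
// Hypothetical shapes; times are assumed to be in seconds.
interface AsrWord { text: string; start: number; end: number; }
interface Span { start: number; end: number; }
interface Fallback extends Span { text: string; }

// Collect ASR words whose midpoint lies outside every LLM segment, and
// merge consecutive uncovered words (pause <= maxPause) into one fallback.
function gapFill(words: AsrWord[], segments: Span[], maxPause = 1.0): Fallback[] {
  const out: Fallback[] = [];
  let prevWasGap = false; // was the previous ASR word also uncovered?
  for (const w of words) {
    const mid = (w.start + w.end) / 2;
    const isCovered = segments.some((s) => mid >= s.start && mid <= s.end);
    if (isCovered) {
      prevWasGap = false;
      continue;
    }
    const last = out[out.length - 1];
    if (prevWasGap && last !== undefined && w.start - last.end <= maxPause) {
      // Extend the current fallback segment.
      last.text += ` ${w.text}`;
      last.end = w.end;
    } else {
      // Start a new fallback segment.
      out.push({ start: w.start, end: w.end, text: w.text });
    }
    prevWasGap = true;
  }
  return out;
}
```

Tracking `prevWasGap` keeps two uncovered words from being merged across a covered word that sits between them.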
Measured on a hand-checked 32-minute talk (founder presentation + audience Q&A): ≈8% word
error rate and ~99% word-level speaker attribution, with turn boundaries riding AssemblyAI's
word timing (first cue at 00:00:00,160, exactly matching ground truth). Full methodology, runs,
and an independent re-score live in docs/ and the article.
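For readers unfamiliar with the metric, word error rate is conventionally (substitutions + deletions + insertions) divided by the number of reference words, computed from a word-level edit distance. The sketch below shows that standard definition; it is not the project's scoring script, and the ASCII-only `normalize` step is an assumption (the methodology in docs/ may normalize differently).

```typescript
// Lowercase, drop punctuation, and split on whitespace.
// Assumption: ASCII-only normalization, chosen to keep the sketch short.
function normalize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9'\s]/g, " ")
    .split(/\s+/)
    .filter(Boolean);
}

// WER = (substitutions + deletions + insertions) / reference word count,
// where the numerator is the word-level Levenshtein distance.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = normalize(reference);
  const hyp = normalize(hypothesis);
  if (ref.length === 0) {
    throw new RangeError("reference must contain at least one word");
  }
  // prev[j] = distance between the first i-1 ref words and first j hyp words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr[j] = Math.min(substitution, deletion, insertion);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

For example, scoring "the quick fox" against the reference "the quick brown fox" gives one deletion over four reference words, a WER of 0.25.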
- 🎯 Timestamp-correct: word-level timing from ASR, fused onto LLM text by alignment.
- 🎭 Diarization + tone: separates speakers through interruptions; annotates `(laughing)`, `(hesitant)`, `(confident)`, …
- 🧑🤝🧑 Three levels of speaker labelling: separation → stable `Speaker A/B` → real names inferred from context (`--level 3`).
- 🎬 Video-aware: samples keyframes for visual context (who's on screen, demos, slides).
- ⚡ Chunked + concurrent: long files are split with overlap and stitched back with ownership-partition dedup (no double-printed sentences at chunk seams).
- ♻️ Resumable & stoppable: every stage caches to disk (in the OS temp dir by default; the path is printed each run, and `-i <dir>` keeps it wherever you like). Re-runs skip finished work, and Ctrl-C leaves partials. The cache is keyed on the input and the config, so changing `--model` never serves you a stale transcript.
- 🔀 Pluggable providers: Gemini for the LLM; AssemblyAI or Groq Whisper for timing.
- 🌐 Runs in the browser: a pure, node-free core + `fetch` providers + ffmpeg.wasm.
- 🔎 Fully inspectable: every LLM prompt+response is logged to `llm-calls.jsonl`; all intermediates are plain JSON.
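The "keyed on the input and the config" property from the list above can be illustrated with a small sketch. It is an assumption about the general technique, not the project's cache code: `RunConfig` is a hypothetical subset of the options, `inputFingerprint` stands in for whatever identifies the input file, and the 32-bit FNV-1a hash is used only to keep the example dependency-free (a real cache would more likely use a cryptographic hash).

```typescript
// Hypothetical subset of the run configuration.
type RunConfig = {
  model: string;
  level: number;
  chunkSeconds: number;
  overlapSeconds: number;
};

// Serialize with sorted keys so property order cannot change the key.
function stableStringify(value: Record<string, string | number>): string {
  const entries = Object.keys(value)
    .sort()
    .map((key) => [key, value[key]]);
  return JSON.stringify(entries);
}

// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex digits.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Any change to the input fingerprint or to a config value changes the
// string being hashed, so a stale cache entry is not looked up.
function cacheKey(inputFingerprint: string, config: RunConfig): string {
  return fnv1a(`${inputFingerprint}|${stableStringify(config)}`);
}
```

Because the config is part of the hashed string, switching `model` produces a different key and the old intermediates are simply never consulted.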
npx offmute-v2 <input> [options]

| Option | Default | Description |
|---|---|---|
| `-o, --output <dir>` | next to input | where to write the transcript(s); default is the input file's own folder |
| `--instructions <text>` | – | guide diarization / labelling, e.g. "host is Alice; group callers as 'Caller'" |
| `--model <name>` | `gemini-3.1-pro-preview` | multimodal LLM; use `gemini-2.5-flash` / `gemini-flash-latest` for faster/cheaper runs |
| `--level <1\|2\|3>` | `2` | 1 = separation · 2 = stable anon · 3 = identify names |
| `--timestamped <p>` | `assemblyai` | timing provider: `assemblyai` · `whisper-groq` · `none` |
| `--reasoner <name>` | `deepseek-chat` | text model for the name-identification pass (level 3) |
| `--chunk-seconds <n>` | `600` | chunk length · `--overlap-seconds` (default 60) |
| `--formats <list>` | `md` | which outputs to write (`srt`, `md`, `json`) |
| `--passes <list>` | all | run/resume a subset of stages |
| `--force` | – | ignore caches and recompute |
| `--only-chunk <n>` | – | process a single chunk (debugging) |
| `-i, --intermediates <dir>` | auto | cache dir (auto-derived per input) |
npx offmute-v2 meeting.mp4 --level 3 # name the speakers
npx offmute-v2 meeting.mp4 --model gemini-2.5-pro # higher quality
npx offmute-v2 talk.mov --timestamped whisper-groq # free/fast timing, no AssemblyAI
npx offmute-v2 meeting.mp4 --passes align,consistency,finalize # resume from cache

The primary build's `transcribe` function takes a single options object. Only `input` is required: `outputDir` defaults to
the input file's folder and `formats` defaults to `["md"]`:
import { transcribe } from "offmute-v2";
const { segments, speakers, metadata } = await transcribe({
input: "meeting.mp4",
// outputDir: "./out", // optional — defaults next to the input
model: "gemini-flash-latest",
level: 3,
instructions: "Three-person panel; label by name.",
formats: ["srt", "md", "json"], // optional — defaults to ["md"]
apiKeys: { gemini: "...", assemblyai: "..." }, // optional; falls back to env
});
console.log(segments[0]); // { start, end, speaker, text, tone, timingSource, ... }

Individual stages (`alignSegments`, `assignGlobalSpeakers`, `finalizeSegments`, formatters, …) are
exported too, so you can build your own pipeline.
The fusion core (align / consistency / identify / finalize / format) is pure TypeScript with zero
node-only imports, so it bundles tiny and runs in the browser via offmute-v2/browser — using
`fetch`-based providers and ffmpeg.wasm for in-browser audio extraction. See
docs/ and the browser example for the integration seam.
A multi-stage pipeline where each stage persists a JSON intermediate (so it's resumable and debuggable):
- preprocess: ffmpeg → compact mono 16 kHz mp3 + scene-aware keyframes (a 9.6 GB `.mov` becomes a few MB of audio in seconds; transparent for speech, but a fraction of lossless size).
- describe: a quick multimodal pass builds a meeting summary + speaker roster to prime transcription.
- llm-transcribe: each chunk goes to the LLM for verbatim text, diarization, and tone. Output is plain text with coarse `mm:ss` markers, parsed leniently (truncated text degrades gracefully where truncated JSON would be unrecoverable, a lesson from ipgu).
- timestamped: the whole file goes to ASR for word-level timing + a speaker backbone.
- align: the heart of the pipeline. The LLM's token stream is aligned against the ASR word stream with a Needleman–Wunsch edit-distance DP, transferring accurate word times onto the richer LLM text.
- gap-fill: anywhere ASR heard speech that no LLM segment covers, an ASR fallback is inserted so nothing is dropped.
- consistency: the ASR voice clusters act as a global backbone, merging the LLM's per-chunk labels into stable speakers (and fixing ASR over-splits).
- identify (level 3): a reasoning model maps `Speaker A` → real names using context + a voice-cluster hint.
- finalize: overlap fixes, clamping, readable subtitle-sized blocks, and SRT/MD/JSON.
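The align stage described above can be sketched as a textbook Needleman–Wunsch global alignment over words. This is a minimal illustration under stated assumptions, not the project's `alignSegments` implementation: the `AsrWord` and `TimedToken` shapes, the unit gap and mismatch costs, the ASCII normalization, and the choice to transfer times on substitutions as well as exact matches are all hypothetical.

```typescript
// Hypothetical shapes; times are assumed to be in seconds.
interface AsrWord { text: string; start: number; end: number; }
interface TimedToken { text: string; start: number | null; end: number | null; }

// Compare words case-insensitively and without punctuation (ASCII only).
const norm = (w: string): string => w.toLowerCase().replace(/[^a-z0-9']/g, "");

function alignTokens(llm: string[], asr: AsrWord[]): TimedToken[] {
  const n = llm.length;
  const m = asr.length;
  const GAP = 1;      // cost of a token present on only one side
  const MISMATCH = 1; // cost of pairing two different words

  // cost[i][j] = minimum cost of aligning llm[0..i) with asr[0..j).
  const cost: number[][] = Array.from({ length: n + 1 }, (_, i) =>
    Array.from({ length: m + 1 }, (_, j) => (i === 0 ? j * GAP : j === 0 ? i * GAP : 0)),
  );
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const same = norm(llm[i - 1]) === norm(asr[j - 1].text);
      const pair = cost[i - 1][j - 1] + (same ? 0 : MISMATCH);
      cost[i][j] = Math.min(pair, cost[i - 1][j] + GAP, cost[i][j - 1] + GAP);
    }
  }

  // Trace back from the end, copying ASR times onto paired LLM tokens.
  const out: TimedToken[] = llm.map((text) => ({ text, start: null, end: null }));
  let i = n;
  let j = m;
  while (i > 0 && j > 0) {
    const same = norm(llm[i - 1]) === norm(asr[j - 1].text);
    if (cost[i][j] === cost[i - 1][j - 1] + (same ? 0 : MISMATCH)) {
      out[i - 1].start = asr[j - 1].start; // paired: transfer the ASR clock
      out[i - 1].end = asr[j - 1].end;
      i--;
      j--;
    } else if (cost[i][j] === cost[i - 1][j] + GAP) {
      i--; // LLM-only token: stays untimed in this sketch
    } else {
      j--; // ASR-only word (for example a filler the LLM dropped)
    }
  }
  return out;
}
```

Because the whole sequence is aligned in a single pass, a repeated word such as "it" is paired with the occurrence that minimizes the total cost rather than the first textual match. Tokens left with `null` times would need a follow-up step (for instance interpolation between timed neighbours), which this sketch omits.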
Note
The hard parts of this problem are chunk overlap and alignment. offmute-v2 partitions overlap by ownership (each word is emitted by exactly one chunk) so sentences are never double-printed at seams, and aligns the whole chunk in one DP pass so common words like "it" can't mis-match to a later occurrence.
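One way to get the "each word is emitted by exactly one chunk" property is to give every chunk an owned time range and split each overlap between its two neighbours. The sketch below assumes chunks start every `chunkSeconds - overlapSeconds` and that ownership changes hands at the midpoint of each overlap; both are illustrative choices, and `planChunks` / `ownedWords` are hypothetical names rather than the project's functions.

```typescript
// A chunk covers [start, end) of the audio but only "owns" [ownStart, ownEnd).
interface Chunk {
  index: number;
  start: number;
  end: number;
  ownStart: number;
  ownEnd: number;
}

function planChunks(duration: number, chunkSeconds = 600, overlapSeconds = 60): Chunk[] {
  if (overlapSeconds >= chunkSeconds) {
    throw new RangeError("overlap must be shorter than the chunk length");
  }
  const step = chunkSeconds - overlapSeconds;
  const chunks: Chunk[] = [];
  for (let start = 0, index = 0; ; start += step, index++) {
    const end = Math.min(start + chunkSeconds, duration);
    chunks.push({ index, start, end, ownStart: start, ownEnd: end });
    if (end >= duration) break;
  }
  // Assumption: split each overlap at its midpoint between the two chunks.
  for (let k = 1; k < chunks.length; k++) {
    const mid = (chunks[k].start + chunks[k - 1].end) / 2;
    chunks[k - 1].ownEnd = mid;
    chunks[k].ownStart = mid;
  }
  return chunks;
}

// Keep only the words whose start time falls in the chunk's owned range,
// so a word heard by two overlapping chunks is emitted by exactly one.
function ownedWords<T extends { start: number }>(chunk: Chunk, words: T[]): T[] {
  return words.filter((w) => w.start >= chunk.ownStart && w.start < chunk.ownEnd);
}
```

With the defaults and a 1914-second file, this plan yields four chunks whose owned ranges meet at 570 s, 1110 s, and 1650 s, so a word at 580 s is heard by the first two chunks but emitted only by the second.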
This repo is also an agent-vs-agent experiment. offmute-v2 was built twice, from one identical prompt, by two different models running in Claude Code — a head-to-head on a hard, AI-resistant build (fusing the ideas of offmute, meeting-diary, and ipgu into one timestamp-accurate diarizer):
| Branch | Built by | npm tag | What it is |
|---|---|---|---|
| `master` | GLM line + post-launch fixes | `offmute-v2@latest` | the daily-driven, hardened build (this README) |
| `glm` | GLM-5.2 | `offmute-v2@glm` | the frozen GLM experiment build |
| `opus` | Claude Opus 4.8 | `offmute-v2@opus` | the frozen Opus experiment build |
npx offmute-v2@glm meeting.mp4 # the GLM build, exactly as submitted
npx offmute-v2@opus meeting.mp4 # the Opus build, exactly as submitted

The `glm` and `opus` branches have independent histories: each is the full, unedited commit
trail of that model's build, every review round, and every fix. The headline finding (spoiler):
once a chunk-overlap dedup bug is accounted for, the two are a near dead-heat on accuracy, and the
differences are in code conventions, error DX, and packaging. The full analysis is in the article.
Everything is open for inspection:
- Per-branch review trails (on `glm` / `opus`): `docs/spec.md`, `docs/review-1/`, `docs/review-2/` (run-throughs, code reads, an independent review, and each model's diagnosed fixes), and the append-only `intermediates/process_log_*.md` dev journals.
- Releasing & npm tags: `RELEASING.md`, published via GitHub Actions npm Trusted Publishing (OIDC), no tokens, with provenance.
- Separation: who speaks when.
- Anonymous-consistent: `Speaker A/B`, stable across the whole file (default).
- Identification: real names inferred from context (needs `DEEPSEEK_API_KEY`, or point `--reasoner` at another provider). Use `--instructions` to steer (e.g. "everyone except the host is 'Audience'").
- Node ≥ 20 and `ffmpeg`/`ffprobe` on `PATH` (CLI / library). The browser build uses ffmpeg.wasm instead.
- API keys (from env or the `apiKeys` option):
  - `GEMINI_API_KEY` (or `GOOGLE_API_KEY`): required.
  - `ASSEMBLYAI_API_KEY`: required for timing (or use `--timestamped whisper-groq` + `GROQ_API_KEY`).
  - `DEEPSEEK_API_KEY`: optional, for `--level 3` name identification.
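The key requirements above can be expressed as a small preflight check. This is a sketch of the rules as stated in the list, not something the package exports: `missingKeys`, `Preflight`, and `TimingProvider` are hypothetical names, and the level 3 rule assumes the default DeepSeek reasoner (pointing `--reasoner` at another provider would change which key is needed).

```typescript
type TimingProvider = "assemblyai" | "whisper-groq" | "none";

// Hypothetical subset of the options that affect which keys are needed.
interface Preflight {
  timestamped: TimingProvider;
  level: 1 | 2 | 3;
}

// Return the names of environment variables that are required but unset.
function missingKeys(env: Record<string, string | undefined>, opts: Preflight): string[] {
  const missing: string[] = [];
  // Either Gemini variable satisfies the multimodal LLM requirement.
  if (!env["GEMINI_API_KEY"] && !env["GOOGLE_API_KEY"]) {
    missing.push("GEMINI_API_KEY");
  }
  // The timing key depends on the chosen provider; "none" needs no key.
  if (opts.timestamped === "assemblyai" && !env["ASSEMBLYAI_API_KEY"]) {
    missing.push("ASSEMBLYAI_API_KEY");
  }
  if (opts.timestamped === "whisper-groq" && !env["GROQ_API_KEY"]) {
    missing.push("GROQ_API_KEY");
  }
  // Assumption: level 3 uses the default DeepSeek reasoner.
  if (opts.level === 3 && !env["DEEPSEEK_API_KEY"]) {
    missing.push("DEEPSEEK_API_KEY");
  }
  return missing;
}
```

In Node, passing `process.env` as the first argument before a run gives a quick list of what still needs exporting.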
git clone https://github.com/SouthBridgeAI/offmute-v2.git
cd offmute-v2 # master = the primary build
npm ci
npm run typecheck && npm run lint && npm test
npm run build # tsup → dist/ (node + browser bundles)
npm run dev -- meeting.mp4 # run from source

To compare against the experiment builds, check out the `glm` or
`opus` branch (the Opus build uses Bun: `bun install && bun run build`).
Built on three predecessors, with their hard-won lessons carried forward:
- offmute — multimodal describe→transcribe, diarization, tone.
- meeting-diary — ASR word-timestamps + speaker diarization.
- ipgu — chunk/merge discipline and structured extraction from LLM output.
Created by Southbridge. Thanks to the model teams — including z.ai for GLM-5.2 and Anthropic for Claude Opus.
License: Apache-2.0.