Model card · public abridged

DeepBlocker Detect

Telephony-tuned audio deepfake detector. Production model: xlsr-mamba-g711-v5. This page is the public abridged version of the internal model card. Methodology and headline numbers are reproducible from the open base model; corpus, weights, and inference endpoint are gated.

Why this model exists

We started from XLSR-Mamba-LA (MIT-licensed, 2024), an academic detector that performs strongly on clean studio audio (ASVspoof 2021 Logical Access). When deployed to a live G.711 telephony pipeline, the published model was effectively inverted: real human voices on phone lines were flagged as deepfakes, and smooth synthetic voices were called real.

Diagnosis: the model had learned vocoder artefacts on clean audio. G.711 lossy compression introduces spectral distortions that look superficially similar. Solution: fine-tune the classification head on G.711-encoded audio with correct labels, keeping the SSL frontend frozen so the rich multilingual feature representation is preserved. Five iterations later, v5 is the model in production.

Headline performance

Measured on a held-out Reporting set (n ≈ 634, SHA-256 pinned, used exactly once) at the production operating point.

98.5% mean confidence on ElevenLabs deepfakes — the voice-cloning provider fraudsters actually use. Auto-flagged on 99% of attempts.
94.7% deepfake recall — auto-catches roughly 95 of 100 fraud calls.
4.9% false-alarm rate, measured over FAKE-band predictions (1 − precision): only about 1 in 20 FAKE-band predictions is wrong.
10.2% miss rate — 1 in 10 deepfakes escapes the FAKE band; most slip into UNCERTAIN, not REAL.
0.9% review-queue load — only ~1 in 110 calls needs human eyes.
ROC AUC ≥ 0.92 on the locked Reporting set. The same population is preserved across re-evaluations via the SHA-256 manifest.
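
The false-alarm, miss, and review-queue rates above can be written out explicitly. The following is a minimal sketch, assuming each call carries a true label and a predicted band. The function name `band_metrics` and the toy counts are hypothetical; they are not the Reporting-set figures and this is not the production evaluation code.

```python
from collections import Counter

def band_metrics(records):
    """Compute band-level rates from (true_label, predicted_band) pairs.

    true_label is "REAL" or "FAKE"; predicted_band is "REAL", "UNCERTAIN" or "FAKE".
    """
    counts = Counter(records)
    total = sum(counts.values())
    fake_total = sum(n for (label, _), n in counts.items() if label == "FAKE")
    fake_band_total = sum(n for (_, band), n in counts.items() if band == "FAKE")
    uncertain_total = sum(n for (_, band), n in counts.items() if band == "UNCERTAIN")

    return {
        # Share of FAKE-band predictions whose true label is REAL.
        "false_alarm_rate": counts[("REAL", "FAKE")] / fake_band_total,
        # Share of true deepfakes that did not land in the FAKE band.
        "miss_rate": (fake_total - counts[("FAKE", "FAKE")]) / fake_total,
        # Share of all calls routed to human review.
        "review_queue_load": uncertain_total / total,
    }

# Toy counts for illustration only (hypothetical, not measured).
toy = (
    [("REAL", "REAL")] * 90
    + [("REAL", "UNCERTAIN")] * 8
    + [("REAL", "FAKE")] * 2
    + [("FAKE", "FAKE")] * 38
    + [("FAKE", "UNCERTAIN")] * 10
    + [("FAKE", "REAL")] * 2
)
metrics = band_metrics(toy)
```

Note that the false-alarm rate in this sketch is normalised by FAKE-band predictions, while the miss rate is normalised by true deepfakes, so the two do not sum to anything meaningful.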

Per-source separation

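All real sources average inside the REAL band. Of the fake sources, ElevenLabs and OpenAI tts-1-hd average inside the FAKE band, while OpenAI Realtime and the AI-vs-AI synthetic caller average in UNCERTAIN. Mean fake-probability scores on the Reporting set: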

Common Voice (REAL)	0.05
LibriSpeech (REAL)	0.35
LibriSpeech-other (REAL)	0.42
Human PSTN — Twilio callers (REAL)	0.44
AI-vs-AI synthetic caller (FAKE)	0.88
OpenAI Realtime (FAKE)	0.90
ElevenLabs (FAKE)	0.985
OpenAI tts-1-hd (FAKE)	0.998

REAL band: fake_prob ≤ 0.50. FAKE band: fake_prob ≥ 0.95. Between: UNCERTAIN, sent to human review.
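
The band rule can be sketched as a small function. This is an illustration under the thresholds stated above; the names `band`, `REAL_MAX`, and `FAKE_MIN` are hypothetical and do not describe the hosted API.

```python
REAL_MAX = 0.50  # REAL band: fake_prob <= 0.50
FAKE_MIN = 0.95  # FAKE band: fake_prob >= 0.95

def band(fake_prob: float) -> str:
    """Map a fake-probability score to REAL, UNCERTAIN, or FAKE."""
    if not 0.0 <= fake_prob <= 1.0:
        raise ValueError(f"fake_prob must be in [0, 1], got {fake_prob}")
    if fake_prob >= FAKE_MIN:
        return "FAKE"
    if fake_prob <= REAL_MAX:
        return "REAL"
    return "UNCERTAIN"  # sent to human review

# Per-source mean scores from the table above.
mean_scores = {
    "Common Voice": 0.05,
    "LibriSpeech": 0.35,
    "LibriSpeech-other": 0.42,
    "Human PSTN": 0.44,
    "AI-vs-AI synthetic caller": 0.88,
    "OpenAI Realtime": 0.90,
    "ElevenLabs": 0.985,
    "OpenAI tts-1-hd": 0.998,
}
bands_of_means = {source: band(score) for source, score in mean_scores.items()}
```

Banding a per-source mean only shows where the source sits on average; individual calls from the same source can fall in a different band.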

Training data

The 18,062-sample corpus is split 6.2 : 1 REAL : FAKE. Provenance and licences below. Stratified group splits prevent speaker, voice, and call-id leakage across train, val, test, and Reporting partitions.

Real — human PSTN (Twilio test calls)	202
Real — LibriSpeech (CC-BY 4.0)	7,350
Real — Common Voice 25.0 EN (CC0-1.0)	8,000
Fake — OpenAI Realtime API	388
Fake — synthetic_caller (AI-vs-AI calls)	122
Fake — ElevenLabs eleven_multilingual_v2	1,200
Fake — OpenAI tts-1-hd	800
Total	18,062
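
As a quick arithmetic check, the total and the REAL : FAKE ratio can be recomputed from the per-source counts in the table. The dictionary layout below is only an illustrative way to hold the counts.

```python
# Per-source sample counts copied from the table above.
corpus = {
    ("REAL", "human PSTN (Twilio test calls)"): 202,
    ("REAL", "LibriSpeech"): 7_350,
    ("REAL", "Common Voice 25.0 EN"): 8_000,
    ("FAKE", "OpenAI Realtime API"): 388,
    ("FAKE", "synthetic_caller"): 122,
    ("FAKE", "ElevenLabs eleven_multilingual_v2"): 1_200,
    ("FAKE", "OpenAI tts-1-hd"): 800,
}

n_real = sum(n for (label, _), n in corpus.items() if label == "REAL")
n_fake = sum(n for (label, _), n in corpus.items() if label == "FAKE")
total = n_real + n_fake
real_to_fake = n_real / n_fake

print(f"REAL={n_real} FAKE={n_fake} total={total} ratio={real_to_fake:.1f}:1")
```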

Augmentation chain (applied identically to every sample): bandlimit to 300–3,400 Hz, optional simple reverb, white noise at 8–35 dB SNR, AGC + hard limiter, random codec round-trip across G.711 µ-law / G.711 A-law / G.722.
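
Two of the steps in this chain, additive white noise at a target SNR and a µ-law codec round-trip, can be sketched in plain Python. This is an assumption-laden illustration, not the production chain: it uses the continuous µ-law companding formula with 8-bit quantisation rather than the bit-exact segmented G.711 encoder, it omits band-limiting, reverb, AGC, the limiter, A-law, and G.722, and the function names are hypothetical.

```python
import math
import random

MU = 255.0  # companding constant used by mu-law

def add_white_noise(samples, snr_db, rng):
    """Add Gaussian white noise so the result has roughly the requested SNR."""
    signal_power = sum(x * x for x in samples) / len(samples)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in samples]

def mulaw_round_trip(samples):
    """Compress samples in [-1, 1] to 8-bit mu-law codes and expand them back."""
    log_scale = math.log1p(MU)
    out = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # hard-clip to the codec's input range
        # Compress: sign(x) * ln(1 + mu*|x|) / ln(1 + mu)
        y = math.copysign(math.log1p(MU * abs(x)) / log_scale, x)
        # Quantise to one of 256 levels, then map back to [-1, 1].
        code = round((y + 1.0) / 2.0 * 255.0)
        y_hat = code / 255.0 * 2.0 - 1.0
        # Expand: sign(y) * ((1 + mu)^|y| - 1) / mu
        out.append(math.copysign(math.expm1(abs(y_hat) * log_scale) / MU, y_hat))
    return out

rng = random.Random(0)
# One second of a 1 kHz tone at the 8 kHz telephony sample rate.
tone = [0.5 * math.sin(2.0 * math.pi * 1000.0 * n / 8000.0) for n in range(8000)]
snr_db = rng.uniform(8.0, 35.0)  # SNR range stated in the chain above
noisy = add_white_noise(tone, snr_db, rng)
decoded = mulaw_round_trip(noisy)
```

The point of the round-trip is that the model sees the quantisation distortion of a telephony codec during training, on real and fake samples alike, so the distortion itself carries no label information.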

Training procedure

Base: XLSR-Mamba-LA (MIT-licensed, 2024). SSL frontend (XLSR-300M Wav2Vec2, 315 M params) frozen.
Trainable head: DuaBiMamba (12 stacked Mamba blocks, hidden 256) + 2-class classifier — 1.9 M trainable parameters.
Loss: BCEWithLogitsLoss with pos_weight = n_real / n_fake. Optimiser: AdamW, weight decay 0.01. Learning rate 5e-5 with cosine decay → 5e-7, 100-step warmup.
Effective batch size 32 (8 × 4 grad-accumulation). Window length 4 s. Patience-5 early stop. The best epoch is typically reached within the first 1–2 epochs after a warm start.
Hardware: A100-40GB. Total wall-clock for v5: ~50 minutes.
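
The class weight and learning-rate schedule from the list above can be sketched numerically. This is a minimal illustration under stated assumptions: the class counts are the corpus-wide totals from the training-data table (the actual value would come from the training split), the warmup is assumed to be linear, and `total_steps` is a hypothetical parameter because the card does not state the step count.

```python
import math

# Corpus-wide class counts from the training-data table (assumption: the
# training split has roughly the same REAL : FAKE ratio).
N_REAL = 15_552
N_FAKE = 2_510
pos_weight = N_REAL / N_FAKE  # up-weights the minority FAKE (positive) class

PER_STEP_BATCH = 8
GRAD_ACCUM_STEPS = 4
effective_batch = PER_STEP_BATCH * GRAD_ACCUM_STEPS

PEAK_LR = 5e-5
FINAL_LR = 5e-7
WARMUP_STEPS = 100

def learning_rate(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay down to FINAL_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(1.0, progress)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run length, for illustration only.
schedule = [learning_rate(s, total_steps=1_000) for s in range(1_001)]
```

With `pos_weight` near 6.2, a misclassified FAKE sample costs about six times as much as a misclassified REAL sample, which offsets the 6.2 : 1 imbalance in the corpus.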

Honest limitations

OpenAI Realtime / GPT-4o-voice currently lands in UNCERTAIN, not FAKE. Auto-disconnect does not fire on it; it is surfaced to the review queue. The v6 roadmap adds more OpenAI Realtime training samples (v5 trained on 388) to push the mean past the FAKE band threshold.
Miss rate of 10.2% is a deliberate trade-off versus an earlier iteration that had 7.3% — v5 chose tighter REAL clustering (real callers correctly classified as REAL) over tighter FAKE clustering. Most of the additional misses are borderline OpenAI Realtime / synthetic_caller cases, not ElevenLabs.
Reporting set size is ~634 samples with a per-source minimum of 50. That is adequate for headline numbers, but confidence intervals are wider than on a 1,000-sample academic benchmark.
TTS providers we did not train on (Bark, Coqui XTTS-v2, Vall-E, custom voice clones) are not represented in evaluation.
Multilingual support is not in v5. The corpus is English-only; separate fine-tunes per language are on the v6 roadmap.
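
To make the point about Reporting-set size concrete, the width of a 95% Wilson score interval can be compared across sample sizes. This is a sketch under assumptions: the proportion 0.90 is an arbitrary illustrative value, not a figure from this card, and rates computed on a subset (for example only the fake samples, or a single source with 50 samples) have a smaller effective n than 634.

```python
import math

def wilson_interval(p_hat: float, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half

def interval_width(p_hat: float, n: int) -> float:
    lo, hi = wilson_interval(p_hat, n)
    return hi - lo

# Per-source minimum, full Reporting set, and a 1,000-sample benchmark.
widths = {n: interval_width(0.90, n) for n in (50, 634, 1000)}
for n, width in widths.items():
    print(f"n={n:>4}: 95% interval width = {width:.3f}")
```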

Reproducibility

Every claim on this page is backed by a measurement against a Reporting set held out of training and used exactly once. The set is locked via SHA-256 hash of the sorted per-sample (path, label) tuples. Future re-evaluations are measurements of the same population — no silent re-use.
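
A manifest hash of this kind can be sketched as follows. The serialisation (compact JSON, UTF-8) and the function name `manifest_sha256` are assumptions made for illustration, as are the example paths; the internal card defines the actual byte format, so this sketch will not reproduce the pinned hash.

```python
import hashlib
import json

def manifest_sha256(samples):
    """Hash the sorted (path, label) tuples so the digest ignores input order."""
    ordered = sorted((str(path), str(label)) for path, label in samples)
    payload = json.dumps(ordered, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical paths, for illustration only.
reporting_set = [
    ("reporting/real/call_0001.wav", "REAL"),
    ("reporting/fake/clip_0001.wav", "FAKE"),
    ("reporting/real/call_0002.wav", "REAL"),
]
digest = manifest_sha256(reporting_set)

# Any added, removed, renamed, or relabelled sample changes the digest.
relabelled = [("reporting/real/call_0001.wav", "FAKE")] + reporting_set[1:]
assert manifest_sha256(relabelled) != digest
```

Sorting before hashing means the digest identifies the set of samples rather than the order in which a loader happens to list them.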

The base model (XLSR-Mamba-LA) is MIT-licensed and published with a paper. Our fine-tune procedure, augmentation chain, training hyperparameters, and Reporting SHA-256 are documented in the internal model card, which is available under NDA for technical evaluation.

Versioning and what's next

v5 has been live in production since 2026-05-04. The v6 plan adds OpenAI Realtime and GPT-4o-voice training samples, additional synthetic_caller examples, defence against TTS providers we don't currently see (Bark, Coqui, Vall-E), production-traffic continuous evaluation, and multilingual fine-tunes (FR / ES / DE). v6 acceptance bar: ROC AUC stays ≥ 0.92 on G.711 aggregate.

For platforms

Want to integrate DeepBlocker Detect?

Detect is available as a hosted API for fraud, KYC, and identity-verification platforms. Same model that powers Real-Time Protection.
