Method ratings
200-text benchmark on careful-prompted English text (compliance / IT / science domains), evaluated by desklib v1.01 and a Claude-Opus judge ensemble. Snapshot v0.3-2026-05-14, taken 2026-05-14.
A. Detectors
Two detectors, desklib and the Claude-Opus judge ensemble, are used to score every method; DivEye is listed as a third, research-only detector. Each catches a different failure mode.
desklib v1.01 (Layer 1a, surface)
Status: live in API
- Type: DeBERTa-v3-large classifier
- Speed: ~0.5s per text on CPU
Detects cliché-LLM (100% recall on default-ai). Blind to careful-prompted text (0% recall on claude-anti-ai). Cheap, fast, free.
Claude-Opus judge ensemble (Layer 1b, semantic)
Status: research-only
- Type: Claude Opus 4.7 ensemble of N agents
- Speed: ~5-15s per text per agent (parallel)
- Baseline anti-AI recall: 49.0%
Catches cross-corpus template patterns desklib misses. 53% recall on anti-AI prompted vs 0% for desklib. Single call is noisy (variance 0.21-0.62) — ensemble of 5+ recommended.
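The ensemble recommendation can be illustrated with a minimal sketch. It assumes each judge call returns a score in [0, 1] and that the ensemble verdict is the mean score compared against the 0.5 threshold; the aggregation rule, the function name `ensemble_verdict`, and the example scores are all assumptions for illustration, not the benchmark's actual implementation.

```python
from statistics import mean


def ensemble_verdict(scores, threshold=0.5):
    """Combine per-agent AI-likelihood scores (each in [0, 1]) into one verdict.

    Assumption: the ensemble score is the plain mean of the agent scores.
    """
    if not scores:
        raise ValueError("need at least one judge score")
    avg = mean(scores)
    return {
        "mean_score": avg,
        # Gap between the most and least suspicious agent, as a noise indicator.
        "spread": max(scores) - min(scores),
        "flagged": avg >= threshold,
        # Number of individual agents that would have flagged the text alone.
        "votes_for_ai": sum(s >= threshold for s in scores),
    }


# Hypothetical scores from five independent judge calls on one text.
result = ensemble_verdict([0.21, 0.62, 0.55, 0.58, 0.60])
```

In this made-up example a single call could have landed on either side of the threshold (0.21 versus 0.62), while the mean of five calls gives a steadier answer.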
DivEye (Layer 1c, surprisal-based)
Status: research-only
- Type: GPT-2 small token-surprisal + LogReg classifier
- Speed: ~0.5s per text on CPU
TMLR Feb 2026, arXiv:2509.18880. 10-dim feature vector (mean/std/skew/kurt/acf-1 of surprisal sequence). Trained on 706 AI + 32 Wikipedia, CV accuracy 99.59%, ROC AUC 0.96. Catches careful-prompted text where desklib is blind (100% vs 0%).
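As a rough illustration of the five statistics named above, the sketch below computes mean, standard deviation, skewness, kurtosis, and lag-1 autocorrelation for a sequence of per-token surprisal values that has already been computed. It does not run GPT-2, it does not reproduce the full 10-dimensional vector (the composition of the remaining dimensions is not stated here), and the choice of population moments and excess kurtosis is an assumption. The function name `surprisal_stats` is hypothetical.

```python
from math import sqrt


def surprisal_stats(surprisal):
    """Return [mean, std, skew, excess kurtosis, lag-1 autocorrelation].

    `surprisal` is a sequence of per-token surprisal values, assumed to be
    precomputed by a language model such as GPT-2 small.
    """
    xs = [float(v) for v in surprisal]
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three tokens")
    mu = sum(xs) / n
    var = sum((v - mu) ** 2 for v in xs) / n  # population variance
    if var == 0:
        raise ValueError("constant sequence has no defined skew or kurtosis")
    sd = sqrt(var)
    z = [(v - mu) / sd for v in xs]  # standardized values
    skew = sum(v ** 3 for v in z) / n
    kurt = sum(v ** 4 for v in z) / n - 3.0  # excess kurtosis (assumption)
    # Lag-1 autocorrelation: adjacent-pair covariance over total variance.
    acf1 = sum(a * b for a, b in zip(z, z[1:])) / n
    return [mu, sd, skew, kurt, acf1]


# Toy sequence that alternates between low and high surprisal.
feats = surprisal_stats([1.0, 3.0, 1.0, 3.0, 1.0, 3.0])
```

A feature vector like this would then be passed to a classifier such as logistic regression; that step is omitted here.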
B. Paraphrasers / humanizers
Higher %flagged means the method makes text look more AI-generated to that detector, i.e. it acts as an anti-humanizer on careful-prompted source. Lower is better.
| Method | Description | Engine | Speed | desklib %flagged | claude-judge %flagged | Status |
|---|---|---|---|---|---|---|
| Baseline (no paraphrasing) | Original text as-is, no paraphrasing applied. desklib is blind to careful-prompted text (0% flagged on anti-AI) but catches 100% of cliché text. | n/a | instant | 0.0% (n=200) | not reported | live in API |
| humarin (T5-base paraphraser) | Local T5 paraphraser plus post-processing. Adds em-dashes and word substitutions. Acts as an anti-humanizer on anti-AI source: makes 48% of careful-prompted texts look more AI-generated to desklib. Mean drop −0.49. | humarin/chatgpt_paraphraser_on_T5_base (local, MPS) | ~5s per text | 48.0% (n=200) | not reported | live in API (/humanise) |
| DIPPER-XXL (11B paraphraser, best-of-3) | Large adversarial paraphraser introduced by Krishna et al. (2023). Best of 3 samples per text. Partial run (105/200). Mean drop −0.109, roughly 4× less destructive than humarin on anti-AI source. | kalpeshk2011/dipper-paraphraser-xxl (local, bf16, MPS) | ~60s per text × 3 samples | 4.8% (n=105) | not reported | live in API (/humanise) |
| Claude-rewrite (7 tactics) | Claude Opus 4.7 rewrites in seven diversified styles (practitioner / news / op-ed / case-study / polemic / conversational / scholarly). Invisible to desklib (0.0% flagged) but 73.5% flagged by the Claude-judge ensemble. Each tactic creates its own cross-corpus signature. | Claude Opus 4.7 via subagent dispatch | ~30s per text | 0.0% (n=200) | 73.5% (n=200) | research-only |
| Claude-translator T1-DE (round-trip) | Claude Opus 4.7 round-trips EN→DE→EN via subagent dispatch (not NLLB), preserving named entities (NEs) verbatim. Output is 91-100% of source length. Mean drop −0.000. Sample N=20. Quality vastly exceeds NLLB-200-distilled-600M, which gave a 0.43× ratio plus repetition loops on the same texts. | Claude Opus 4.7 via 4 parallel subagents | ~30s per text | 0.0% (n=20) | 35.0% (n=20) | research-only |
| Claude-translator T2-ZH (Chinese pivot) | Claude Opus 4.7 round-trips EN→Simplified Chinese→EN. The distant pivot forces deeper, more aggressive syntactic restructuring than German. NEs preserved verbatim. Length 1.01-1.06× source. Mean delta from baseline +0.003. Sample N=20. | Claude Opus 4.7 via 4 parallel subagents | ~50s per text | 0.0% (n=20) | 35.0% (n=20) | research-only |
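The length figures quoted for the round-trip translators (91-100% and 1.01-1.06× of source length, against 0.43× for NLLB) suggest a simple sanity check on paraphraser output. The sketch below assumes length is measured in whitespace-separated words, which the benchmark does not specify; the function names and the 0.9-1.1 acceptance band are hypothetical.

```python
def length_ratio(source, rewritten):
    """Ratio of rewritten length to source length, in whitespace-split words.

    Assumption: word count is the length unit; characters or tokens would
    give somewhat different ratios.
    """
    src_words = len(source.split())
    if src_words == 0:
        raise ValueError("source text is empty")
    return len(rewritten.split()) / src_words


def within_band(ratio, low=0.9, high=1.1):
    """Check a ratio against a hypothetical acceptance band."""
    return low <= ratio <= high


ratio = length_ratio(
    "the audit log must be retained for five years",
    "audit logs must be kept for five years",
)
ok = within_band(ratio)
```

A ratio far below 1.0, such as the 0.43× reported for NLLB, would indicate that the rewrite dropped content rather than rephrased it.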
C. Methodology
Every method runs against the same 200 careful-prompted English texts ("claude-anti-ai" corpus, ~600 words each, drawn from compliance / IT / science domains). Each output is scored by both detectors. The %flagged column counts texts where the detector's score ≥ 0.5; the mean is the average raw score across all texts in that cell's sample.
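The two per-cell numbers described above can be sketched as follows. The threshold of 0.5 and the definitions of %flagged and mean come from the paragraph above; the function name `summarize` and the example scores are hypothetical.

```python
def summarize(scores, threshold=0.5):
    """Compute sample size, %flagged (score >= threshold), and mean raw score."""
    if not scores:
        raise ValueError("need at least one score")
    n = len(scores)
    flagged = sum(1 for s in scores if s >= threshold)
    return {
        "n": n,
        "pct_flagged": 100.0 * flagged / n,
        "mean_score": sum(scores) / n,
    }


# Hypothetical detector scores for four texts in one table cell.
cell = summarize([0.10, 0.50, 0.72, 0.30])
```

Note that a score of exactly 0.5 counts as flagged, because the comparison is inclusive.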
Sample sizes (n=…) differ between methods because DIPPER and Claude-judge are still mid-batch; we publish partial results rather than wait for full coverage. Numbers will move as batches complete.
The full write-up (per-method breakdown, cost analysis, failure modes, cascade results) lives in the final benchmark report.
D. Verdict
| Category | Winner | Number |
|---|---|---|
| Best vs cliché-LLM | desklib detector | 100% recall on AI |
| Best vs careful-prompted | Claude-judge | 49% recall baseline, 73.5% post-rewrite |
| Best humanizer (desklib) | Claude-rewrite (7G) | 0.0% flagged |
| Worst (anti-humanizer) | humarin | 48.0% flagged |
| Best speed/cost ratio | humarin | ~5s, free, local |
| Largest model tested | DIPPER-XXL (11B) | ~60s × 3 samples |
| Most defensive (research) | Claude-rewrite | 7 tactics A..G |
Reading note: "best" on desklib means lowest %flagged — the method leaves careful-prompted text undetected. On Claude-judge "best" is inverted: a high %flagged on careful-prompted source means the detector recognises the rewrite as machine-shaped, which is what you want from a detector you sell.