Benchmark on 200 careful-prompted ("claude-anti-ai") English texts of ~600 words each, covering compliance / IT / science domains.

A. Detectors

Two detectors, desklib and the Claude-Opus judge ensemble, score every method; DivEye is listed as a third, research-only detector. Each catches a different failure mode.

desklib v1.01 (Layer 1a, surface)

Status: live in API
Type: DeBERTa-v3-large classifier
Speed: ~0.5s per text on CPU

Detects cliché-LLM (100% recall on default-ai). Blind to careful-prompted text (0% recall on claude-anti-ai). Cheap, fast, free.

Claude-Opus judge ensemble (Layer 1b, semantic)

Status: research-only
Type: Claude Opus 4.7 ensemble of N agents
Speed: ~5-15s per text per agent (parallel)
Baseline anti-AI recall: 49.0%

Catches cross-corpus template patterns that desklib misses: 53% recall on anti-AI-prompted text vs 0% for desklib. A single call is noisy (variance 0.21-0.62), so an ensemble of 5+ agents is recommended.
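The way an ensemble damps single-call noise can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the function name `ensemble_verdict`, the use of a plain mean as the aggregation rule, and the 0.5 cut-off for the judge are all assumptions, and the example scores are made up.

```python
from statistics import mean


def ensemble_verdict(agent_scores, threshold=0.5, min_agents=5):
    """Average per-agent AI-likelihood scores and compare with a threshold.

    agent_scores: floats in [0, 1], one per judge agent (hypothetical format).
    Returns (mean_score, flagged).
    """
    if len(agent_scores) < min_agents:
        raise ValueError(f"need at least {min_agents} agent scores")
    avg = mean(agent_scores)
    return avg, avg >= threshold


# Made-up scores for one text, spread across the 0.21-0.62 range quoted above.
score, flagged = ensemble_verdict([0.21, 0.62, 0.55, 0.48, 0.60])
print(round(score, 3), flagged)
```

A majority vote over per-agent verdicts would be an equally plausible aggregation rule; the source does not say which one the ensemble uses.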

DivEye (Layer 1c, surprisal-based)

Status: research-only
Type: GPT-2 small token-surprisal + LogReg classifier
Speed: ~0.5s per text on CPU

Published in TMLR (Feb 2026), arXiv:2509.18880. Uses a 10-dim feature vector (mean/std/skew/kurt/acf-1 of the surprisal sequence). Trained on 706 AI texts + 32 Wikipedia texts; CV accuracy 99.59%, ROC AUC 0.96. Catches careful-prompted text where desklib is blind (100% vs 0%).
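The summary statistics named above can be sketched as follows. This is not DivEye's reference implementation: it computes only the five statistics the text names (how the full 10-dim vector is assembled is not stated here), it assumes the per-token surprisals have already been computed (for example as negative log-probabilities from GPT-2 small), and the function name `surprisal_features` is hypothetical.

```python
import numpy as np


def surprisal_features(surprisal):
    """Return [mean, std, skewness, excess kurtosis, lag-1 autocorrelation]."""
    s = np.asarray(surprisal, dtype=float)
    if s.size < 3:
        raise ValueError("need at least 3 token surprisals")
    mu = s.mean()
    sigma = s.std()  # population standard deviation
    if sigma == 0.0:
        # A constant sequence has no shape; report zeros for the other features.
        return np.array([mu, 0.0, 0.0, 0.0, 0.0])
    z = (s - mu) / sigma
    skew = np.mean(z ** 3)
    kurt = np.mean(z ** 4) - 3.0  # excess kurtosis (0 for a normal distribution)
    acf1 = np.sum(z[:-1] * z[1:]) / s.size  # lag-1 autocorrelation
    return np.array([mu, sigma, skew, kurt, acf1])


# Made-up surprisal values (in nats) for a short token sequence.
features = surprisal_features([2.1, 5.4, 0.7, 3.3, 8.0, 1.2, 4.6])
print(features.shape)
```

The resulting vector would then be fed to a logistic-regression classifier, as the Type line above describes.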

B. Paraphrasers / humanizers

Higher %flagged means the method makes text more AI-looking to that detector — i.e. an anti-humanizer on careful-prompted source. Lower is better.

Method | Engine | Speed | desklib %flagged | claude-judge %flagged | Status
Baseline (no paraphrasing)

Original text as-is, no paraphrasing applied.

desklib is blind to careful-prompted text (0% flagged anti-AI); catches 100% cliché.

Speed: instant | desklib %flagged: 0.0% (n=200) | Status: live in API
humarin (T5-base paraphraser)

Local T5 paraphraser + post-process. Adds em-dashes and word substitutions.

Anti-helper on anti-AI source: after paraphrasing, desklib flags 48% of careful-prompted texts as AI, up from 0% at baseline. Mean drop -0.49 (a negative drop means the desklib score rose).

Engine: humarin/chatgpt_paraphraser_on_T5_base (local, MPS) | Speed: ~5s per text | desklib %flagged: 48.0% (n=200) | Status: live in API
DIPPER-XXL (11B paraphraser, best-of-3)

Large discourse-level paraphraser introduced by Krishna et al. (2023) as an attack on AI-text detectors. Best of 3 samples per text.

Partial run (105/200). Mean drop -0.109, about 4.5× smaller in magnitude than humarin's -0.49 on the same anti-AI source.

Engine: kalpeshk2011/dipper-paraphraser-xxl (local, bf16, MPS) | Speed: ~60s per text × 3 samples | desklib %flagged: 4.8% (n=105) | Status: live in API
Claude-rewrite (7 tactics)

Claude Opus 4.7 rewrites in diversified styles (practitioner / news / op-ed / case-study / polemic / conversational / scholarly).

Invisible to desklib (0.0% flagged) but 73.5% flagged by the Claude-judge ensemble. Each tactic creates its own cross-corpus signature.

Tactic | n  | mean desklib | % flagged
A      | 32 | 0.005        | 0.0%
B      | 30 | 0.004        | 0.0%
C      | 28 | 0.011        | 0.0%
D      | 29 | 0.006        | 0.0%
E      | 24 | 0.006        | 0.0%
F      | 27 | 0.001        | 0.0%
G      | 30 | 0.017        | 0.0%
Engine: Claude Opus 4.7 via subagent dispatch | Speed: ~30s per text | desklib %flagged: 0.0% (n=200) | claude-judge %flagged: 73.5% (n=200) | Status: research-only
Claude-translator T1-DE (round-trip)

Claude Opus 4.7 translates EN→DE→EN with named-entity (NE) preservation. Round-trip back-translation via subagent dispatch.

Round-trip via Claude (not NLLB). NEs preserved verbatim. Output is 91-100% of source length. Mean drop -0.000. Sample N=20. Quality far exceeds NLLB-200-distilled-600M, which gave a 0.43× length ratio and repetition loops on the same texts.

Engine: Claude Opus 4.7 via 4 parallel subagents | Speed: ~30s per text | desklib %flagged: 0.0% (n=20) | claude-judge %flagged: 35.0% (n=20) | Status: research-only
Claude-translator T2-ZH (Chinese pivot)

Claude Opus 4.7 round-trip EN→Simplified Chinese→EN. Distant-pivot translation forces deeper syntactic restructuring.

The Chinese pivot restructures more aggressively than the German one. NEs preserved verbatim. Length 1.01-1.06× source. Mean delta from baseline +0.003, desklib %flagged 0.0%. Sample N=20.

Engine: Claude Opus 4.7 via 4 parallel subagents | Speed: ~50s per text | desklib %flagged: 0.0% (n=20) | claude-judge %flagged: 35.0% (n=20) | Status: research-only
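The "mean drop" and "mean delta" figures in the method notes compare desklib scores before and after paraphrasing. A minimal sketch of that calculation follows. It assumes the drop is the baseline score minus the paraphrased score, averaged over texts, so a negative drop means the score rose. That sign convention is inferred from the notes, not stated in them. The function name `mean_drop` is hypothetical and the scores are made up.

```python
def mean_drop(baseline_scores, paraphrased_scores):
    """Average of (baseline - paraphrased) detector scores over paired texts.

    Assumed convention: positive = paraphrasing lowered the detector score,
    negative = paraphrasing made the text look more AI to the detector.
    """
    if len(baseline_scores) != len(paraphrased_scores):
        raise ValueError("score lists must be paired, one entry per text")
    if not baseline_scores:
        raise ValueError("need at least one text")
    diffs = [b - p for b, p in zip(baseline_scores, paraphrased_scores)]
    return sum(diffs) / len(diffs)


# Made-up desklib scores for four texts before and after paraphrasing.
baseline = [0.01, 0.02, 0.03, 0.02]
after = [0.60, 0.40, 0.55, 0.49]
drop = mean_drop(baseline, after)
print(f"mean drop: {drop:+.2f}")
```

Under this convention a "mean delta from baseline" is the same quantity with the sign flipped.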

C. Methodology

Every method runs against the same 200 careful-prompted English texts ("claude-anti-ai" corpus, ~600 words each, drawn from compliance / IT / science domains). Each output is scored by both detectors. The %flagged column reports the share of texts in that cell's sample whose detector score is ≥ 0.5; the mean is the average raw score across all texts in that sample.
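The two per-cell statistics can be sketched as below. The 0.5 threshold and the definitions come from the paragraph above; the function name `summarize_cell`, the dictionary layout, and the example scores are illustrative assumptions.

```python
def summarize_cell(scores, threshold=0.5):
    """Compute n, %flagged (score >= threshold) and mean raw score for one cell."""
    n = len(scores)
    if n == 0:
        raise ValueError("cell has no scored texts")
    flagged = sum(1 for s in scores if s >= threshold)
    return {
        "n": n,
        "pct_flagged": 100.0 * flagged / n,
        "mean": sum(scores) / n,
    }


# Made-up detector scores for a four-text sample.
cell = summarize_cell([0.10, 0.50, 0.72, 0.49])
print(f"{cell['pct_flagged']:.1f}% (n={cell['n']}), mean {cell['mean']:.3f}")
```

A score of exactly 0.5 counts as flagged, because the comparison is inclusive.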

Sample sizes (n=…) differ between methods because DIPPER and Claude-judge are still mid-batch — we publish partial results rather than wait for full coverage. Numbers will move as batches complete.

The full write-up (per-method breakdown, cost analysis, failure modes, cascade results) lives in the final benchmark report.

D. Verdict

+---------------------------+---------------------+---------------------+
| Category                  | Winner              | Number              |
+---------------------------+---------------------+---------------------+
| Best vs cliché-LLM        | desklib detector    | 100% recall on AI   |
| Best vs careful-prompted  | Claude-judge        | 49% recall baseline |
|                           |                     | 73.5% post-rewrite  |
| Best humanizer (desklib)  | Claude-rewrite (7G) | 0.0% flagged        |
| Worst (anti-humanizer)    | humarin             | 48.0% flagged       |
| Best speed/cost ratio     | humarin             | ~5s, free, local    |
| Largest model tested      | DIPPER-XXL (11B)    | ~60s × 3 samples    |
| Most defensive (research) | Claude-rewrite      | 7 tactics A..G      |
+---------------------------+---------------------+---------------------+

Reading note: "best" on desklib means lowest %flagged — the method leaves careful-prompted text undetected. On Claude-judge "best" is inverted: a high %flagged on careful-prompted source means the detector recognises the rewrite as machine-shaped, which is what you want from a detector you sell.