Method ratings

A. Detectors

Three detectors are described below. Two of them, desklib and the Claude-judge ensemble, score every method in section B. Each detector catches a different failure mode.

desklib v1.01 (Layer 1a, surface)

Type: DeBERTa-v3-large classifier
Speed: ~0.5s per text on CPU

Detects cliché-LLM text (100% recall on default-ai) but is blind to careful-prompted text (0% recall on claude-anti-ai). Cheap, fast, free.

Claude-Opus judge ensemble (Layer 1b, semantic)

Type: Claude Opus 4.7 ensemble of N agents
Speed: ~5-15s per text per agent (parallel)
Baseline anti-AI recall: 49.0%

Catches cross-corpus template patterns that desklib misses: 53% recall on anti-AI prompted text versus 0% for desklib. A single call is noisy (single-call results vary between 0.21 and 0.62), so an ensemble of 5 or more agents is recommended.
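The recommendation to use 5 or more agents can be illustrated with a minimal sketch. It assumes each agent returns an AI-likelihood score in [0, 1] and that the ensemble averages those scores before applying a 0.5 threshold. The source does not state the actual aggregation rule, so `ensemble_verdict`, its parameters, and the example scores are all hypothetical.

```python
from statistics import mean


def ensemble_verdict(agent_scores, threshold=0.5, min_agents=5):
    """Average per-agent scores and flag the text if the mean reaches the threshold.

    Assumption: mean aggregation. The real ensemble may vote or weight agents differently.
    """
    if len(agent_scores) < min_agents:
        raise ValueError(f"need at least {min_agents} agent scores, got {len(agent_scores)}")
    avg = mean(agent_scores)
    return {"n_agents": len(agent_scores), "mean_score": avg, "flagged": avg >= threshold}


# Made-up scores for one text; individual calls disagree, the average is more stable.
scores = [0.21, 0.62, 0.55, 0.48, 0.70]
result = ensemble_verdict(scores)
print(result)
```

Averaging is only one option. A majority vote over per-agent verdicts would be an equally plausible reading of "ensemble".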

DivEye (Layer 1c, surprisal-based)

Type: GPT-2 small token-surprisal + LogReg classifier
Speed: ~0.5s per text on CPU

Published in TMLR, Feb 2026 (arXiv:2509.18880). Uses a 10-dim feature vector built from statistics of the surprisal sequence (mean, std, skew, kurtosis, lag-1 autocorrelation). Trained on 706 AI and 32 Wikipedia texts; CV accuracy 99.59%, ROC AUC 0.96. Catches careful-prompted text where desklib is blind (100% vs 0%).
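The five statistics named above can be sketched in plain Python. This is an illustration under assumptions, not the DivEye implementation: it takes an already-computed list of per-token surprisal values (in the real pipeline these would come from GPT-2 small), uses population (biased) estimators, and covers only the five named statistics, since the source does not say how the full 10-dim vector is composed. The function name `surprisal_stats` is hypothetical.

```python
import math


def surprisal_stats(surprisals):
    """Return mean, std, skewness, excess kurtosis and lag-1 autocorrelation.

    Assumption: population (divide-by-n) estimators; the paper may use others.
    """
    n = len(surprisals)
    if n < 3:
        raise ValueError("need at least 3 surprisal values")
    mu = sum(surprisals) / n
    centered = [s - mu for s in surprisals]
    var = sum(c * c for c in centered) / n
    if var == 0:
        raise ValueError("constant sequence: higher moments are undefined")
    std = math.sqrt(var)
    skew = sum(c ** 3 for c in centered) / (n * std ** 3)
    kurt = sum(c ** 4 for c in centered) / (n * std ** 4) - 3.0
    # Lag-1 autocorrelation: covariance of neighbouring tokens over the variance.
    acf1 = sum(centered[i] * centered[i + 1] for i in range(n - 1)) / (n * var)
    return {"mean": mu, "std": std, "skew": skew, "kurt": kurt, "acf1": acf1}


# Toy sequence that alternates between low and high surprisal.
features = surprisal_stats([2.0, 4.0, 2.0, 4.0])
print(features)
```

A feature dictionary like this would then be fed to a logistic-regression classifier, which is the "LogReg" part of the detector type.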

B. Paraphrasers / humanizers

A higher %flagged means the method makes text look more AI-generated to that detector, i.e. it acts as an anti-humanizer on careful-prompted source. Lower is better.

Fields per method: Engine, Speed, desklib %flagged, claude-judge %flagged, Status.

Baseline (no paraphrasing)

Original text as-is, no paraphrasing applied. desklib is blind to careful-prompted text (0% flagged on the anti-AI corpus) and catches 100% of cliché-LLM text.

Engine: none
Speed: instant
desklib %flagged: 0.0% (n=200)
claude-judge %flagged: not reported
Status: live in API

humarin (T5-base paraphraser)

Local T5 paraphraser plus post-processing. Adds em-dashes and word substitutions. Acts as an anti-humanizer on anti-AI source: it makes 48% of careful-prompted texts look more AI-generated to desklib. Mean drop is -0.49; a negative drop means the desklib score rose.

Engine: humarin/chatgpt_paraphraser_on_T5_base (local, MPS)
Speed: ~5s per text
desklib %flagged: 48.0% (n=200)
claude-judge %flagged: not reported
Status: live in API (try in /humanise)

DIPPER-XXL (11B paraphraser, best-of-3)

Large paraphraser from the DIPPER paper (Krishna et al., 2023), built to evade AI-text detectors. Best of 3 samples per text. Partial run (105/200). Mean drop is -0.109, about 4× less destructive than humarin on anti-AI source.

Engine: kalpeshk2011/dipper-paraphraser-xxl (local, bf16, MPS)
Speed: ~60s per text × 3 samples
desklib %flagged: 4.8% (n=105)
claude-judge %flagged: not reported
Status: live in API (try in /humanise)

Claude-rewrite (7 tactics)

Claude Opus 4.7 rewrites in diversified styles (practitioner / news / op-ed / case-study / polemic / conversational / scholarly). The output is invisible to desklib (0.0% flagged), but the Claude-judge ensemble flags 73.5% of it. Each tactic creates its own cross-corpus signature.

Tactic	n	mean desklib	desklib %flagged
A	32	0.005	0.0%
B	30	0.004	0.0%
C	28	0.011	0.0%
D	29	0.006	0.0%
E	24	0.006	0.0%
F	27	0.001	0.0%
G	30	0.017	0.0%

Engine: Claude Opus 4.7 via subagent dispatch
Speed: ~30s per text
desklib %flagged: 0.0% (n=200)
claude-judge %flagged: 73.5% (n=200)
Status: research-only

Claude-translator T1-DE (round-trip)

Claude Opus 4.7 translates EN→DE→EN via subagent dispatch, preserving named entities (NEs) verbatim. The round trip uses Claude, not NLLB. Output is 91-100% of source length. Mean drop is 0.000. Sample N=20. Quality far exceeds NLLB-200-distilled-600M, which gave a 0.43× length ratio plus repetition loops on the same texts.

Engine: Claude Opus 4.7 via 4 parallel subagents
Speed: ~30s per text
desklib %flagged: 0.0% (n=20)
claude-judge %flagged: 35.0% (n=20)
Status: research-only

Claude-translator T2-ZH (Chinese pivot)

Claude Opus 4.7 round-trips EN→Simplified Chinese→EN. The distant pivot forces deeper syntactic restructuring and is more aggressive than the German pivot. NEs are preserved verbatim. Length is 1.01-1.06× source. Mean delta from baseline is +0.003, with 0.0% flagged by desklib. Sample N=20.

Engine: Claude Opus 4.7 via 4 parallel subagents
Speed: ~50s per text
desklib %flagged: 0.0% (n=20)
claude-judge %flagged: 35.0% (n=20)
Status: research-only

C. Methodology

Every method runs against the same 200 careful-prompted English texts (the "claude-anti-ai" corpus, ~600 words each, drawn from compliance / IT / science domains). Each output is scored by both detectors. The %flagged column reports the share of texts whose detector score is ≥ 0.5; the mean is the average raw score across all texts in that cell's sample.
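The two summary numbers described above can be computed as in the following sketch. It assumes detector scores are floats in [0, 1]; the function name `summarize_scores` and the example scores are made up for illustration.

```python
def summarize_scores(scores, threshold=0.5):
    """Return sample size, %flagged (score >= threshold) and mean raw score."""
    if not scores:
        raise ValueError("empty sample")
    flagged = sum(1 for s in scores if s >= threshold)
    return {
        "n": len(scores),
        "pct_flagged": 100.0 * flagged / len(scores),
        "mean_score": sum(scores) / len(scores),
    }


# Hypothetical detector scores for four texts; a score of exactly 0.5 counts as flagged.
summary = summarize_scores([0.02, 0.51, 0.5, 0.10])
print(summary)
```

Note that %flagged and the mean can diverge: a method can leave the mean low while a few texts still cross the threshold.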

Sample sizes (n=…) differ between methods because DIPPER and Claude-judge are still mid-batch. We publish partial results rather than wait for full coverage, so numbers will move as batches complete.
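How far a partial result may still move can be gauged with a confidence interval on the flagged proportion. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something the benchmark reports. The counts are reconstructed from the published percentages (35.0% of n=20 is 7 texts, 73.5% of n=200 is 147 texts).

```python
import math


def wilson_interval(flagged, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if n <= 0 or not 0 <= flagged <= n:
        raise ValueError("need n > 0 and 0 <= flagged <= n")
    p = flagged / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


small = wilson_interval(7, 20)      # 35.0% flagged at n=20
large = wilson_interval(147, 200)   # 73.5% flagged at n=200
print(f"n=20:  {small[0]:.3f} to {small[1]:.3f}")
print(f"n=200: {large[0]:.3f} to {large[1]:.3f}")
```

The n=20 interval is several times wider than the n=200 one, which is why the small-sample rows should be read as provisional.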

The full write-up (per-method breakdown, cost analysis, failure modes, cascade results) lives in the final benchmark report.

D. Verdict

+---------------------------+---------------------+---------------------+
| Category                  | Winner              | Number              |
+---------------------------+---------------------+---------------------+
| Best vs cliché-LLM        | desklib detector    | 100% recall on AI   |
| Best vs careful-prompted  | Claude-judge        | 49% recall baseline |
|                           |                     | 73.5% post-rewrite  |
| Best humanizer (desklib)  | Claude-rewrite (7G) | 0.0% flagged        |
| Worst (anti-humanizer)    | humarin             | 48.0% flagged       |
| Best speed/cost ratio     | humarin             | ~5s, free, local    |
| Largest model tested      | DIPPER-XXL (11B)    | ~60s × 3 samples    |
| Most defensive (research) | Claude-rewrite      | 7 tactics A..G      |
+---------------------------+---------------------+---------------------+

Reading note: on desklib, "best" means the lowest %flagged, i.e. the method leaves careful-prompted text undetected. On Claude-judge, "best" is inverted: a high %flagged on careful-prompted source means the detector recognises the rewrite as machine-shaped, which is what you want from a detector you sell.

