# Final benchmark — humaniser.eu corpus-v2 (200 claude-anti-ai texts)

**Date:** 2026-05-14
**Author:** Alexander Brichkin (Agonist Development AB)
**Source:** 200 carefully-prompted claude-anti-ai texts (corpus-v2)
**Detectors:** desklib v1.01 (Layer 1a) + Claude-Opus-judge ensemble of 7 (Layer 1b)

---

## 1. Methods compared (all paraphrasers / humanizers on the same 200 source texts)

| # | Method | Engine | Cost / 200 texts |
|---|---|---|---|
| 1 | Baseline (no paraphrasing) | — | $0 |
| 2 | humarin paraphrase + postprocess | local T5-base 880MB | $0 (~5s/text) |
| 3 | Claude-rewrite (7 tactics A-G) | Opus 4.7 via 7 subagents | $0 (subscription) |

---

## 2. Headline results

### Detector: desklib v1.01 (Layer 1a)

| Method | Mean score | Median | % flagged AI (≥0.5) |
|---|---|---|---|
| Baseline (no paraphrasing) | 0.006 | 0.002 | 0.0% |
| humarin paraphrase | 0.500 | 0.486 | 48.0% |
| **Claude-rewrite (7 tactics)** | **0.007** | **0.002** | **0.0%** |
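
The three columns above can be derived from a list of per-text detector scores. The following is a minimal sketch, assuming scores are floats in [0, 1] and that "flagged" means a score at or above 0.5, as the table header states. The function name `summarize` and the example scores are illustrative, not taken from the benchmark code or data.

```python
from statistics import mean, median


def summarize(scores, threshold=0.5):
    """Return mean, median and percent of scores at or above the threshold."""
    flagged = sum(1 for s in scores if s >= threshold)
    return {
        "mean": round(mean(scores), 3),
        "median": round(median(scores), 3),
        "pct_flagged": round(100.0 * flagged / len(scores), 1),
    }


# Illustrative scores only, not benchmark data.
example_scores = [0.1, 0.2, 0.5, 0.9]
print(summarize(example_scores))
```

Note that a score of exactly 0.5 counts as flagged under the `≥0.5` rule.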

### Detector: Claude-judge ensemble (Layer 1b, 7 Opus subagents averaged)

| Method | Mean score | Per-agent mean range (min..max) | % flagged AI (mean ≥0.5) |
|---|---|---|---|
| Baseline (no paraphrasing) | 0.422 | 0.21..0.62 across 7 agents | 49.0% |
| **Claude-rewrite (7 tactics)** | **0.590** | 0.366..0.76 across 7 agents | **73.5%** |
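
A text's ensemble score here is the average of the 7 judge scores, and the text is flagged when that average reaches 0.5. The sketch below shows that rule under those assumptions. The function names and the seven example scores are hypothetical; they were chosen to sit inside the baseline per-agent range, not copied from the run.

```python
from statistics import mean


def ensemble_score(agent_scores):
    """Average the per-agent scores for one text."""
    return mean(agent_scores)


def is_flagged(agent_scores, threshold=0.5):
    """Flag the text when the ensemble mean is at or above the threshold."""
    return ensemble_score(agent_scores) >= threshold


# Hypothetical scores from 7 judges for a single text.
one_text = [0.21, 0.35, 0.40, 0.48, 0.55, 0.60, 0.62]

# How many judges would have flagged the text if called alone.
single_votes = sum(1 for s in one_text if s >= 0.5)

print(f"ensemble mean: {ensemble_score(one_text):.3f}")
print(f"flagged by ensemble: {is_flagged(one_text)}")
print(f"judges flagging alone: {single_votes} of {len(one_text)}")
```

In this made-up case three of the seven judges would flag the text on their own while the ensemble does not, which is the kind of disagreement that averaging is meant to absorb.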

---

## 3. Method-by-method breakdown

### Method 1: Baseline (no paraphrasing)
- desklib mean: 0.006 → **0.0% flagged**
- Claude-judge ensemble mean: 0.422 → **49.0% flagged**
- Interpretation: desklib alone is blind to careful-prompted text. Claude-judge ensemble catches roughly half.

### Method 2: humarin paraphrase
- desklib mean: 0.500 (mean increase over baseline: +0.493)
- **Effect on the source style: NEGATIVE.** humarin's stylistic markers (em-dashes, word substitutions) trigger desklib, which was blind to the original. On careful-prompted text humarin acts as an anti-humanizer.
- 0/200 texts benefited from humarin. 48.0% (96/200) are now flagged as AI by desklib, up from 0.0% at baseline.

### Method 3: Claude-rewrite — overall and per-tactic
- desklib mean: 0.007 → **0% flagged** (as invisible to the surface classifier as the baseline, which scores 0.006)
- Claude-judge mean: 0.590 → **73.5% flagged** (up from 0.422 and 49.0% at baseline)

**Per-tactic effectiveness on Claude-judge ensemble** (lower = better evasion):

| Tactic | N | Mean Claude-judge score | desklib % flagged |
|---|---|---|---|
| A | 32 | **0.584** | 0% |
| B | 30 | **0.599** | 0% |
| C | 28 | **0.636** | 0% |
| D | 29 | **0.633** | 0% |
| E | 24 | **0.543** | 0% |
| F | 27 | **0.461** | 0% |
| G | 30 | **0.657** | 0% |

Tactic legend: A=practitioner, B=news-feature, C=op-ed, D=case-study, E=polemic, F=conversational, G=scholarly
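
As a consistency check, the per-tactic rows can be recombined into the overall figure. The sketch below copies N and the mean score for each tactic from the table above, computes the N-weighted mean, and ranks tactics from lowest to highest Claude-judge score. Variable names are illustrative.

```python
# (N, mean Claude-judge score) per tactic, copied from the table above.
tactics = {
    "A": (32, 0.584),
    "B": (30, 0.599),
    "C": (28, 0.636),
    "D": (29, 0.633),
    "E": (24, 0.543),
    "F": (27, 0.461),
    "G": (30, 0.657),
}

total_n = sum(n for n, _ in tactics.values())
weighted_mean = sum(n * score for n, score in tactics.values()) / total_n

# Lower score = better evasion, so sort ascending.
ranked = sorted(tactics, key=lambda t: tactics[t][1])

print(total_n)                  # 200
print(f"{weighted_mean:.3f}")   # 0.590
print(ranked)                   # ['F', 'E', 'A', 'B', 'D', 'C', 'G']
```

The weighted mean reproduces the 0.590 ensemble mean reported for Claude-rewrite, and the tactic counts sum to the 200 source texts.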

---

## 4. Key insights

1. **Layer 1a (desklib) + Layer 1b (Claude-judge) are complementary, not redundant.** Their agreement on claude-anti-ai is only ~51% in the baseline; Claude-judge sees what desklib misses.

2. **humarin is NOT a universal humanizer.** On cliché-LLM source it reduces detection; on careful-prompted source it INCREASES detection. A source-quality classifier should gate humarin in production.

3. **Claude-rewrite (7 tactics) is the only rewriting method that keeps desklib at 0/200 flagged on careful source**, matching the unmodified baseline. But it does NOT beat the Claude-judge ensemble: the flag rate rises from 49.0% to 73.5%, and every tactic still produces a recognizable cross-corpus signature.

4. **Prescriptive prompting always creates signatures.** Each of the 7 tactics produced its own scaffold (D: "Case study" / "Background"; C: "My thesis"; A: "we thought we understood X"; G: "is reported to" / "is necessary"; etc.) that Claude-judge can identify cross-corpus. Suppressing one set of phrases ("Critics dispute", "Three things") created a new set.

5. **Inter-judge variance is large.** Per-agent mean scores on the same texts range from 0.21 to 0.62 on the baseline and from 0.366 to 0.76 on the rewrites. An ensemble is essential; a single Claude-judge call is too noisy.

6. **Best tactics against Claude-judge: F (conversational, 0.461) and E (polemic, 0.543).** Worst: G (scholarly/academic register, 0.657).
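
The ~51% agreement figure in insight 1 follows from the baseline flag rates: desklib flags 0 of 200 texts and the Claude-judge ensemble flags 49.0% (98 of 200), so the two layers agree only on the 102 texts neither flags. A minimal sketch of that arithmetic, assuming agreement means "same flagged/unflagged verdict on the same text". The function name is illustrative, and the ordering of the flags is arbitrary because desklib flags nothing.

```python
def agreement_rate(flags_a, flags_b):
    """Fraction of texts on which two detectors give the same verdict."""
    if len(flags_a) != len(flags_b):
        raise ValueError("flag lists must cover the same texts")
    matches = sum(1 for a, b in zip(flags_a, flags_b) if a == b)
    return matches / len(flags_a)


n_texts = 200
desklib_flags = [False] * n_texts          # baseline: 0.0% flagged
judge_flags = [True] * 98 + [False] * 102  # baseline: 49.0% flagged

print(agreement_rate(desklib_flags, judge_flags))  # 0.51
```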

---

## 5. Product implications

- **Tier-aligned architecture:** Free tier uses desklib only (cheap, catches cliché-LLM). Studio+/Lab+ tier uses ensemble (desklib + Claude-judge ensemble of 5). Different recall, different price points.
- **Honest marketing:** "We catch what surface detectors miss — see calibration cards." Do not claim 99%. Publish the N=200 benchmark with per-tactic recall.
- **Humanizer pipeline:** A source-quality classifier runs first and routes cliché-LLM input to humarin and careful-prompted input to Claude-rewrite. A single-pass humanizer is not safe across input distributions.
- **Detector V2 training:** the 200 claude-anti-ai texts + their 200 Claude-rewrite outputs = 400 hard training examples (AI-written texts that desklib scores as human) for the desklib successor.
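
The humanizer routing described in the list above could be sketched as follows. This is an assumption-heavy illustration, not the production design: the source-quality classifier does not exist yet, so `classify_source`, its `"cliche"` / `"careful"` labels, and the two humanizer callables are all hypothetical stand-ins passed in by the caller.

```python
from typing import Callable


def route_humanizer(
    text: str,
    classify_source: Callable[[str], str],
    humarin: Callable[[str], str],
    claude_rewrite: Callable[[str], str],
) -> str:
    """Pick a humanizer based on the (hypothetical) source-quality label."""
    label = classify_source(text)
    if label == "cliche":
        return humarin(text)
    if label == "careful":
        return claude_rewrite(text)
    # Refuse to guess: an unknown label should not fall through to a default.
    raise ValueError(f"unknown source-quality label: {label!r}")


# Stub callables for demonstration only.
def demo_classifier(text: str) -> str:
    return "cliche" if "delve" in text else "careful"


routed = route_humanizer(
    "Let us delve into the topic.",
    classify_source=demo_classifier,
    humarin=lambda t: "[humarin] " + t,
    claude_rewrite=lambda t: "[claude-rewrite] " + t,
)
print(routed)
```

Raising on an unknown label, rather than defaulting to one humanizer, reflects the benchmark finding that the wrong humanizer can raise detection instead of lowering it.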

---

## 6. Outstanding work
- **DIPPER batch** on 200 claude-anti-ai (running, ETA ~2.5h). Will show whether DIPPER differs from humarin on careful-prompted source.
- **default-ai humarin/DIPPER/Claude-rewrite** — after gemma finishes corpus (currently 707/2000) — to show the opposite distribution where humarin/DIPPER are net-positive.
- **Cascade pipelines (humarin→DIPPER, humarin→Claude, DIPPER→Claude)** — pending. May surface counter-cascades that help on careful source.
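
A cascade such as humarin→DIPPER is a left-to-right composition of text-to-text stages. The sketch below shows one way the pending cascade runs could be wired, assuming each stage is a callable that takes and returns a string. The `cascade` helper and the stub stages are hypothetical and do not call humarin, DIPPER or Claude.

```python
from functools import reduce
from typing import Callable

Stage = Callable[[str], str]


def cascade(*stages: Stage) -> Stage:
    """Compose stages so the output of each feeds the next, left to right."""
    def run(text: str) -> str:
        return reduce(lambda acc, stage: stage(acc), stages, text)
    return run


# Stub stages for demonstration only.
def stage_one(text: str) -> str:
    return text + " |one"


def stage_two(text: str) -> str:
    return text + " |two"


pipeline = cascade(stage_one, stage_two)
print(pipeline("source"))  # source |one |two
```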