Methodology

Each card is computed on a held-out evaluation set drawn from the RAID-style multilingual corpus, augmented with an arXiv academic-prose subset for English and translation-equivalent subsets for the other nine languages. The operating threshold is fixed at a 5% false-positive rate (FPR) on the calibration window. Scores that fall in the abstain band between the two thresholds are returned for manual review rather than forced into a classification.
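As a rough illustration of how a fixed-FPR threshold and an abstain band can work together, here is a minimal Python sketch. It is not the production calibration code. It assumes that higher scores mean "more likely to be flagged", that the calibration window supplies scores for samples that should not be flagged, and that the lower edge of the abstain band is set separately. The function names, the verdict labels, and the value 0.60 are all hypothetical.

```python
from typing import List


def threshold_at_fpr(negative_scores: List[float], target_fpr: float = 0.05) -> float:
    """Return a threshold t such that at most target_fpr of the negative
    (should-not-be-flagged) calibration samples score strictly above t."""
    if not negative_scores:
        raise ValueError("need at least one negative calibration score")
    ranked = sorted(negative_scores, reverse=True)
    # Number of false positives the target rate allows on this window.
    allowed = int(target_fpr * len(ranked))
    # Flagging only scores strictly above ranked[allowed] flags at most
    # `allowed` of the negative samples.
    return ranked[min(allowed, len(ranked) - 1)]


def verdict(score: float, lower: float, upper: float) -> str:
    """Three-way decision: scores inside [lower, upper] abstain."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score > upper:
        return "flagged"          # hypothetical label
    if score < lower:
        return "cleared"          # hypothetical label
    return "manual_review"        # abstain band


# Toy calibration window: 100 evenly spaced negative scores.
calibration = [i / 100 for i in range(100)]
upper = threshold_at_fpr(calibration, target_fpr=0.05)
lower = 0.60  # hypothetical lower edge of the abstain band

for s in (0.10, 0.80, 0.97):
    print(s, verdict(s, lower, upper))
```

The strict comparison against the upper threshold keeps the empirical false-positive rate on the calibration window at or below the target even when several calibration scores tie at the threshold.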

We publish the calibration card before the verdict. Each scan returns a receipt that references the SHA of the card used to produce it. An auditor can fetch that card, confirm the model SHA, and reproduce the verdict against the published thresholds. The sample size shown on each card counts the held-out evaluation set, not the full training corpus.
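The audit steps described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual receipt format: it assumes the card SHA is a SHA-256 digest over a canonical JSON encoding of the card, and every field name (`card_sha`, `model_sha`, `lower_threshold`, `upper_threshold`, `score`, `verdict`) and verdict label is hypothetical.

```python
import hashlib
import json


def card_digest(card: dict) -> str:
    """Hash a card over a canonical JSON encoding (assumed scheme)."""
    canonical = json.dumps(card, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit(receipt: dict, card: dict, expected_model_sha: str) -> bool:
    """Check a receipt against a fetched card.

    Returns True only if the card matches the receipt, the card names
    the expected model, and the verdict is reproducible from the
    card's thresholds.
    """
    # 1. The fetched card must be the one the receipt references.
    if card_digest(card) != receipt["card_sha"]:
        return False
    # 2. The card must name the model the auditor expects.
    if card["model_sha"] != expected_model_sha:
        return False
    # 3. Recompute the verdict from the published thresholds.
    score = receipt["score"]
    if score > card["upper_threshold"]:
        expected = "flagged"
    elif score < card["lower_threshold"]:
        expected = "cleared"
    else:
        expected = "manual_review"
    return expected == receipt["verdict"]


# Hypothetical card and receipt, for illustration only.
card = {"model_sha": "m1", "lower_threshold": 0.60, "upper_threshold": 0.94}
receipt = {"card_sha": card_digest(card), "score": 0.97, "verdict": "flagged"}
print(audit(receipt, card, expected_model_sha="m1"))
```

Hashing a canonical encoding (sorted keys, fixed separators) means two auditors who fetch the same card derive the same digest regardless of key order in the file they downloaded.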

English (en)

abc123
AUROC @ FPR = 5%: 94.0%
Sample size: 4,500
Last calibrated:

German (de)

abc124
AUROC @ FPR = 5%: 91.0%
Sample size: 3,800
Last calibrated:

French (fr)

abc125
AUROC @ FPR = 5%: 90.0%
Sample size: 3,600
Last calibrated:

Spanish (es)

abc126
AUROC @ FPR = 5%: 90.0%
Sample size: 3,500
Last calibrated:

Italian (it)

abc127
AUROC @ FPR = 5%: 89.0%
Sample size: 3,200
Last calibrated:

Dutch (nl)

abc128
AUROC @ FPR = 5%: 88.0%
Sample size: 2,900
Last calibrated:

Polish (pl)

abc129
AUROC @ FPR = 5%: 87.0%
Sample size: 2,700
Last calibrated:

Portuguese (pt)

abc12a
AUROC @ FPR = 5%: 88.0%
Sample size: 2,800
Last calibrated:

Romanian (ro)

abc12b
AUROC @ FPR = 5%: 85.0%
Sample size: 2,100
Last calibrated:

Lower sample density; v0.4 calibration window will increase coverage.

Swedish (sv)

abc12c
AUROC @ FPR = 5%: 86.0%
Sample size: 2,200
Last calibrated: