PRML Integrity Index · 2026-Q2 · v1

25 ML eval claims. 9 falsifiability criteria. One scorecard.

A structural reading of how well-known evaluation claims — proprietary and open — match the minimum-hygiene format PRML defines. The score is mechanical: it counts whether the claim's canonical public source names a metric, gives a numeric value, names a dataset, pins a dataset hash or version, pins a model version, states a threshold direction, gives a sample size, publishes a seed, and records a pre-registration date. Nothing else.

What this is. A heuristic audit of how the published format of each claim compares to the eight-field PRML manifest. Two claims with the same accuracy can score very differently here — that's the point. The score reflects what the publisher chose to record, not whether the model is good.

What this is not. Not a moral ranking. Not an accusation. Not authoritative. A claim scoring 3/9 may be more useful in practice than one scoring 8/9. PRML §8.1 names the limit: a high score means the claim is checkable, not that it is true.

Disagree with a score? Open an issue at github.com/studio-11-co/falsify-integrity-index with the public source link and the field you think we missed. Re-scoring is cheap.
At a glance

  - Entries: 25 claims, public sources only
  - Median score: of 9 criteria
  - Top quartile: claims at ≥7/9
  - Bottom quartile: claims at <4/9

The nine criteria

  1. Metric named. The score has a label (accuracy, refusal-rate, F1).
  2. Numeric value. A scalar is given, not a vague "state-of-the-art".
  3. Dataset named. The eval set is identifiable (HumanEval, GPQA, MMLU).
  4. Dataset hash / version pin. A specific revision or content hash is recorded.
  5. Model version pinned. Not "GPT-4" — a build, date, or revision.
  6. Threshold direction stated. "≥ 0.95" not "around 95%".
  7. Sample size given. N for the eval run, not just "evaluated thoroughly".
  8. Seed published. The RNG state, when applicable.
  9. Pre-registration date. A timestamp showing the threshold was set before the run.
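Because each criterion is a presence check, not a quality judgment, the whole scorecard reduces to counting recorded fields. A minimal sketch of that check, assuming a hypothetical claim record — the field names below are illustrative, not the real falsify.dev schema:

```python
# Hypothetical nine-criteria presence check.
# Field names are illustrative, not the actual checker's schema.

CRITERIA = {
    "metric_named":        lambda c: bool(c.get("metric")),
    "numeric_value":       lambda c: isinstance(c.get("value"), (int, float)),
    "dataset_named":       lambda c: bool(c.get("dataset")),
    "dataset_pinned":      lambda c: bool(c.get("dataset_hash")),
    "model_pinned":        lambda c: bool(c.get("model_revision")),
    "threshold_direction": lambda c: c.get("direction") in (">=", "<="),
    "sample_size":         lambda c: isinstance(c.get("n"), int) and c["n"] > 0,
    "seed_published":      lambda c: c.get("seed") is not None,
    "preregistered":       lambda c: bool(c.get("prereg_date")),
}

def score(claim: dict) -> int:
    """Count how many of the nine fields the published claim records."""
    return sum(check(claim) for check in CRITERIA.values())

# A claim that names its metric, value, dataset, model pin, direction,
# and sample size, but omits the dataset hash, seed, and pre-registration:
claim = {
    "metric": "accuracy",
    "value": 0.91,
    "dataset": "GPQA",
    "model_revision": "2026-03-01",
    "direction": ">=",
    "n": 448,
}
print(score(claim))  # 6 of 9
```

Note the check never inspects the value itself; a claim of 0.01 accuracy with all nine fields recorded outscores an unpinned 0.99.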

Have a claim you want scored?

Two paths. Open an issue with the public source link, or run the heuristic checker yourself at falsify.dev/check.

Pre-registering your own next eval is faster than arguing about someone else's — eight YAML fields, one SHA-256.
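The mechanics of that pre-registration step can be sketched in a few lines: write the manifest, hash its exact bytes, publish the digest with a timestamp before the run. The YAML field names here are hypothetical stand-ins for the PRML manifest fields, and the dataset hash is a placeholder:

```python
import hashlib

# Hypothetical manifest; real PRML field names may differ, and the
# dataset_hash below is a placeholder, not a real digest.
manifest = """\
metric: accuracy
direction: '>='
threshold: 0.95
dataset: GPQA
dataset_hash: sha256:0000000000000000000000000000000000000000000000000000000000000000
model_revision: my-model-2026-05-01
n: 448
seed: 1234
"""

# Hash the exact bytes of the manifest. Publishing this digest (with a
# timestamp) before the eval run lets anyone later re-hash the manifest
# and confirm the threshold was fixed in advance.
digest = hashlib.sha256(manifest.encode("utf-8")).hexdigest()
print(digest)
```

Any edit to the manifest after publication — even whitespace — changes the digest, which is what makes the pre-registration checkable.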

The Index is a snapshot dated 2026-Q2. Each entry's score reflects the canonical public source as of the link cited. Vendors revise model cards and papers; we re-score on receipt of an updated public source. The Index is not affiliated with, endorsed by, or written in cooperation with any of the listed publishers. CC0; mirror, fork, dispute.