25 ML eval claims. 9 falsifiability criteria. One scorecard.
A structural reading of how well-known evaluation claims, proprietary and open, match the minimum-hygiene format PRML defines. The score is mechanical: it counts whether the claim's canonical public source gives a named metric, a numeric value, a named dataset, a dataset hash or version pin, a model version, a threshold direction, a sample size, a seed, and a pre-registration date. Nothing else.
What this is not. Not a moral ranking. Not an accusation. Not authoritative. A claim scoring 3/9 may be more useful in practice than one scoring 8/9. PRML §8.1 names the limit: a high score means the claim is checkable, not that it is true.
Disagree with a score? Open an issue at github.com/studio-11-co/falsify-integrity-index with the public source link and the field you think we missed. Re-scoring is cheap.
The nine criteria
- Metric named. The score has a label (accuracy, refusal-rate, F1).
- Numeric value. A scalar is given, not a vague "state-of-the-art".
- Dataset named. The eval set is identifiable (HumanEval, GPQA, MMLU).
- Dataset hash / version pin. A specific revision or content hash is recorded.
- Model version pinned. Not "GPT-4" — a build, date, or revision.
- Threshold direction stated. "≥ 0.95" not "around 95%".
- Sample size given. N for the eval run, not just "evaluated thoroughly".
- Seed published. The RNG state, when applicable.
- Pre-registration date. A timestamp showing the threshold was set before the run.
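Because the score is a plain count, it can be sketched in a few lines. The following is a minimal illustration, not the Index's actual checker: the `Claim` class, its field names, and the `score` function are hypothetical names chosen to mirror the nine criteria above, and a field counts as present whenever it is not `None`.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Claim:
    """One evaluation claim; field names are illustrative, not a PRML schema."""
    metric: Optional[str] = None               # e.g. "accuracy"
    value: Optional[float] = None              # the reported scalar
    dataset: Optional[str] = None              # e.g. "HumanEval"
    dataset_hash: Optional[str] = None         # revision or content hash
    model_version: Optional[str] = None        # build, date, or revision
    threshold_direction: Optional[str] = None  # e.g. ">="
    sample_size: Optional[int] = None          # N for the eval run
    seed: Optional[int] = None                 # RNG state, when applicable
    preregistered_at: Optional[str] = None     # timestamp set before the run


def score(claim: Claim) -> int:
    """Count how many of the nine fields the public source supplies."""
    return sum(getattr(claim, f.name) is not None for f in fields(claim))


# A claim that names a metric, a value, and a dataset, and nothing else.
example = Claim(metric="accuracy", value=0.95, dataset="HumanEval")
print(f"{score(example)}/9")  # prints "3/9"
```

This sketch ignores the "when applicable" carve-out for seeds and does not judge whether a supplied value is specific enough (for instance, whether a model version is a real pin rather than a family name); those calls still need a reader.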
Have a claim you want scored?
Two paths. Open an issue with the public source link, or run the heuristic checker yourself at falsify.dev/check.
Pre-registering your own next eval is faster than arguing about someone else's — eight YAML fields, one SHA-256.
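To make "fields plus one SHA-256" concrete, here is a rough sketch of the idea under loud assumptions: the eight field names below are invented for illustration and are not PRML's actual schema, and the canonicalization (sorted `key: value` lines, UTF-8 encoded) is a stand-in for whatever PRML specifies. The point is only that a digest published before the run commits you to the threshold.

```python
import hashlib


def manifest_digest(manifest: dict[str, object]) -> str:
    """SHA-256 over sorted 'key: value' lines (an assumed canonical form)."""
    lines = [f"{key}: {manifest[key]}" for key in sorted(manifest)]
    payload = "\n".join(lines).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical pre-registration manifest; every name and value is a placeholder.
manifest = {
    "metric": "accuracy",
    "comparator": ">=",
    "threshold": 0.95,
    "dataset": "example-eval-set",
    "dataset_hash": "0" * 64,
    "model_version": "example-model-2026-01-01",
    "sample_size": 1000,
    "seed": 42,
}

digest = manifest_digest(manifest)
print(digest)  # publish this before running the eval
```

Changing any field after the fact, such as lowering the threshold to 0.90, yields a different digest, which is what makes a quietly moved goalpost detectable.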
The Index is a snapshot dated 2026-Q2. Each entry's score reflects the canonical public source as of the link cited. Vendors revise model cards and papers; we re-score on receipt of an updated public source. The Index is not affiliated with, endorsed by, or written in cooperation with any of the listed publishers. CC0; mirror, fork, dispute.