# PRML vs adjacent primitives
PRML is a small open spec. It is not the only thing that touches commitment, provenance, evaluation, or pre-registration. This page is a candid note from the spec's author about where PRML overlaps with existing work, where it differs, and where the right answer is to use them together.
| Primitive | What it anchors | Relationship |
|---|---|---|
| PRML | An ML evaluation claim (threshold + metric + dataset split + model version) committed before the run | — |
| in-toto | Software supply-chain steps (who built what, in what order) | Complementary · embed PRML hash as step input/output |
| SLSA | Software artifact provenance levels (build trustworthiness) | Complementary · SLSA L3 build can produce the model whose eval PRML commits |
| Sigstore | Software artifacts signed under OIDC identity, time-stamped in transparency log | Complementary · wrap a PRML hash inside a Sigstore attestation |
| Model Cards | Free-form model documentation (intended use, limits, evaluation results) | Complementary · cite a PRML hash in the evaluation section |
| HELM | Curated benchmark suite with standard metrics | Complementary · HELM run-metadata can include a PRML commitment |
| OSF / ClinicalTrials.gov | Centralised study pre-registration (with state-anchored timestamps for trials) | Conceptual analogue · PRML is the decentralised ML version, the hash itself is the anchor |
| NeurIPS / MLRC checklists | Author-completed protocol declarations (what was done, what was reported) | Complementary · checklists describe, PRML proves |
## vs in-toto
in-toto is a framework for cryptographically capturing software supply-chain steps — which person or builder ran which step in which order, and what artifacts flowed between them. It originated at NYU's Secure Systems Lab and has matured into widely used industrial tooling.
The overlap is conceptual: both produce signed records of "this happened before that." The difference is target. in-toto records build steps; PRML records evaluation claims. A PRML manifest could plausibly appear as an input artifact or output product within an in-toto step record, but PRML doesn't try to capture the build pipeline that produced the model — only the claim about how the model was supposed to perform.
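As a sketch of that composition, here is the rough shape of an evaluation-step record in the spirit of in-toto link metadata, with the PRML manifest listed among the input materials so the commitment travels with the supply-chain evidence. File names are hypothetical, and the real schema, signing, and layout verification come from in-toto itself; this only shows where the manifest would slot in.

```python
import hashlib
import json
from pathlib import Path

def sha256_hex(path: str) -> str:
    """SHA-256 of a file's bytes, hex-encoded."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Illustrative step record only; not in-toto's actual signed link format.
eval_step_record = {
    "name": "run-eval",
    "materials": {
        "prml-manifest.json": {"sha256": sha256_hex("prml-manifest.json")},  # pre-run commitment
        "model.onnx": {"sha256": sha256_hex("model.onnx")},                  # model under evaluation
    },
    "products": {
        "eval-results.json": {"sha256": "<filled-in-after-the-run>"},
    },
}
print(json.dumps(eval_step_record, indent=2))
```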
## vs SLSA
SLSA (Supply-chain Levels for Software Artifacts) is a Google-originated framework that defines provenance levels for build trustworthiness (Build L0–L3 in the current v1.0 spec; L1–L4 in the earlier v0.1). It standardises what a build system has to attest to in order to claim each level.
SLSA answers "is this binary trustworthy?". PRML answers "was this evaluation claim fixed in advance?". The two address different stages: SLSA validates the artifact's origin; PRML validates the protocol bound to the artifact's evaluation. A model produced by a SLSA L3 pipeline can have its evaluation claim PRML-committed; both attestations apply, neither replaces the other.
## vs Sigstore
Sigstore signs and time-stamps software artifacts using OIDC-backed identities and a transparency log (Rekor). It removes the operational pain of long-lived signing keys.
Sigstore is identity-attested signing; PRML is content-addressed commitment. They operate at different layers and compose cleanly: a PRML manifest hash can be wrapped inside a Sigstore attestation so the receipt has both a content-anchor (the SHA-256) and an identity-anchor (the signer). For institutional clients that want both "this claim was committed in advance" and "this organisation took responsibility for committing it", Sigstore + PRML is a natural pairing.
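A minimal sketch of that pairing, assuming hypothetical file names and a hypothetical predicate type URI: the payload below is the kind of in-toto Statement a Sigstore attestation wraps, with the model artifact as subject and the PRML manifest hash in the predicate. Signing it, building the DSSE envelope, and logging to Rekor are done by a Sigstore client such as cosign and are not shown here.

```python
import hashlib
import json
from pathlib import Path

def sha256_hex(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Attestation payload: subject = the model artifact, predicate = the PRML commitment.
# The predicate type URI is a placeholder; a real deployment would pick and document one.
statement = {
    "_type": "https://in-toto.io/Statement/v1",
    "subject": [
        {"name": "model.onnx", "digest": {"sha256": sha256_hex("model.onnx")}}
    ],
    "predicateType": "https://example.org/prml-commitment/v1",
    "predicate": {"prml_manifest_sha256": sha256_hex("prml-manifest.json")},
}
print(json.dumps(statement, indent=2))
```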
## vs Model Cards
Model Cards (Mitchell et al., 2019) are structured-but-flexible documents capturing a model's intended use, training data, ethical considerations, and evaluation results. The Hugging Face implementation makes them widely available; the Google AI Hub variant is more enterprise-oriented.
Model Cards describe; PRML proves. The "evaluation results" section of a Model Card is a natural surface to embed a PRML hash, turning a textual claim ("our accuracy is 0.76 on ImageNet-Val") into a re-derivable receipt ("here is the SHA-256 we committed to before running, and here is the permalink anyone can verify against"). The two are not in conflict — PRML is the cryptographic anchor that gives the Model Card's evaluation section verifiability.
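For illustration only, a hypothetical fragment of a Model Card's evaluation section; the hash and permalink are placeholders, and the exact wording is up to the card's author.

```markdown
## Evaluation

Accuracy on ImageNet-Val: 0.76.
Pre-registered claim: PRML manifest SHA-256 `<64-hex-digit-hash>`,
committed before the run; verify against `<permalink-to-manifest-bytes>`.
```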
## vs HELM
HELM (Holistic Evaluation of Language Models, Stanford CRFM) is a curated benchmark suite that evaluates LLMs across many scenarios with standardised metrics. The HELM dashboard publishes per-model scores; the open-source codebase provides the evaluation harness.
HELM is downstream of any commitment. PRML is upstream: it commits which scenarios, metrics, and thresholds a user is binding themselves to before running the harness. A HELM run-metadata payload could optionally include a PRML hash; the two solve different parts of the eval rigor surface (HELM = comprehensive eval coverage; PRML = pre-run commitment receipt).
## vs OSF / ClinicalTrials.gov pre-registration
The closest conceptual analogue. ClinicalTrials.gov (US government) and OSF (Open Science Framework, non-profit) both implement centralised pre-registration — researchers submit study designs in advance, a registry timestamps them, and reviewers can later verify that the published study matches the registered design.
PRML is the decentralised ML version. Three differences:
- No central authority. The SHA-256 hash itself is the anchor. It can be committed to registry.falsify.dev, a public git repo, an arXiv preprint, an in-toto layout, a blockchain, or wherever else the publisher wants. Anyone can re-derive the hash from the manifest bytes.
- Designed for ML eval shape. The 8 fields cover ML-specific commitments (metric, threshold, dataset split, model version) that don't map cleanly onto OSF's clinical-trial-shaped templates.
- Cryptographic verifiability. ClinicalTrials.gov relies on the registry being trustworthy. PRML doesn't require trusting the registry — only the hash function (SHA-256) and the manifest bytes themselves.
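To make "the hash itself is the anchor" concrete, here is a minimal verification sketch in Python. It assumes nothing beyond what the list above states: the verifier holds the manifest bytes exactly as published (the filename `prml-manifest.json` and the `committed_digest` value are placeholders) and compares a freshly computed SHA-256 against the digest that was published before the run.

```python
import hashlib
from pathlib import Path

def rederive_prml_hash(manifest_path: str) -> str:
    """Re-derive the commitment from the manifest bytes exactly as published."""
    manifest_bytes = Path(manifest_path).read_bytes()
    return hashlib.sha256(manifest_bytes).hexdigest()

# The digest that was published before the run (git history, registry entry,
# preprint, ...).  Placeholder value here.
committed_digest = "<64-hex-digit-sha256>"

if rederive_prml_hash("prml-manifest.json") != committed_digest:
    raise SystemExit("Manifest bytes do not match the pre-run commitment.")
print("Commitment verified: manifest bytes match the published SHA-256.")
```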
## vs Reproducibility checklists
NeurIPS and MLRC reproducibility checklists are author-completed protocol declarations: yes/no/N/A questions about what the authors did and reported.
Checklists are self-attestation in natural language. PRML is a cryptographic record. Checklists capture intent and process; PRML captures a specific commitment that can be re-derived independently. Both are valuable: a paper that completes the NeurIPS checklist and includes a PRML hash in its evaluation section is strictly stronger than one that does either alone.
## What PRML does not compete with
PRML deliberately stays narrow:
- Not a benchmark. It commits to whatever metric the user picks; it doesn't curate the metrics. HELM, BIG-bench, lm-evaluation-harness, etc. live one layer down.
- Not an audit framework. Section 8.1 of the spec acknowledges PRML doesn't solve selective publication. Pre-register ten claims, publish two — PRML can't see that.
- Not a compliance certification. EU AI Act, NIST AI RMF, ISO/IEC frameworks define standards; PRML provides one primitive (a tamper-evident eval receipt) those frameworks can cite.
- Not a model evaluation service. Studio 11's Diagnostic Sprint is a paid engagement that uses PRML; the spec itself is open and free.
## If you're building eval-rigor infrastructure
Most teams will combine three or four of these primitives:
- Build / artifact: SLSA + Sigstore for the model artifact itself.
- Documentation: Model Card describing intended use, limits, and evaluation strategy.
- Evaluation commitment: PRML hash committed before each named claim, embedded in the Model Card's evaluation section and referenced in any paper or dashboard.
- Coverage: HELM, BIG-bench, lm-evaluation-harness for breadth.
- Reporting hygiene: NeurIPS / MLRC checklist for the paper.
None of these primitives replaces the others. PRML is the smallest possible piece; if your stack already covers the others, dropping in a PRML hash is a one-line addition that closes a specific gap (post-hoc threshold tuning) without restructuring anything.
Read the spec →