Comparison · 2026-05-07 · spec author write-up

PRML vs adjacent primitives.

PRML is a small open spec. It is not the only thing that touches commitment, provenance, evaluation, or pre-registration. This page is a candid note from the spec's author about where PRML overlaps with existing work, where it differs, and where the right answer is to use them together.

The one-line version: PRML commits an ML evaluation claim (threshold, metric, dataset split) to a SHA-256 hash before the run. Other primitives anchor software artifacts (Sigstore, in-toto, SLSA), describe model properties (Model Cards), curate benchmarks (HELM), or register study designs (OSF). They are largely complementary, not substitutes.
| Primitive | What it anchors | Relationship |
| --- | --- | --- |
| PRML | An ML evaluation claim (threshold + metric + dataset split + model version) committed before the run | Subject of this comparison |
| in-toto | Software supply-chain steps (who built what, in what order) | Complementary · embed the PRML hash as a step input/output |
| SLSA | Software artifact provenance levels (build trustworthiness) | Complementary · a SLSA L3 build can produce the model whose eval PRML commits |
| Sigstore | Software artifacts signed under OIDC identity, time-stamped in a transparency log | Complementary · wrap a PRML hash inside a Sigstore attestation |
| Model Cards | Free-form model documentation (intended use, limits, evaluation results) | Complementary · cite a PRML hash in the evaluation section |
| HELM | Curated benchmark suite with standard metrics | Complementary · HELM run-metadata can include a PRML commitment |
| OSF / ClinicalTrials.gov | Centralised study pre-registration (with state-anchored timestamps for trials) | Conceptual analogue · PRML is the decentralised ML version; the hash itself is the anchor |
| NeurIPS / MLRC checklists | Author-completed protocol declarations (what was done, what was reported) | Complementary · checklists describe, PRML proves |
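
To make the one-line version concrete, here is a minimal sketch of the commitment step in Python. The manifest field names and the canonicalisation (sorted-key JSON) are illustrative assumptions, not the normative PRML schema; the spec defines the real field set and encoding rules.

```python
import hashlib
import json

# Hypothetical claim manifest. Field names are illustrative placeholders,
# not the normative PRML schema.
claim = {
    "model": "fraud-scorer",
    "model_version": "2.4.1",
    "metric": "auroc",
    "threshold": 0.91,
    "dataset_split": "holdout-2026-q1",
}

# Canonicalise before hashing so the same claim always yields the same
# digest: sorted keys, no insignificant whitespace.
canonical = json.dumps(claim, sort_keys=True, separators=(",", ":"))
commitment = hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Publish this digest somewhere durable *before* the run; disclose the
# manifest afterwards so anyone can re-derive and compare.
print(f"sha256:{commitment}")
```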

vs in-toto

in-toto is a framework for cryptographically capturing software supply-chain steps — which person or builder ran which step in which order, and what artifacts flowed between them. It originated at NYU's Secure Systems Lab and has matured into widely used industrial tooling.

The overlap is conceptual: both produce signed records of "this happened before that." The difference is target. in-toto records build steps; PRML records evaluation claims. A PRML manifest could plausibly appear as an input artifact or output product within an in-toto step record, but PRML doesn't try to capture the build pipeline that produced the model — only the claim about how the model was supposed to perform.

PRML target: "This evaluation claim — threshold, metric, dataset split — was fixed before the model ran."
in-toto target: "This binary was produced by this builder running these steps in this order, signed."
Together: PRML inside an in-toto layout gives the full chain from build to commitment.
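
A sketch of where the PRML manifest might sit inside an in-toto step record. The link structure below is deliberately simplified (real link metadata is generated and signed by in-toto tooling and carries more fields); the only point is that the manifest rides as an ordinary hashed artifact of the evaluate step.

```python
import hashlib
import json

def sha256_hex(path: str) -> str:
    """Digest a file the way in-toto records artifact hashes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Simplified shape of an in-toto link for an "evaluate" step; real links
# are produced and signed by in-toto tooling. The PRML manifest appears
# as an ordinary material (input) of the step.
link = {
    "_type": "link",
    "name": "evaluate",
    "materials": {
        "model.bin": {"sha256": sha256_hex("model.bin")},
        "prml-manifest.json": {"sha256": sha256_hex("prml-manifest.json")},
    },
    "products": {
        "eval-report.json": {"sha256": sha256_hex("eval-report.json")},
    },
}
print(json.dumps(link, indent=2))
```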

vs SLSA

SLSA (Supply-chain Levels for Software Artifacts) is a Google-originated framework that defines provenance levels (L1-L4) for build trustworthiness. It standardises what a build system has to attest to in order to claim each level.

SLSA answers "is this binary trustworthy?". PRML answers "was this evaluation claim fixed in advance?". The two address different stages: SLSA validates the artifact's origin; PRML validates the protocol bound to the artifact's evaluation. A model produced by a SLSA L3 pipeline can have its evaluation claim PRML-committed; both attestations apply, neither replaces the other.

vs Sigstore

Sigstore signs and time-stamps software artifacts using OIDC-backed identities and a transparency log (Rekor). It removes the operational pain of long-lived signing keys.

Sigstore is identity-attested signing; PRML is content-addressed commitment. They operate at different layers and compose cleanly: a PRML manifest hash can be wrapped inside a Sigstore attestation so the receipt has both a content-anchor (the SHA-256) and an identity-anchor (the signer). For institutional clients that want both "this claim was committed in advance" and "this organisation took responsibility for committing it", Sigstore + PRML is a natural pairing.
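
A sketch of that pairing, assuming the PRML manifest is the subject of an in-toto Attestation Statement, the envelope format Sigstore attestations use. Signing the statement and logging it in Rekor is done by Sigstore tooling such as cosign and is omitted here; the predicate type URI and predicate fields are hypothetical, not a registered schema.

```python
import json

# Placeholder for the hex digest committed before the run.
prml_digest = "0" * 64

# in-toto Attestation Statement wrapping the PRML commitment. A Sigstore
# tool (e.g. cosign) would sign this and record it in the Rekor
# transparency log, adding the identity anchor and timestamp.
statement = {
    "_type": "https://in-toto.io/Statement/v1",
    "subject": [
        {"name": "prml-manifest.json", "digest": {"sha256": prml_digest}},
    ],
    # Hypothetical predicate type, not a registered schema.
    "predicateType": "https://example.org/prml-commitment/v1",
    "predicate": {"note": "evaluation claim fixed before the run"},
}
print(json.dumps(statement, indent=2))
```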

vs Model Cards

Model Cards (Mitchell et al., 2019) are structured-but-flexible documents capturing a model's intended use, training data, ethical considerations, and evaluation results. The Hugging Face implementation makes them widely available; the Google AI Hub variant is more enterprise-oriented.

Model Cards describe; PRML proves. The "evaluation results" section of a Model Card is a natural surface to embed a PRML hash, turning a textual claim ("our accuracy is 0.76 on ImageNet-Val") into a re-derivable receipt ("here is the SHA-256 we committed to before running, and here is the permalink anyone can verify against"). The two are not in conflict — PRML is the cryptographic anchor that gives the Model Card's evaluation section verifiability.
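
Verification is the mirror image of commitment: re-derive and compare. A minimal sketch, assuming the Model Card cites the digest, the author later discloses the manifest, and the verifier uses the same canonicalisation the committer used (whatever the spec mandates).

```python
import hashlib
import json

def verify_prml_commitment(manifest_path: str, cited_digest: str) -> bool:
    """Recompute the commitment from a disclosed manifest and compare it
    to the digest cited in a Model Card's evaluation section."""
    with open(manifest_path, "r", encoding="utf-8") as f:
        manifest = json.load(f)
    # Must match the canonicalisation used at commit time.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    derived = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return derived == cited_digest.removeprefix("sha256:")
```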

vs HELM

HELM (Holistic Evaluation of Language Models, Stanford CRFM) is a curated benchmark suite that evaluates LLMs across many scenarios with standardised metrics. The HELM dashboard publishes per-model scores; the codebase manages the harness.

HELM is downstream of any commitment. PRML is upstream: it commits which scenarios, metrics, and thresholds a user is binding themselves to before running the harness. A HELM run-metadata payload could optionally include a PRML hash; the two solve different parts of the eval rigor surface (HELM = comprehensive eval coverage; PRML = pre-run commitment receipt).
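
To make "could optionally include" concrete: a sketch of a run-metadata payload carrying the commitment next to whatever the harness already records. The prml_commitment field is an assumption for illustration, not part of HELM's schema.

```python
import json

run_metadata = {
    # Illustrative fields a harness might already record:
    "suite": "helm-core",
    "scenario": "mmlu",
    "model": "example-org/model-7b",
    # Assumed extension, not part of HELM's schema: the digest the
    # evaluator committed to before launching the run.
    "prml_commitment": "sha256:" + "0" * 64,  # placeholder digest
}
print(json.dumps(run_metadata, indent=2))
```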

vs OSF / ClinicalTrials.gov pre-registration

The closest conceptual analogue. ClinicalTrials.gov (US government) and OSF (Open Science Framework, non-profit) both implement centralised pre-registration — researchers submit study designs in advance, a registry timestamps them, and reviewers can later verify that the published study matches the registered design.

PRML is the decentralised ML version. Three differences:

  1. Anchor: a registry timestamps and stores the design for you; PRML's anchor is the hash itself, publishable anywhere durable (a repository, a transparency log, a paper).
  2. Payload: registries accept largely free-text study designs; a PRML manifest is a machine-readable claim (threshold, metric, dataset split, model version) that anyone can re-hash.
  3. Trust: verifying against a registry means trusting the registry's timestamp; verifying a PRML commitment means re-deriving the SHA-256 yourself, with identity and timestamping layered on via Sigstore if wanted.

vs Reproducibility checklists

NeurIPS and MLRC reproducibility checklists are author-completed protocol declarations: yes/no/N/A questions about what the authors did and reported.

Checklists are self-attestation in natural language. PRML is a cryptographic record. Checklists capture intent and process; PRML captures a specific commitment that can be re-derived independently. Both are valuable — a paper that completes the NeurIPS checklist and includes a PRML hash in its evaluation section is strictly stronger than one that does either alone.

What PRML does not compete with

PRML deliberately stays narrow: it does not run evaluations (HELM and the harnesses do), does not capture build provenance (in-toto, SLSA), does not sign artifacts or attest identity (Sigstore), does not document models (Model Cards), and does not judge whether a committed claim was a sensible one. It proves exactly one thing: that a specific claim existed, byte for byte, before the run.

If you're building eval-rigor infrastructure

Most teams will combine three or four of these primitives:

  1. Build / artifact: SLSA + Sigstore for the model artifact itself.
  2. Documentation: Model Card describing intended use, limits, and evaluation strategy.
  3. Evaluation commitment: PRML hash committed before each named claim, embedded in the Model Card's evaluation section and referenced in any paper or dashboard.
  4. Coverage: HELM, BIG-bench, lm-evaluation-harness for breadth.
  5. Reporting hygiene: NeurIPS / MLRC checklist for the paper.

None of these replace each other. PRML is the smallest possible piece; if your stack already covers the others, dropping in a PRML hash is a one-line addition that closes a specific gap (post-hoc threshold tuning) without restructuring anything.

Read the spec →