What would it take to make this LLM benchmark claim falsifiable?
A representative 2025 LLM technical report claims 87.2% accuracy on a popular reasoning benchmark. Five of the nine PRML falsifiability criteria are met; four are not. This is what a €15,000 Audit Review engagement adds, and what it deliberately leaves open. The example is anonymized to keep the focus on format, not publisher.
The claim, as published
The published source includes, in roughly this shape:
"Our model achieves 87.2 on [reasoning benchmark], surpassing the prior leader by 2.4 points. Evaluated on the full public test split with default settings; we report exact-match accuracy."
A reader can recognise: a metric (exact-match accuracy), a value (87.2), a dataset name (the benchmark), and a directional claim (higher is better). What a reader cannot recognise from the same paragraph: when the threshold of "good enough to publish" was fixed, which build of the model was evaluated, which version of the benchmark, and which seeds drove the sampling.
This is the modal LLM technical report claim in late 2025. It is not malicious, it is not unusually weak; it is typical.
Scoring against the PRML falsifiability criteria
The Integrity Index uses nine criteria. The representative claim above scores roughly:
| Criterion | Present? | What's in the source |
|---|---|---|
| metric_named | yes | exact-match accuracy |
| value_given | yes | 87.2 |
| dataset_named | yes | benchmark name + "full public test split" |
| dataset_hash | no | no content hash of the evaluation file |
| model_version_pinned | no | model family named, no build SHA or checkpoint hash |
| threshold_direction | yes | higher-is-better is implicit; explicit "surpassing" framing |
| sample_size | no | "full split" stated, count not given |
| seed_published | no | no seed or evaluation harness configuration |
| pre_registered | partial | threshold direction was implicit before the result; the specific value was reported after observation |
Score: 5 of 9, counting the partial pre_registered as met alongside the four clear yeses. This puts the claim in the middle of the Integrity Index distribution, alongside most well-known 2024-2025 LLM technical reports. The publisher did most of what the field currently rewards. The remaining four criteria are not field-standard, which is exactly what PRML is for.
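The tally can be reproduced in a few lines. This is a minimal sketch, not the Integrity Index's own scoring code: the statuses are transcribed from the table, and counting "partial" as met is an assumption, chosen because it is the only reading under which the table sums to 5.

```python
# Criterion statuses transcribed from the table above.
CRITERIA = {
    "metric_named": "yes",
    "value_given": "yes",
    "dataset_named": "yes",
    "dataset_hash": "no",
    "model_version_pinned": "no",
    "threshold_direction": "yes",
    "sample_size": "no",
    "seed_published": "no",
    "pre_registered": "partial",
}


def score(criteria, partial_counts=True):
    """Count met criteria.

    Treating 'partial' as met is an assumption made to reproduce the
    5-of-9 figure; it is not a documented scoring rule.
    """
    met = {"yes", "partial"} if partial_counts else {"yes"}
    return sum(1 for status in criteria.values() if status in met)


def missing(criteria):
    """Names of the criteria the source does not satisfy at all."""
    return [name for name, status in criteria.items() if status == "no"]


print(score(CRITERIA), "of", len(CRITERIA))
print(missing(CRITERIA))
```

Under the stricter reading (`partial_counts=False`) the same table scores 4 of 9, which is why the partial row is worth stating explicitly.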
What a PRML manifest would anchor
The Audit Review tier (€15,000, 5 business days, 100% async) produces a PRML manifest that locks the missing four criteria into a cryptographic commitment. Reconstructed from the published source, the manifest looks like this:
    version: prml/0.2
    claim_id: 01900000-0000-7000-8000-000000000001
    created_at: '2025-06-15T14:30:00Z'
    metric: exact_match_accuracy
    comparator: '>='
    threshold: 87.0
    dataset:
      id: reasoning-benchmark-test-v1.2
      hash: sha256:b4a7c891fefb...
      split: test
      uri: https://huggingface.co/datasets/.../v1.2
    model:
      id: example-llm-v2.3.0
      hash: sha256:d2f9a01acc4e...
    sample_size: 1319
    seed: 42
    producer:
      id: example-org.example.com
    notes: |
      Threshold of 87.0 fixed pre-evaluation per internal review on 2025-06-15.
      Threshold direction confirmed in advance; specific reported value 87.2 observed post-evaluation against frozen test split v1.2.
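Two of the manifest's fields can be sanity-checked with plain arithmetic. The sketch below is illustrative only: the comparator table is an assumed set (the manifest above shows only '>='), and it assumes the reported 87.2 is a percentage rounded to one decimal place over the stated sample_size of 1319.

```python
import operator
from decimal import Decimal, ROUND_HALF_UP

# Assumed comparator set; only '>=' appears in the manifest above.
COMPARATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


def claim_holds(observed, comparator, threshold):
    """True if the observed value satisfies the committed threshold."""
    return COMPARATORS[comparator](observed, threshold)


def consistent_correct_counts(reported_pct, sample_size):
    """Every count of correct answers that rounds to reported_pct (1 d.p.)."""
    target = Decimal(str(reported_pct))
    hits = []
    for k in range(sample_size + 1):
        pct = (Decimal(k) * 100 / Decimal(sample_size)).quantize(
            Decimal("0.1"), rounding=ROUND_HALF_UP
        )
        if pct == target:
            hits.append(k)
    return hits


print(claim_holds(87.2, ">=", 87.0))
print(consistent_correct_counts(87.2, 1319))
```

With the sample size pinned, the rounded headline figure maps back to a single count of correct answers (1150 of 1319). Without it, a reader cannot make that check at all.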
The manifest is then anchored at registry.falsify.dev: a content-addressed permalink whose SHA-256 any third party can re-derive from the canonical bytes alone.
Once the manifest exists and the hash is public, four things become detectable that previously were not:
- Threshold drift. If the threshold is silently raised or lowered after publication, the hash changes. The original commitment remains on the registry, and the difference is mechanically visible.
- Dataset swap. If the dataset content changes (a re-release of v1.2 with different examples, for instance), the dataset hash mismatch surfaces immediately during re-derivation.
- Model substitution. If a subsequent build is presented as the same evaluated artifact, the model hash mismatch flags it. PRML does not prevent substitution at runtime, but it makes silent substitution detectable in audit.
- Pre-commitment evidence. The created_at timestamp plus an external anchor (git commit, registry receipt, RFC 3161, or Sigstore Rekor entry) establishes when the threshold was fixed, which §8.1 of the spec distinguishes from producer-declared time.
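Once a hash mismatch has flagged a change, a field-level comparison shows which commitment moved. The following is a rough sketch that assumes both manifests have already been parsed into nested dicts; the function names and the placeholder hash values in the example are hypothetical, not part of PRML.

```python
def flatten(manifest, prefix=""):
    """Flatten nested dicts into dotted paths, e.g. 'dataset.hash'."""
    flat = {}
    for key, value in manifest.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def changed_fields(committed, presented):
    """Dotted paths whose values differ between the two manifests."""
    a, b = flatten(committed), flatten(presented)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Placeholder values for illustration only.
committed = {
    "threshold": 87.0,
    "dataset": {"hash": "sha256:aaa"},
    "model": {"hash": "sha256:bbb"},
}
presented = {
    "threshold": 86.5,                 # silently lowered
    "dataset": {"hash": "sha256:aaa"},
    "model": {"hash": "sha256:ccc"},   # different build
}
print(changed_fields(committed, presented))
```

The diff is a convenience for the reader of an audit. The evidence is still the hash mismatch against the registry entry, since a dict comparison proves nothing about which version was committed first.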
What the audit memo says (6-8 pages)
The deliverable for the Audit Review tier is a written memo, not a call. The memo follows a fixed structure so a reviewer can scan it:
1. Claim summary (1 paragraph)
Recap of what was claimed, what was published, who the producer is, and which audience the claim is being defended in front of (notified body, internal MRM, customer security review, paper reviewer).
2. PRML manifest, annotated (1 page)
The full canonical YAML, with one-line annotations explaining each field's source: which field came from the public paper, which from an internal artifact the producer provided, which from a registry receipt. Every field has provenance.
3. Threat model under §8.1 (2-3 pages)
What the manifest defends against, claim by claim:
- Retroactive threshold tuning: detectable, hash mismatch.
- Dataset content swap after publication: detectable, dataset hash mismatch.
- Selective sample omission within the same dataset: not detectable without companion attestation (see Pattern 11 / 13 in the cookbook); residual risk noted.
- Model build substitution: detectable in audit if the checkpoint hash is captured; not preventable at evaluation runtime.
- Selective publication of one claim out of many: explicitly out of scope per §8.1; flagged for reader awareness.
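The dataset-swap check in the second bullet reduces to hashing the evaluation file and comparing it with the committed value. This is a minimal sketch under two assumptions not confirmed by the source: that the dataset hash covers a single file's raw bytes, and that it is written with a `sha256:` prefix as in the manifest. The function names are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's raw bytes, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()


def matches_commitment(path, committed_hash):
    """True if the file on disk has the hash recorded in the manifest."""
    return file_digest(path) == committed_hash
```

A match says only that the file is byte-identical to the committed one. It says nothing about which rows were actually scored, which is why selective sample omission stays on the residual-risk list.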
4. Regulator-mappable evidence (1-2 pages)
If the engagement names a target regulatory framework (EU AI Act, NIST AI RMF, ISO/IEC 42001), the memo includes a control-by-control mapping showing which clauses the manifest hash supplies evidence for. Each mapping carries the interpretive-mapping disclaimer published on the public crosswalk pages.
5. Re-derivation instructions (½ page)
A 30-50 line Python script the customer's auditor or regulator can run independently. Input: the manifest text. Output: its SHA-256 hash. If the hash matches the registry permalink, the commitment is verified; if not, the manifest text differs from the bytes that were committed.
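The core of such a script is short. The sketch below is not the delivered script: it assumes the manifest text in hand is already in canonical form (the spec's canonicalisation rules are not reproduced here), that it is encoded as UTF-8, and that the registry value may carry a `sha256:` prefix. It needs Python 3.9 or later for `str.removeprefix`.

```python
import hashlib
import hmac


def manifest_hash(manifest_text: str) -> str:
    """Hex SHA-256 of the manifest text, assumed canonical and UTF-8."""
    return hashlib.sha256(manifest_text.encode("utf-8")).hexdigest()


def verify(manifest_text: str, registry_hash: str) -> bool:
    """Compare the re-derived hash with the one published on the registry."""
    expected = registry_hash.lower().removeprefix("sha256:")
    return hmac.compare_digest(manifest_hash(manifest_text), expected)
```

Any byte-level difference changes the result, including a trailing newline or different line endings, so the text has to be hashed exactly as committed rather than re-serialised from a parsed copy.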
6. Residual risk & recommended next steps (1 page)
What the manifest does not close, named without euphemism, and what the next-quarter or next-engagement options are. For most representative claims this includes: pair with Sigstore (cookbook Pattern 11) for execution attestation, or pair with commit-reveal validation (Pattern 13) for independence attestation, depending on the threat model.
What this engagement does not do
The Audit Review tier is one claim, written deliverables, 5 business days. It does not include:
- CI pipeline deployment (that's the Full Sprint tier)
- Isolated private registry instance (Enterprise tier)
- Multiple claims, multi-repo coverage, or framework-personalized crosswalks (Full Sprint or Enterprise)
- Phone calls, video meetings, or scheduled briefings (all engagements are 100% async by default)
- An assertion that the underlying claim is true. PRML proves the claim was committed; it does not prove the result. The audit memo is explicit on this.
If the producer of the claim disagrees with the audit memo's framing of a residual-risk point, the Audit Review's 5-business-day post-delivery Q&A window includes one written round of objection and response. The memo is then either revised or the customer's disagreement is noted as a published appendix. No defensive negotiation; the memo states what the artifact supports.
Cost, timeline, output
To commission the engagement described here on a real claim:
- Price: €15,000, single invoice from Studio 11 Türkiye Ltd. Şti. (USD / GBP also accepted at invoice time)
- Timeline: 5 business days from payment receipt to memo delivery
- Output: PRML manifest (canonical YAML) + registry permalink + 6-8 page audit memo + re-derivation script + 5-day written Q&A window
- Method: wire transfer to a Türkiye-based corporate bank account; SWIFT details on the invoice
- Engagement scope: one published or about-to-publish claim of your choice; you choose the regulatory framing or skip it
How to start
Email [email protected] with subject [Audit Review] and one paragraph naming the claim. You will receive a reply within one business day with a 1-page scope confirmation and the invoice. From there it's wire transfer and start.