2026-05-23 CASE STUDY #1 ~6 min

What would it take to make this LLM benchmark claim falsifiable?

A representative 2025 LLM technical report claims 87.2% accuracy on a popular reasoning benchmark. Five of the nine PRML falsifiability criteria are met; four are not. This is what a €15,000 Audit Review engagement adds, and what it deliberately leaves open. Anonymized to keep the focus on format, not publisher.

The claim, as published

The published source includes, in roughly this shape:

"Our model achieves 87.2 on [reasoning benchmark], surpassing the prior leader by 2.4 points. Evaluated on the full public test split with default settings; we report exact-match accuracy."

A reader can recognise: a metric (exact-match accuracy), a value (87.2), a dataset name (the benchmark), and a directional claim (higher is better). What a reader cannot recognise from the same paragraph: when the threshold of "good enough to publish" was fixed, which build of the model was evaluated, which version of the benchmark, and which seeds drove the sampling.

This is the modal LLM technical report claim in late 2025. It is not malicious, it is not unusually weak; it is typical.

Scoring against the PRML falsifiability criteria

The Integrity Index uses nine criteria. The representative claim above scores roughly:

| Criterion | Present? | What's in the source |
| --- | --- | --- |
| metric_named | yes | exact-match accuracy |
| value_given | yes | 87.2 |
| dataset_named | yes | benchmark name + "full public test split" |
| dataset_hash | no | no content hash of the evaluation file |
| model_version_pinned | no | model family named, no build SHA or checkpoint hash |
| threshold_direction | yes | higher-is-better is implicit; explicit "surpassing" framing |
| sample_size | no | "full split" stated, count not given |
| seed_published | no | no seed or evaluation harness configuration |
| pre_registered | partial | threshold direction was implicit before the result; the specific value was reported after observation |

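Score: 5 of 9, which matches the table only if the partial pre_registered entry is counted as met. This puts the claim in the middle of the Integrity Index distribution, alongside most well-known 2024-2025 LLM technical reports. The publisher did most of what the field currently rewards. The remaining four criteria (dataset_hash, model_version_pinned, sample_size, seed_published) are not field-standard, which is exactly what PRML is for.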
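The tally can be reproduced with a few lines of Python. This is a minimal sketch, not the Integrity Index's own scoring code: the statuses are transcribed from the table above, and the `partial_counts` switch is a hypothetical parameter added here because the source does not state how a "partial" entry is weighted.

```python
# Status of each criterion, transcribed from the scoring table.
CRITERIA = {
    "metric_named": "yes",
    "value_given": "yes",
    "dataset_named": "yes",
    "dataset_hash": "no",
    "model_version_pinned": "no",
    "threshold_direction": "yes",
    "sample_size": "no",
    "seed_published": "no",
    "pre_registered": "partial",
}


def score(criteria, partial_counts=True):
    """Count criteria treated as met.

    ASSUMPTION: whether "partial" counts as met is not specified in the
    source; partial_counts=True is the reading that reproduces 5 of 9.
    """
    met = {"yes", "partial"} if partial_counts else {"yes"}
    return sum(status in met for status in criteria.values())


total = len(CRITERIA)
missing = [name for name, status in CRITERIA.items() if status == "no"]

print(f"{score(CRITERIA)} of {total}")                        # 5 of 9
print(f"{score(CRITERIA, partial_counts=False)} of {total}")  # 4 of 9
print("not met:", ", ".join(missing))
```

Under the stricter reading the same claim scores 4 of 9, so the weighting of partial entries is worth stating wherever a score is quoted.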

What a PRML manifest would anchor

The Audit Review tier (€15,000, 5 business days, 100% async) produces a PRML manifest that locks the missing four criteria into a cryptographic commitment. Reconstructed from the published source, the manifest looks like this:

version: prml/0.2
claim_id: 01900000-0000-7000-8000-000000000001
created_at: '2025-06-15T14:30:00Z'
metric: exact_match_accuracy
comparator: '>='
threshold: 87.0
dataset:
  id: reasoning-benchmark-test-v1.2
  hash: sha256:b4a7c891fefb...
  split: test
  uri: https://huggingface.co/datasets/.../v1.2
model:
  id: example-llm-v2.3.0
  hash: sha256:d2f9a01acc4e...
sample_size: 1319
seed: 42
producer:
  id: example-org.example.com
notes: |
  Threshold of 87.0 fixed pre-evaluation per internal review on 2025-06-15.
  Threshold direction confirmed in advance; specific reported value 87.2
  observed post-evaluation against frozen test split v1.2.
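
The comparator and threshold fields are what make the claim mechanically checkable: an observed value either satisfies the committed condition or it does not. The sketch below illustrates that check in Python under stated assumptions. The function name `claim_holds` is hypothetical, and only the `>=` comparator appears in the manifest above; the other comparator symbols are illustrative guesses, not taken from the PRML spec.

```python
import operator

# ASSUMPTION: only '>=' is confirmed by the example manifest.
# The remaining entries are illustrative, not spec-defined.
COMPARATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


def claim_holds(manifest, observed):
    """Return True if the observed metric satisfies the committed condition."""
    compare = COMPARATORS[manifest["comparator"]]
    return compare(observed, manifest["threshold"])


# The three fields that define the falsifiable condition in the manifest.
manifest = {
    "metric": "exact_match_accuracy",
    "comparator": ">=",
    "threshold": 87.0,
}

print(claim_holds(manifest, 87.2))  # True: the reported value clears 87.0
print(claim_holds(manifest, 86.9))  # False: a re-run at 86.9 would falsify it
```

The point of the design is that the condition is fixed before the observation, so a later re-run has a concrete value to fail against.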

The manifest is then anchored at registry.falsify.dev: a content-addressed permalink whose SHA-256 any third party can re-derive from the canonical bytes alone.

Once the manifest exists and the hash is public, four things become detectable that previously were not:

  1. Threshold drift. If the threshold is silently raised or lowered after publication, the hash changes. The original commitment remains on the registry, and the difference is mechanically visible.
  2. Dataset swap. If the dataset content changes (a re-release of v1.2 with different examples, for instance), the dataset hash mismatch surfaces immediately during re-derivation.
  3. Model substitution. If a subsequent build is presented as the same evaluated artifact, the model hash mismatch flags it. PRML does not prevent substitution at runtime, but it makes silent substitution detectable in audit.
  4. Pre-commitment evidence. The created_at timestamp plus an external anchor (git commit, registry receipt, RFC 3161, or Sigstore Rekor entry) establishes when the threshold was fixed, which §8.1 of the spec distinguishes from producer-declared time.
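
The first three detections in the list above amount to a field-by-field comparison between the committed manifest and whatever is presented later. The following Python sketch shows one way to do that comparison. The function names and the hash strings are hypothetical placeholders (the real hashes in the manifest are truncated and are not reconstructed here), and manifests are assumed to be already parsed into dictionaries.

```python
import copy

# Manifest fields to watch, mapped to the kind of change they indicate.
WATCHED_FIELDS = {
    ("threshold",): "threshold drift",
    ("comparator",): "threshold drift",
    ("dataset", "hash"): "dataset swap",
    ("model", "hash"): "model substitution",
}


def get_path(doc, path):
    """Follow a tuple of keys into a nested dictionary."""
    for key in path:
        doc = doc[key]
    return doc


def detect_changes(committed, presented):
    """List every watched field that differs between the two manifests."""
    findings = []
    for path, label in WATCHED_FIELDS.items():
        before = get_path(committed, path)
        after = get_path(presented, path)
        if before != after:
            findings.append(f"{label}: {'.'.join(path)} {before!r} -> {after!r}")
    return findings


# Placeholder hashes for illustration only; not the real manifest values.
committed = {
    "threshold": 87.0,
    "comparator": ">=",
    "dataset": {"hash": "sha256:<dataset-hash-A>"},
    "model": {"hash": "sha256:<model-hash-A>"},
}
presented = copy.deepcopy(committed)
presented["threshold"] = 86.5
presented["model"]["hash"] = "sha256:<model-hash-B>"

findings = detect_changes(committed, presented)
for finding in findings:
    print(finding)
```

In practice any change to these fields also changes the manifest hash, so this comparison explains a mismatch rather than replacing the hash check.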

What the audit memo says (6-8 pages)

The deliverable for the Audit Review tier is a written memo, not a call. The memo follows a fixed structure so a reviewer can scan it:

1. Claim summary (1 paragraph)

Recap of what was claimed, what was published, who the producer is, and which audience the claim is being defended in front of (notified body, internal MRM, customer security review, paper reviewer).

2. PRML manifest, annotated (1 page)

The full canonical YAML, with one-line annotations explaining each field's source: which field came from the public paper, which from an internal artifact the producer provided, which from a registry receipt. Every field has provenance.

3. Threat model under §8.1 (2-3 pages)

What the manifest defends against, claim by claim:

4. Regulator-mappable evidence (1-2 pages)

If the engagement names a target regulatory framework (EU AI Act, NIST AI RMF, ISO/IEC 42001), the memo includes a control-by-control mapping showing which clauses the manifest hash supplies evidence for. Each mapping carries the interpretive-mapping disclaimer published on the public crosswalk pages.

5. Re-derivation instructions (½ page)

A 30-50 line Python script the customer's auditor or regulator can run independently. Inputs: the manifest text. Outputs: the SHA-256 hash. If the hash matches the registry permalink, the commitment is verified; if not, the manifest has been tampered with.
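
A minimal sketch of what such a script could look like follows. It is not the script shipped with the memo. The canonicalization rule used here (UTF-8, LF line endings, exactly one trailing newline) is an assumption made for illustration; the PRML spec defines the actual canonical form. The `sha256:` prefix mirrors the hash fields in the manifest, and the function names are hypothetical.

```python
import hashlib


def canonical_bytes(manifest_text):
    """Normalize manifest text to bytes before hashing.

    ASSUMPTION: UTF-8, LF line endings, exactly one trailing newline.
    The real canonical form is defined by the PRML spec, not by this sketch.
    """
    normalized = manifest_text.replace("\r\n", "\n").replace("\r", "\n")
    return (normalized.rstrip("\n") + "\n").encode("utf-8")


def manifest_sha256(manifest_text):
    """Re-derive the content hash from the manifest text alone."""
    return "sha256:" + hashlib.sha256(canonical_bytes(manifest_text)).hexdigest()


def verify(manifest_text, registry_hash):
    """True if the re-derived hash equals the hash published on the registry."""
    return manifest_sha256(manifest_text) == registry_hash


# Illustrative fragment, not a complete manifest.
text = "version: prml/0.2\nmetric: exact_match_accuracy\nthreshold: 87.0\n"
published = manifest_sha256(text)  # stands in for the registry permalink hash

print(verify(text, published))                                   # True
print(verify(text.replace("87.0", "86.5"), published))           # False
print(verify(text.replace("\n", "\r\n"), published))             # True
```

The last line shows why canonicalization matters: a manifest saved with Windows line endings still verifies, while a one-character change to the threshold does not.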

6. Residual risk & recommended next steps (1 page)

What the manifest does not close, named without euphemism, and what the next-quarter or next-engagement options are. For most representative claims this includes: pair with Sigstore (cookbook Pattern 11) for execution attestation, or pair with commit-reveal validation (Pattern 13) for independence attestation, depending on the threat model.

What this engagement does not do

The Audit Review tier is one claim, written deliverables, 5 business days. It does not include:

If the producer of the claim disagrees with the audit memo's framing of a residual-risk point, the Audit Review's 5-business-day post-delivery Q&A window includes one written round of objection and response. The memo is then either revised or the customer's disagreement is noted as a published appendix. No defensive negotiation; the memo states what the artifact supports.

Cost, timeline, output

To recover the public version of this engagement on a real claim:

What you are buying. An auditable artifact with cryptographic re-derivation, not a marketing badge or a compliance stamp. The artifact survives Studio 11: the spec is CC BY 4.0, and the registry is content-addressed and outlives any single maintainer. If we close shop tomorrow, your manifest hash still verifies.

How to start

Email [email protected] with subject [Audit Review] and one paragraph naming the claim. We reply within one business day with a 1-page scope confirmation and the invoice. From there it's wire transfer and start.

Cüneyt Öztürk — falsify track lead, Studio 11. [email protected]. This case study is CC BY 4.0. Anonymization preserves the structure of a real, well-known 2025 LLM technical report claim without naming the publisher; the format choices described are not unique to one organization and the framing is meant to apply broadly. The Integrity Index scores 25+ named public claims if you want the unanonymized version.