PRML · the open standard for pre-registered ML eval claims · CC BY 4.0

falsify

Lock the threshold before the run.
Then prove it.

Your model passed. Can you prove the success criteria were not changed after the test?
PRML — the open standard falsify implements — makes that provable: 9 fields, one hash, CI exits 3 on tamper.
It proves the bar was locked. Never the result.

The discipline is optional. The hash is not.

For compliance leads under the EU AI Act high-risk deadline → Article 12 working pattern

receipts: Ed25519 signed · RFC 3161 timestamped · mirrored to the Rekor transparency log · verify at registry.falsify.dev · listed in UK AISI Inspect extensions

◆ For Your Role ◆

Routing

Four entry points.
Pick the one that matches the job.

Each row maps a role to a working pattern on the spec subdomain. Same primitive, four framings.

Compliance lead

Mapping ML to EU AI Act Article 12

The logging article (Art. 12) applies to high-risk systems from 2 December 2027 — deferred from 2 August 2026 by the EU Digital Omnibus. A PRML manifest is one concrete record that can sit inside an Art. 12 log; it is not the whole logging system.

eu-ai-act/article-12 →

ML governance / audit

Running ML evaluation reproducibility audits

Dataset hashes, seed handling, threshold drift, model-card verification. Audit checklist with PRML manifest examples.

reproducibility/eval-audit →

Researcher

Bringing pre-registration discipline to ML

From AsPredicted and OSF norms to a content-addressed manifest. For people crossing over from clinical research and psychology.

pre-registration →

Self-assessment

Self-assess your EU AI Act evidence readiness in 10 questions

Free. No signup. Maps your current evidence posture against the 2 December 2027 high-risk enforcement bar and points at the gaps.

Run the 10-question assessment →

◆ Loading Thesis ◆

Thesis

Most agents help you build faster.
This one helps you disprove faster.

You write down the claim and the failure criteria before you look at the data. The engine locks the spec with a hash, runs the declared experiment, and refuses to let you rewrite the threshold after seeing the result.

Theorem · Q.E.D.

Theorem

Given: an AI agent producing a numeric claim about model performance.

Claim: the result is falsifiable iff the threshold was registered before the evaluation.

Proof

[I]

Declare. Lock.

Spec is required to include an executable test plan, a metric, and a pre-registered threshold. Vague specs are rejected at lock time. Once locked, the canonical YAML is hashed — any edit produces a new hash and must be justified as an amendment.
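The lock step can be sketched in a few lines. This is a minimal illustration, not the falsify implementation: the field names `test_plan`, `metric`, and `threshold` are hypothetical, and sorted-key JSON stands in for PRML's canonical YAML, which the spec defines separately.

```python
import hashlib
import json


def canonical_bytes(spec: dict) -> bytes:
    # Stand-in canonical form: sorted keys, fixed separators, UTF-8.
    # PRML defines its own canonical YAML; this only mimics the idea
    # that equal content must always serialize to equal bytes.
    text = json.dumps(spec, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def lock(spec: dict) -> str:
    # Reject vague specs at lock time (hypothetical field names).
    required = ("test_plan", "metric", "threshold")
    missing = [key for key in required if key not in spec]
    if missing:
        raise ValueError(f"spec rejected at lock time, missing: {missing}")
    return hashlib.sha256(canonical_bytes(spec)).hexdigest()


spec = {"metric": "accuracy", "threshold": 0.85, "test_plan": "python eval.py"}
locked = lock(spec)

# Lowering the bar after the fact yields a different hash.
edited = dict(spec, threshold=0.80)
print(locked != lock(edited))
```

Key order does not affect the hash in this sketch, so two authors writing the same spec get the same identifier; any change to a value does.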

[II]

Run. Frozen.

The engine does not invent experiments. It calls the script you declared, in the environment you declared, and verifies the spec hash has not drifted between lock and run.
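A rough sketch of the drift check, under stated assumptions: the helper names `sha256_file` and `run_locked` are hypothetical, and the hash here is taken over the raw spec file rather than PRML's canonical bytes. The declared command runs only if the spec still matches the hash recorded at lock time.

```python
import hashlib
import subprocess
import sys
from typing import Sequence

EXIT_TAMPERED = 3  # exit code the page assigns to a tampered spec


def sha256_file(path: str) -> str:
    # Simplification: hash the file bytes as-is, with no canonicalization.
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def run_locked(spec_path: str, locked_hash: str, command: Sequence[str]) -> int:
    # Refuse to run if the spec changed between lock and run.
    if sha256_file(spec_path) != locked_hash:
        return EXIT_TAMPERED
    # Run exactly the declared command; nothing is invented here.
    return subprocess.run(list(command), check=False).returncode
```

A call such as `run_locked("spec.yaml", locked_hash, [sys.executable, "eval.py"])` would return 3 without launching the script if `spec.yaml` had been edited after the lock.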

[III]

Verdict. Numeric.

A numeric comparison against the locked threshold. No LLM judgment, no natural-language hedging. Exit code 0 for PASS, 10 for FAIL, 3 for tampered — so CI can gate on it.
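The comparison itself is small enough to show in full. A minimal sketch: the `direction` parameter and the strict inequality are assumptions, and exit code 3 is left to the hash check rather than handled here.

```python
import math

EXIT_PASS = 0
EXIT_FAIL = 10


def verdict(value: float, threshold: float, direction: str = "above") -> int:
    # A missing or undefined measurement can never satisfy the bar.
    if math.isnan(value):
        return EXIT_FAIL
    if direction == "above":
        return EXIT_PASS if value > threshold else EXIT_FAIL
    if direction == "below":
        return EXIT_PASS if value < threshold else EXIT_FAIL
    raise ValueError(f"unknown direction: {direction!r}")
```

In a CI step, `sys.exit(verdict(measured, locked_threshold))` would be enough for the pipeline to gate on the result.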

[IV]

Guard. Blocks.

A git commit-msg hook scans all logged verdicts and blocks commits, README edits, or marketing copy that contradict the recorded result.
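One way such a guard could work, sketched for the commit message only. Everything here is an assumption for illustration: the verdict log is modelled as a plain dict, the claim words are a toy list, and `contradictions` and `hook_exit_code` are hypothetical names. Git does pass the path of the message file to a commit-msg hook as its first argument, and a non-zero exit aborts the commit.

```python
import re
import sys

# Toy list of words that read as a success claim (assumption).
CLAIM_WORDS = r"(pass|passes|passed|confirmed)"


def contradictions(message: str, verdicts: dict) -> list:
    # Return spec names the message presents as successful
    # although the recorded verdict is FAIL.
    found = []
    for name, recorded in verdicts.items():
        pattern = rf"\b{re.escape(name)}\b[^\n]*\b{CLAIM_WORDS}\b"
        if recorded == "FAIL" and re.search(pattern, message, re.IGNORECASE):
            found.append(name)
    return found


def hook_exit_code(message_path: str, verdicts: dict) -> int:
    # message_path is the file Git hands to the commit-msg hook.
    with open(message_path, encoding="utf-8") as fh:
        bad = contradictions(fh.read(), verdicts)
    for name in bad:
        print(f"blocked: {name} is recorded as FAIL", file=sys.stderr)
    return 1 if bad else 0
```

With a log of `{"hypothesis_a": "FAIL"}`, a message reading "hypothesis_a confirmed, shipping" would be rejected, while "refactor cli" would pass through.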

∎

The threshold you set is the one you answer to. Q.E.D. · 04fa1689ac55

◆ Initializing System ◆

The System

Init. Lock. Run.
Verdict. Guard.

The claim becomes the spec. The spec becomes the hash. The hash becomes the verdict. The order is the contract.
Both CLIs ship with pip install falsify: falsify is the PRML reference, falsify-engine the workflow engine below.

falsify-engine · Hypothesis A · v2

Live

Spec Ledger · falsify-engine list · 4 Locked

claude_surface · dogfood · PASS · 10

cli_startup · dogfood · PASS · 58ms

test_coverage_count · dogfood · PASS · 601

hypothesis_a · case_study · FAIL · 0.214

honesty --scope dogfood · 1.00

Tamper detection · change the value · INTACT

threshold: above

spec_hash: 04fa1689ac55

Spec · canonical YAML · SHA-256 hashed before run

Verdict · numeric compare · no LLM judgment

Guard · git commit-msg hook · blocks contradictions

Surface · 5 skills · 2 subagents · 3 commands · 1 MCP

exit 0 · PASS · verdict satisfies the locked threshold

exit 10 · FAIL · verdict does not satisfy the locked threshold; ship as recorded

exit 3 · TAMPERED · spec edited after lock · CI refuses to ship

◆ Scanning Evidence ◆

Evidence

The number changed.
The locked spec caught it.

Dogfood case: a March operating note recorded −0.0139; the locked spec re-run produced a materially different number. The engine surfaced what the notebook could not — the recorded verdict no longer reconciled with the data. Full write-up →

v2 event-level delta — eligible vs demoted cohort

+0.0000

Positive magnitude, small sample. The verdict is PASS against a pre-registered threshold of zero, but the engine refuses to let any “result confirmed” claim ship until the minimum sample size is met.
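The two-part rule above, a verdict plus a separate sample-size gate, might look like this. A sketch only: `Outcome`, `evaluate`, and `min_events` are hypothetical names, and the numbers in the usage note are illustrative rather than taken from the locked spec.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    verdict: str        # "PASS" or "FAIL" against the locked threshold
    confirmable: bool   # whether a "result confirmed" claim may ship


def evaluate(delta: float, threshold: float, n_events: int, min_events: int) -> Outcome:
    # The verdict depends only on the number and the locked threshold.
    verdict = "PASS" if delta > threshold else "FAIL"
    # Confirmation additionally requires the pre-registered sample size.
    confirmable = verdict == "PASS" and n_events >= min_events
    return Outcome(verdict, confirmable)
```

For example, `evaluate(0.01, 0.0, n_events=35, min_events=100)` gives a PASS verdict with `confirmable` set to False: the number clears the bar, but the claim stays blocked until more events resolve.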

127 resolved events across the locked parent cohort. 16 eligible events, 19 demoted. Two locked specs (v1 March-frozen, v2 April-live), one contradictory historical claim. Reconciliation before marketing, not after.

Locked spec · hash 04fa1689ac55

◆ Specification ◆

PRML v0.1

A specification, not a tool.
The format is the contract.

Falsify is the reference implementation of PRML v0.1 — Pre-Registered ML Manifest Specification. Nine YAML fields. One SHA-256 over canonical bytes. Computed before the experiment runs. Verifiable offline by anyone with the manifest, the dataset, and the model. CC BY 4.0.

Four independent reference implementations — Python, JavaScript, Go, Rust — reproduce the conformance vectors byte-for-byte. v0.2 frozen 2026-05-22 (RFC closed).

Specification

spec.falsify.dev/v0.1

RFC-style spec. CC BY 4.0. Nine required fields. 21 conformance vectors (13 v0.1 + 8 v0.2 candidate) with locked SHA-256.
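Checking an implementation against conformance vectors amounts to hashing each input and comparing with the locked digest. The sketch below assumes a vector is an (input bytes, expected SHA-256) pair; the two toy vectors are the well-known digests of `abc` and the empty string, not PRML vectors, and `toy_canonicalize` is a placeholder for a real canonicalizer.

```python
import hashlib
from typing import Callable, Iterable, Tuple

# Toy vectors (NOT from the PRML suite): standard SHA-256 test values.
TOY_VECTORS = [
    (b"abc\n", "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"),
    (b"", "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"),
]


def toy_canonicalize(raw: bytes) -> bytes:
    # Placeholder canonicalizer: strips surrounding whitespace only.
    return raw.strip()


def check_vectors(
    vectors: Iterable[Tuple[bytes, str]],
    canonicalize: Callable[[bytes], bytes],
) -> Tuple[int, int]:
    # Count how many vectors the implementation reproduces exactly.
    passed = total = 0
    for raw, expected in vectors:
        total += 1
        if hashlib.sha256(canonicalize(raw)).hexdigest() == expected:
            passed += 1
    return passed, total


print("%d/%d" % check_vectors(TOY_VECTORS, toy_canonicalize))
```

An implementation whose canonicalizer differs by even one byte fails the affected vector, which is what makes byte-for-byte agreement across languages checkable.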

Preprint — Zenodo DOI

14 pages, LaTeX

Threat model, EU AI Act mapping (Articles 12 · 17 · 18 · 50 · 72 · 73), NIST AI RMF, ISO 42001. Archived with a citable DOI.

v0.2 RFC

frozen 2026-05-22

Ten changes. Five open RFC questions. Public review window closed 2026-05-22; feedback rolls into the v0.3 cycle.

13/13 + 8/8 · ~400 LOC

Reference 3

13/13 + 8/8 · ~450 LOC

Reference 4

Rust

13/13 + 8/8 · ~600 LOC

◆ Access Protocol ◆

Source · MIT · Public Repo

v0.3.11

Your past self is the one your future self answers to.

It does not tell you if you’re right. It tells you if you changed the rules.

$ pip install falsify

single-file CLI · no daemon · no account

View on GitHub →

90s demo: lock, run, watch the guard block a contradiction →

Surfaces

Spec · v0.1
PRML working draft
Full specification with grammar, canonicalization, §8.1 limitations. CC BY 4.0.

Registry · live
Commit a manifest
Paste YAML, get a SHA-256 permalink and a README badge. No account.

CI integration · v2
prml-verify-action
Five-line GitHub Action. Block merges on tampered or regressed eval claims. Composite, MIT.

Engagement · scoped per client
Diagnostic Sprint
Fixed-scope written engagements, from a €3,000 design-partner pilot to 8–10 week enterprise scopes. Manifest authored, verifier deployed, evidence pack shipped.

Reference
What is PRML?
Plain-English answer page. Citable for AI search and academic context.

Comparison
PRML vs …
Side-by-side with in-toto, SLSA, Sigstore, Model Cards, HELM, OSF, MLRC checklists.

Tool
PRML Claim Check
Paste an ML benchmark claim. See which falsifiability criteria it skipped.

Index
PRML Disclosure Index
27 well-known ML eval claims, scored against the 9 disclosure criteria.

For gatekeepers
Require a receipt
Run a leaderboard, benchmark, or review process? A copy-paste CI gate: no locked claim, no entry.

EU AI Act · compliance resources

Long-form working notes for compliance leads, AI governance officers, and notified-body assessors preparing for the 2 December 2027 high-risk deadline under Regulation (EU) 2024/1689. CC BY 4.0.

Article 9, 12, 13, 14, 17, 18
EU AI Act readiness assessment
Six articles, ten questions, defensible evidence shape for high-risk AI providers.

2 December 2027
Deadline & ten-week plan
Three application dates, risk classification, Article 99 penalty structure, pragmatic execution path.

Article 12 logging
Ten-item logging checklist
Six event categories, ten-year retention, printable single-page summary.

New · Reg 2026/1744
Article 25(2) hand-over pack
Provider hand-overs are now a fineable evidence event. What must be disclosed, and how to timestamp it.

Annex VI vs VII
Notified body evidence
When you need a notified body, what an Article 31 assessor actually asks for, six artefact families.

ISO/IEC 42001:2023
42001 AIMS readiness
Seven clauses, EU AI Act Article 17 overlap, twelve-month certification path.

Case study #1
Anonymized LLM tech report audit
Score 5/9 on a typical published claim. PRML manifest, audit memo, what we caught.

Field report · 2026-05-22
Lock #2 post-mortem
The first thing PRML falsified was its own distribution hypothesis. The second was its own counting bug.

Cüneyt Öztürk · PRML / falsify

Cüneyt Öztürk

Author of PRML v0.1 and the four reference implementations (Python, JS, Go, Rust). Independent researcher working on AI evaluation infrastructure and the PRML / falsify track.

LinkedIn → · GitHub → · falsify.dev
