Open standard · MIT code · CC BY 4.0 spec
Lock the threshold before the run.
Then prove it.
A pre-committed claim can’t be silently rewritten after the result.
PRML makes that difference auditable — 9 fields, one hash, CI exits 3 on tamper.
The discipline is optional. The hash is not.
For compliance leads under the EU AI Act high-risk deadline → Article 12 working pattern
Most agents help you build faster.
This one helps you disprove faster.
You write down the claim and the failure criteria before you look at the data. The engine locks the spec with a hash, runs the declared experiment, and refuses to let you rewrite the threshold after seeing the result.
The spec must include an executable test plan, a metric, and a pre-registered threshold. Vague specs are rejected at lock time. Once locked, the canonical YAML is hashed; any edit produces a new hash and must be justified as an amendment.
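The lock step can be sketched in a few lines. This is an illustration under stated assumptions, not falsify's actual code: the field names in REQUIRED are hypothetical, and sorted-key JSON stands in for the canonical YAML encoding, whose exact rules are not given here.

```python
import hashlib
import json

# Hypothetical required fields; the real PRML field names are not listed in this text.
REQUIRED = ("claim", "test_plan", "metric", "threshold")


def canonical_bytes(spec: dict) -> bytes:
    # Assumed canonical form: sorted keys, fixed separators, UTF-8.
    # The real tool hashes canonical YAML; JSON is a stand-in for the sketch.
    text = json.dumps(spec, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def lock(spec: dict) -> str:
    # Reject a spec that leaves any required field missing or empty.
    missing = [key for key in REQUIRED if spec.get(key) in (None, "")]
    if missing:
        raise ValueError(f"spec rejected at lock time, missing: {missing}")
    return hashlib.sha256(canonical_bytes(spec)).hexdigest()


spec = {
    "claim": "metric beats baseline",
    "test_plan": "python run_eval.py",
    "metric": "delta",
    "threshold": 0.0,
}
locked = lock(spec)

# Moving the threshold after the fact yields a different hash.
edited = dict(spec, threshold=-0.05)
print(locked == lock(spec))
print(locked == lock(edited))
```

Because key order and whitespace are fixed before hashing, only a change in content can change the hash, which is what makes an edited threshold detectable.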
The engine does not invent experiments. It calls the script you declared, in the environment you declared, and verifies the spec hash has not drifted between lock and run.
A numeric comparison against the locked threshold. No LLM judgment, no natural-language hedging. Exit code 0 for PASS, 10 for FAIL, 3 for tampered — so CI can gate on it.
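A minimal sketch of how such a verdict could map to exit codes. The codes 0, 10, and 3 come from the text above; the function name, the comparator table, and the sample values are assumptions for illustration only.

```python
import operator

# Exit codes as stated above: 0 for PASS, 10 for FAIL, 3 for a tampered spec.
EXIT_PASS, EXIT_FAIL, EXIT_TAMPERED = 0, 10, 3

# Hypothetical set of comparators a spec might declare.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def verdict(observed: float, threshold: float, comparator: str,
            locked_hash: str, current_hash: str) -> int:
    # A drifted hash is reported before any numeric comparison is made.
    if current_hash != locked_hash:
        return EXIT_TAMPERED
    passed = COMPARATORS[comparator](observed, threshold)
    return EXIT_PASS if passed else EXIT_FAIL


# Illustrative values only, not results from any real run.
print(verdict(0.02, 0.0, ">", "abc", "abc"))
print(verdict(-0.01, 0.0, ">", "abc", "abc"))
print(verdict(0.02, 0.0, ">", "abc", "xyz"))
```

Checking the hash first means a tampered spec never produces a PASS or FAIL at all, so a CI job can treat exit code 3 as a separate failure class.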
A git commit-msg hook scans all logged verdicts and blocks commits, README edits, or marketing copy that contradict the recorded result.
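A commit-msg hook of this kind might look roughly like the sketch below. Git does pass the path of the commit message file as the hook's first argument, and a non-zero exit blocks the commit; everything else here (the in-memory VERDICTS log, the "H-A PASS" claim pattern, the function names) is hypothetical, and the sketch covers the commit message only.

```python
import re
import sys

# Hypothetical verdict log: claim id -> recorded verdict.
VERDICTS = {"H-A": "FAIL"}

# Hypothetical claim pattern, e.g. "H-A PASS" somewhere in the message.
CLAIM = re.compile(r"\b(?P<id>H-[A-Z])\s+(?P<word>PASS|FAIL)\b")


def contradictions(message: str, verdicts: dict) -> list:
    # Collect every claim in the message that disagrees with the log.
    found = []
    for match in CLAIM.finditer(message):
        recorded = verdicts.get(match["id"])
        if recorded is not None and recorded != match["word"]:
            found.append((match["id"], match["word"], recorded))
    return found


def main(argv: list) -> int:
    # argv[1] is the commit message file path supplied by git.
    with open(argv[1], encoding="utf-8") as handle:
        bad = contradictions(handle.read(), VERDICTS)
    for claim_id, claimed, recorded in bad:
        print(f"{claim_id}: message says {claimed}, log says {recorded}", file=sys.stderr)
    return 1 if bad else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Saved as an executable .git/hooks/commit-msg, a script like this would reject a message claiming "H-A PASS" while the log records FAIL.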
Init. Lock. Run.
Verdict. Guard.
The claim becomes the spec. The spec becomes the hash. The hash becomes the verdict. The order is the contract.
The number changed.
The locked spec caught it.
Hypothesis A came from a personal research project of mine. In March 2026 my own operating note had recorded it as FAIL at -0.0139. Later I re-ran the same hypothesis through falsify against the current dataset. The number came out materially different. The engine surfaced a disagreement my old notebook could not: the numbers I had written down no longer reconciled with what the data actually produced.
Positive magnitude, small sample. The verdict is PASS against a pre-registered threshold of zero, but the engine refuses to let any “result confirmed” claim ship until the minimum sample size is met.
127 resolved events across the locked parent cohort. 16 eligible events, 19 demoted. Two locked specs (v1 March-frozen, v2 April-live), one contradictory historical claim. Reconciliation before marketing, not after.
Frozen corpus: outcome_ts < 2026-03-28. PASS against > 0 threshold.
Live corpus: full resolved ledger as of run. PASS, unreconciled with op-note.
The original March verdict. Engine cannot reproduce — methodology drift exposed.
Four entry points.
Pick the one that matches the job.
Each row maps a role to a working pattern on the spec subdomain. Same primitive, four framings.
A specification, not a tool.
The format is the contract.
Falsify is the reference implementation of PRML v0.1 — Pre-Registered ML Manifest Specification. Nine YAML fields. One SHA-256 over canonical bytes. Computed before the experiment runs. Verifiable offline by anyone with the manifest, the dataset, and the model. CC BY 4.0.
Four independent reference implementations — Python, JavaScript, Go, Rust — reproduce the conformance vectors byte-for-byte. v0.2 frozen 2026-05-22 (RFC closed).
Your past self is the one your future self answers to.
It does not tell you if you’re right. It tells you if you changed the rules.
pip install falsify
single-file CLI · no daemon · no account
Long-form working notes for compliance leads, AI governance officers, and notified-body assessors preparing for the 2 December 2027 high-risk deadline under Regulation (EU) 2024/1689. CC BY 4.0.
Author of PRML v0.1 and the four reference implementations (Python, JS, Go, Rust). Independent researcher working on AI evaluation infrastructure and the PRML / falsify track.