Paste a benchmark claim. See what it forgot to say.
A 30-second forensic breakdown of any ML accuracy, refusal-rate, or pass-rate claim. We check it against the 9 falsifiability criteria the PRML spec considers minimum hygiene. The check is heuristic, not authoritative, but it surfaces what most published claims quietly omit.
This is a heuristic regex-based check, not a formal PRML verifier. Some honest claims will score low because they describe their methodology elsewhere; some dishonest ones might gain points by mentioning the right keywords without backing them up. The score surfaces structural omissions — interpret accordingly. The real spec is at spec.falsify.dev/v0.1.
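To make the caveat concrete, here is a minimal sketch of what a regex-based check of this kind can look like. The criterion names and patterns below are hypothetical illustrations, not the nine criteria or the expressions this tool actually uses. It also shows the failure mode described above: a claim that merely mentions the right words still gains points.

```python
import re

# Hypothetical criteria for illustration only. These are NOT the nine
# criteria or the patterns the real checker uses.
CRITERIA = {
    "dataset": re.compile(r"\b(dataset|benchmark|test set)\b", re.I),
    "sample_size": re.compile(r"\bn\s*=\s*\d+", re.I),
    "threshold": re.compile(r"\bthreshold\b|>=|≥", re.I),
    "seed": re.compile(r"\bseed\b", re.I),
}


def check_claim(text):
    """Map each criterion name to whether its pattern appears in the claim."""
    return {name: bool(pattern.search(text)) for name, pattern in CRITERIA.items()}


def score(text):
    """Return (criteria matched, criteria checked) for a claim."""
    results = check_claim(text)
    return sum(results.values()), len(results)


vague_claim = "Our model achieves 94% accuracy."
keyword_only_claim = "We used a dataset, a threshold, and a seed."

print(score(vague_claim))         # no criterion is mentioned
print(score(keyword_only_claim))  # keywords alone earn points
```

The second claim names no dataset, no threshold value, and no seed value, yet it outscores the first. That is why the score should be read as a list of structural omissions rather than a verdict.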
If this surfaced something you didn't expect
The 9 criteria above aren't arbitrary — they're the structural fields PRML asks you to commit to a SHA-256 hash before you run an evaluation. Once committed, the hash is your tamper-evident receipt that the threshold and metric were fixed in advance.
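As a rough illustration of the commit-then-evaluate idea, the sketch below hashes a set of pre-registered fields with SHA-256 and later checks them against the stored digest. The field names other than metric and threshold, the example values, and the choice of sorted-key JSON as the serialization are all assumptions made for this example; the actual PRML format is defined at spec.falsify.dev/v0.1 and may differ.

```python
import hashlib
import json


def commit_hash(fields):
    """Hash the fields with SHA-256 over a deterministic JSON encoding.

    Sorted keys and fixed separators are an assumption of this sketch,
    chosen so the same fields always produce the same bytes.
    """
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(fields, expected_hash):
    """Return True if the fields still match the previously published hash."""
    return commit_hash(fields) == expected_hash


# Hypothetical commitment, written down BEFORE running the evaluation.
commitment = {
    "metric": "accuracy",
    "threshold": 0.90,
    "dataset": "example-benchmark-v1",  # placeholder name
    "seed": 42,
}
receipt = commit_hash(commitment)

# After the evaluation, anyone holding the receipt can detect a moved goalpost.
tampered = dict(commitment, threshold=0.85)
print(verify(commitment, receipt))  # fields unchanged
print(verify(tampered, receipt))    # threshold was lowered after the fact
```

Because any change to a committed field changes the digest, publishing the hash before the run is what makes the later claim checkable.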