Performance claims in scientific software are scientific claims: they should be reproducible from artifacts alone, not from narrative descriptions. The core failure mode is not malice; it is ambiguity: input drift, warmup and cache drift, environment drift, and “correctness drift” (timing a different computation under the same name).
We're doing a trust offensive: publishing benchmark snapshots designed like experiments — with protocols, pinned environments, correctness gates, and auditable artifacts.
Canonical specification (protocol + artifacts): Public Benchmarks.
Series (recommended reading order)
- Trust Offensive: Public Benchmarks (this post) — the why + the trust model.
- The End of the Scripting Era — why "rerunnable evidence" changes how scientific software is built.
- Benchmark Snapshots as Products — CI artifacts, manifests, and baselines.
- Third-Party Replication: Signed Reports — external reruns as the strongest trust signal.
- Building a Trustworthy HEP Benchmark Harness — methodology for HistFactory benchmarking.
- Numerical Accuracy — ROOT vs pyhf vs NextStat, with reproducible evidence.
- Differentiable HistFactory in PyTorch — training NNs directly on Z₀.
- Bayesian Benchmarks: ESS/sec — how we make sampler comparisons meaningful.
- Pharma Benchmarks: PK/NLME — protocols for regulated-grade benchmarks.
- JAX Compile vs Execution — the latency benchmark that matters in short loops.
- Unbinned Event-Level Analysis — event-level likelihood with explicit contracts and correctness gates.
- Compiler-Symbolic vs Hybrid-Neural GPU Fits — symbolic JIT vs hybrid analytical+ONNX pipeline.
Companion docs (canonical runbooks)
Blog posts explain why and what it means. Docs explain how to rerun it.
Start here:
- Public Benchmarks Specification — what we measure, what we publish, what gets pinned.
- Benchmark Results — reproducible performance snapshots across 6 verticals.
1. What we're benchmarking (and what we're not)
We benchmark end-to-end workflows that real users run, not only micro-kernels:
- HEP / HistFactory — NLL evaluation, gradients, MLE fits, profile scans, toy ensembles
- Pharma — PK/NLME likelihood + fitting loops
- Bayesian — ESS/sec under well-defined inference settings
- ML infra — compile latency vs execution throughput (e.g., JAX, differentiable pipelines)
- Time Series — Kalman filter/smoother throughput, EM convergence cost
- Econometrics — Panel FE, DiD, IV/2SLS scaling with cluster count
We keep Criterion microbenchmarks as regression detectors, not as the headline.
Non-goals: one-off "hero numbers", unpublished harness scripts, performance without correctness gates.
2. The hard part: making benchmarks trustworthy
A. "Fast" because it's not doing the same thing
For binned likelihood pipelines, a benchmark is meaningless if the implementation is not numerically consistent with a reference. Our rule: before timing, the harness validates correctness within an explicit tolerance.
The validation report system formalizes this: every published snapshot includes a validation_report.json with dataset SHA-256 hashes, model specs, and per-suite pass/fail gates.
For snapshot indexing and replication, the contract is explicit: validation_report.json follows validation_report_v1, snapshots are indexed with nextstat.snapshot_index.v1, and external reruns can be compared via nextstat.replication_report.v1.
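To make the gate concrete, here is a minimal sketch of the idea, not NextStat's actual harness code: the function names, the tolerance, and the report fields below are illustrative placeholders; only the validation_report_v1 label and the dataset SHA-256 come from the spec.

```python
import hashlib
import json
import math

def sha256_of_file(path: str) -> str:
    """Hash the input dataset so the report pins exactly what was benchmarked."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_validation_report(out_path, dataset_path, suite, candidate_nll, reference_nll, rel_tol=1e-8):
    """Gate timing on numerical agreement with a reference NLL (hypothetical field names)."""
    passed = math.isclose(candidate_nll, reference_nll, rel_tol=rel_tol)
    report = {
        "schema": "validation_report_v1",
        "suite": suite,
        "dataset_sha256": sha256_of_file(dataset_path),
        "reference_nll": reference_nll,
        "candidate_nll": candidate_nll,
        "rel_tol": rel_tol,
        "passed": passed,              # timing proceeds only if this is True
    }
    with open(out_path, "w") as f:
        json.dump(report, f, indent=2)
    return passed
```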
B. "Fast" because it's warmed up differently
JIT compilation, caching, GPU kernel loading, memory allocators, and Python import cost can dominate naive measurements. Every benchmark must specify its warmup policy, its steady-state measurement window, and what is included in (and excluded from) the timed region.
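A minimal sketch of what an explicit policy looks like in practice; the warmup count and repeat count below are illustrative defaults, not our published settings.

```python
import time

def measure(fn, *, warmup: int = 3, repeats: int = 20):
    """Time `fn` after discarding warmup iterations that trigger JIT, caches, and imports."""
    for _ in range(warmup):
        fn()                                   # excluded from the measurement window
    samples_s = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples_s.append(time.perf_counter() - t0)
    # Raw per-repeat samples are returned; aggregation is a separate, declared step.
    return {"warmup": warmup, "repeats": repeats, "samples_s": samples_s}
```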
C. Environment isn't pinned
Scientific compute is sensitive to compiler versions, BLAS backends, GPU drivers, and Python dependency constraints. Every published snapshot includes enough environment metadata to reconstruct the build and runtime assumptions.
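As an illustration only (the package list and field names are placeholders, and compiler, BLAS, and GPU-driver probes are deliberately omitted here), a few lines of Python are enough to capture this kind of metadata:

```python
import importlib.metadata as md
import platform
import sys

def environment_manifest(packages=("numpy", "scipy")):
    """Capture enough environment metadata to reconstruct build and runtime assumptions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = md.version(name)
        except md.PackageNotFoundError:
            versions[name] = None              # recorded as absent rather than silently skipped
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }
```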
D. Reporting only one convenient statistic
Single numbers hide variance. Our rule: publish raw per-test measurements and the aggregation policy (median, min-of-N) explicitly.
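A sketch of what "explicit aggregation policy" means operationally; the field names are illustrative, not the published schema.

```python
import statistics

def aggregate(samples_s, policy: str = "median"):
    """Summarize raw timings under a named policy, keeping the raw samples alongside."""
    if policy == "median":
        value = statistics.median(samples_s)
    elif policy == "min":                      # "min-of-N": best-case steady-state latency
        value = min(samples_s)
    else:
        raise ValueError(f"unknown aggregation policy: {policy!r}")
    return {
        "policy": policy,
        "n": len(samples_s),
        "value_s": value,
        "raw_s": list(samples_s),              # raw measurements travel with the summary
    }
```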
3. What we publish
- Raw results (per test, per repeat)
- Summary tables
- Baseline manifest (code SHA, env versions, dataset hashes, run config)
- Correctness/parity report used for gating
- validation_report.json + optional PDF via nextstat validation-report
- Snapshot index with SHA-256 artifact digests (nextstat.snapshot_index.v1); a minimal indexing sketch follows this list
- Optional replication report for third-party reruns (nextstat.replication_report.v1)
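To make the snapshot index concrete, here is a minimal sketch assuming a flat directory of artifacts; the schema name comes from the spec, but the field layout shown is a placeholder, not the canonical one.

```python
import hashlib
import json
from pathlib import Path

def build_snapshot_index(artifact_dir: str, out_path: str = "snapshot_index.json"):
    """Record a SHA-256 digest for every artifact so reruns can verify the exact bytes."""
    entries = []
    for path in sorted(Path(artifact_dir).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            entries.append({"file": path.name, "sha256": digest})
    index = {"schema": "nextstat.snapshot_index.v1", "artifacts": entries}  # field names are placeholders
    Path(out_path).write_text(json.dumps(index, indent=2))
    return index
```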
This is the difference between "trust me" and "rerun me".
4. Performance as a scientific claim
In research, we don't accept "it worked on my machine" for results. Performance should be treated the same way — especially when performance changes what analyses are feasible:
- Toy ensembles become practical at 10³–10⁵ scale
- Profile scans become interactive
- ML training can optimize inference metrics directly rather than surrogates
If a benchmark can't be reproduced, it's not evidence. It's an anecdote.
5. The suites
| Suite | Focus |
|---|---|
| HEP | pyhf + ROOT/RooFit with NLL parity gates, GPU batch toys (CUDA + Metal) |
| Pharma | PK/NLME likelihood + fitting with analytic reference baselines |
| Bayesian | ESS/sec vs Stan + PyMC, SBC calibration |
| ML | Compile vs execution latency, differentiable pipeline throughput |
| Time Series | Kalman filter/smoother throughput, EM convergence cost |
| Econometrics | Panel FE, DiD, IV/2SLS scaling with cluster count |
6. The ask: rerun it
Public benchmarks only work if other people rerun them. The most valuable contribution you can make is:
- Rerun the harness on your hardware
- Publish your manifest + results
- Tell us what diverges (numbers, settings, correctness gates); a comparison sketch follows below
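As a sketch of what "tell us what diverges" can look like mechanically, assuming both sides export per-suite medians in a simple JSON layout (a layout invented here for illustration, not the replication report schema):

```python
import json

def diverging_suites(baseline_path: str, rerun_path: str, rel_tol: float = 0.10):
    """List suites whose rerun median differs from the published baseline by more than rel_tol."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(rerun_path) as f:
        rerun = json.load(f)
    flagged = []
    for suite, base_median in baseline["median_s"].items():
        rerun_median = rerun.get("median_s", {}).get(suite)
        if rerun_median is None:
            flagged.append((suite, "missing in rerun"))
        elif abs(rerun_median - base_median) / base_median > rel_tol:
            flagged.append((suite, f"{rerun_median:.3g}s vs {base_median:.3g}s"))
    return flagged
```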
That is how "fast" becomes "trusted".
