NextStat

Bayesian Benchmarks That Mean Something

ESS/sec vs Wall-Time — rigorous protocols for comparisons vs Stan and PyMC.

Bayesian · NUTS · ESS/sec · Stan · PyMC

2026-02-08 · 7 min read


Trust Offensive series: Index · Prev: Differentiable HistFactory · Next: Pharma Benchmarks

Bayesian benchmarks are notorious because they collapse a multi-dimensional object (a posterior + a sampling algorithm) into a single number. That can be useful — if you control the experiment.

This post explains how we plan to benchmark Bayesian inference using ESS/sec in a way that is reproducible and comparable across frameworks.

Canonical protocol + artifact contract: Public Benchmarks. Validation pack contract: Validation Report.

Abstract. We treat Bayesian performance numbers as scientific claims: we define the posterior and the inference protocol, report sampling efficiency (ESS/sec, bulk + tail) and health (divergences, treedepth saturation, R̂, E-BFMI), and publish artifacts so an outsider can rerun the comparison.

1. Why ESS/sec is the right first metric (and when it isn't)

For gradient-based samplers, wall-time per iteration is not the whole story. Effective sample size per second (ESS/sec) captures both how expensive each iteration is and how efficiently those iterations explore the posterior.
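
To be concrete about the metric, here is a minimal sketch assuming ArviZ-style diagnostics and a separately measured sampling wall-time; the published harness pins its own computation policy and version (see pitfall C below):

# Minimal sketch: conservative ESS/sec from ArviZ diagnostics plus measured
# sampling wall-time. Assumes `idata` is an ArviZ InferenceData object.
import arviz as az

def ess_per_sec(idata, wall_time_s: float, method: str = "bulk") -> float:
    ess = az.ess(idata, method=method)     # one ESS value per parameter
    min_ess = float(ess.to_array().min())  # worst parameter = conservative summary
    return min_ess / wall_time_s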

But ESS/sec only makes sense when:

  • The model is the same
  • The parameterization is the same (centered vs non-centered can dominate)
  • Diagnostics are computed consistently

2. What we publish: ESS/sec + "health" metrics

ESS/sec alone can be gamed by pathological runs. So we publish, at minimum:

  • Bulk ESS/sec and tail ESS/sec (per parameter group)
  • Divergence rate
  • Max treedepth saturation rate
  • Max R̂ across parameters
  • Minimum ESS (bulk + tail) across parameters
  • Minimum E-BFMI across chains

Those are not "extra diagnostics" — they are part of what it means for a run to count.
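
A minimal sketch of how such a health summary can be assembled, assuming ArviZ-converted draws (the sample_stats field names are ArviZ conventions, not the harness's published artifact schema):

# Sketch of the "health" side of a run. `idata` is assumed to be an ArviZ
# InferenceData; field names follow ArviZ conventions and may differ from
# the harness's schema.
import arviz as az

def health_report(idata, max_treedepth: int) -> dict:
    ss = idata.sample_stats
    return {
        "divergence_rate": float(ss["diverging"].mean()),
        "treedepth_saturation_rate": float((ss["tree_depth"] >= max_treedepth).mean()),
        "max_rhat": float(az.rhat(idata).to_array().max()),
        "min_ess_bulk": float(az.ess(idata, method="bulk").to_array().min()),
        "min_ess_tail": float(az.ess(idata, method="tail").to_array().min()),
        "min_ebfmi": float(az.bfmi(idata).min()),
    }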


3. The benchmarking protocol (what must be pinned)

To avoid benchmark theater, we pin:

  • Model definition + priors
  • Parameterization (explicitly)
  • Warmup length and sampling length
  • Adaptation settings: target acceptance, step-size policy, mass matrix policy + update schedule
  • RNG seeding policy (and determinism policy)

If any of these differ, we don't call it a comparison — we call it a different experiment.
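
To illustrate what "pinned" means in practice, here is a hypothetical protocol record; the field names and values are illustrative, not the harness's published schema:

# Hypothetical example of a pinned protocol record; every field below must be
# identical across frameworks for a run to count as a comparison.
PINNED_PROTOCOL = {
    "model": "eight_schools",
    "parameterization": "non_centered",   # explicit, because it can dominate ESS/sec
    "warmup": 1000,
    "samples": 1000,
    "chains": 4,
    "target_accept": 0.9,
    "max_treedepth": 10,
    "mass_matrix": "diag",                # adapted during warmup
    "seed": 20260208,
}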


4. Correctness gates: performance numbers must be "allowed to exist"

We separate posterior correctness from performance and publish both.

4.1 NUTS quality smoke suite (fast)

The Apex2 "NUTS quality" runner catches catastrophic regressions in posterior transform plumbing, HMC/NUTS stability, and diagnostics plumbing:

PYTHONPATH=bindings/ns-py/python ./.venv/bin/python \
  tests/apex2_nuts_quality_report.py \
  --deterministic \
  --out tmp/apex2_nuts_quality_report.json

The suite includes small, interpretable cases: a Gaussian mean, a strong-prior posterior, and a Neal's funnel stress case.

4.2 Simulation-Based Calibration (SBC) (slow, stronger evidence)

SBC validates the posterior more directly: for synthetic datasets generated from the prior, the rank of each true parameter value among its posterior draws should be uniformly distributed.

NS_RUN_SLOW=1 NS_SBC_RUNS=20 NS_SBC_WARMUP=200 NS_SBC_SAMPLES=200 \
  PYTHONPATH=bindings/ns-py/python ./.venv/bin/python \
  tests/apex2_sbc_report.py \
  --deterministic \
  --out tmp/apex2_sbc_report.json

This is not a performance benchmark — it's a correctness gate that makes performance comparisons meaningful.
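
For intuition, the rank statistic behind SBC can be computed along these lines; prior_sample, simulate, and posterior_draws are hypothetical stand-ins for the harness's actual machinery:

# SBC sketch for a scalar parameter. The three callables are hypothetical:
#   prior_sample()      -> one draw of theta from the prior
#   simulate(theta)     -> one synthetic dataset given theta
#   posterior_draws(y)  -> array of posterior draws of theta given y
import numpy as np

def sbc_ranks(n_runs, prior_sample, simulate, posterior_draws):
    ranks = []
    for _ in range(n_runs):
        theta_true = prior_sample()
        y = simulate(theta_true)
        draws = posterior_draws(y)
        ranks.append(int(np.sum(draws < theta_true)))  # rank of truth among draws
    return np.array(ranks)  # ~uniform on {0, ..., len(draws)} if calibrated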


5. What we will publish (artifacts)

For each benchmark snapshot:

  • Baseline manifest (versions, hardware, settings)
  • Raw timing measurements
  • ESS metrics (bulk + tail) per parameter group
  • Key diagnostics (divergences, tree depth, step size)

If a run has pathologies (divergences, failure to adapt), we publish that as a result, not as a footnote.


6. Known pitfalls (we will document them explicitly)

A) Parameterization dominates

The same model in centered vs non-centered form can differ by orders of magnitude in ESS/sec. We treat parameterization as part of the benchmark input.
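
To make the pitfall concrete, here is a sketch of both forms for the standard eight-schools model in PyMC (an illustration only, not the harness's model definition):

# Centered vs non-centered eight schools in PyMC (illustrative sketch).
import numpy as np
import pymc as pm

y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

# Centered: theta is sampled directly on the funnel-shaped (mu, tau) geometry.
with pm.Model() as centered:
    mu = pm.Normal("mu", 0.0, 5.0)
    tau = pm.HalfCauchy("tau", 5.0)
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=8)
    pm.Normal("obs", mu=theta, sigma=sigma, observed=y)

# Non-centered: sample standardized offsets and rescale, which decorrelates
# theta from tau and typically improves ESS/sec dramatically.
with pm.Model() as non_centered:
    mu = pm.Normal("mu", 0.0, 5.0)
    tau = pm.HalfCauchy("tau", 5.0)
    theta_raw = pm.Normal("theta_raw", 0.0, 1.0, shape=8)
    theta = pm.Deterministic("theta", mu + tau * theta_raw)
    pm.Normal("obs", mu=theta, sigma=sigma, observed=y)

Same posterior over theta, very different sampling geometry; that difference belongs to the benchmark input, not to the framework being measured.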

B) Different defaults are different experiments

Stan, PyMC, and custom implementations have different adaptation defaults. Benchmarks must either align these settings or publish them and interpret differences.
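
As an illustration of what "align or publish" means, the commonly divergent knobs map roughly as follows (option names follow each library's documentation; the values are examples only):

# Illustrative cross-backend mapping of adaptation settings; values are
# examples, not recommended defaults.
ALIGNED_SETTINGS = {
    "target_accept": {"cmdstanpy": "adapt_delta",   "pymc": "target_accept", "value": 0.9},
    "warmup":        {"cmdstanpy": "iter_warmup",   "pymc": "tune",          "value": 1000},
    "samples":       {"cmdstanpy": "iter_sampling", "pymc": "draws",         "value": 1000},
    "max_treedepth": {"cmdstanpy": "max_treedepth", "pymc": "max_treedepth", "value": 10},
    "seed":          {"cmdstanpy": "seed",          "pymc": "random_seed",   "value": 20260208},
}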

C) ESS computation must be consistent

ESS depends on the method and version of the diagnostics tool. We publish the exact computation policy and version.
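
A minimal sketch of what pinning that policy could look like (ArviZ here is an assumption; the harness publishes its own tool and version):

# Record the diagnostics tool, its version, and the ESS methods used, so the
# numbers can be recomputed exactly.
import arviz as az

ESS_POLICY = {
    "tool": "arviz",
    "version": az.__version__,
    "methods": ["bulk", "tail"],   # as in az.ess(idata, method=...)
}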


7. How to run a local microbenchmark (today)

For a Rust-only microbenchmark of NUTS wall-time under a pinned NutsConfig:

cargo bench -p ns-inference --bench nuts_benchmark

This covers a small Normal mean model and a tiny HistFactory fixture; useful for regression detection, not for cross-framework "wins".


8. What you should take away

When we publish Bayesian benchmark numbers, you can answer: "what exact posterior was sampled?", "under what exact settings?", "how healthy were the diagnostics?" — and rerun the same harness on your hardware.


A. Harness status (today)

The public benchmarks repo includes a runnable Bayesian seed suite. It supports a comma-separated backend list (e.g. --backends nextstat,cmdstanpy,pymc). NextStat is always available; CmdStanPy and PyMC are optional. If optional dependencies are missing, the runner emits a schema-valid JSON artifact with status="warn" and an actionable reason, instead of failing the whole suite.

Current coverage nuance: the Simple HistFactory case is NextStat-only today. Cross-framework comparisons target the synthetic GLM and hierarchical baselines.

Artifact contracts written by the suite runner:

  • Per-case result: nextstat.bayesian_benchmark_result.v1
  • Suite index: nextstat.bayesian_benchmark_suite_result.v1
# NextStat-only (dependency-light)
python3 suites/bayesian/suite.py --deterministic --out-dir out/bayesian

# With optional backends (best-effort)
python3 suites/bayesian/suite.py --deterministic --out-dir out/bayesian --backends nextstat,cmdstanpy,pymc
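
A downstream check could consume these artifacts along the following lines; apart from status and the schema identifiers above, the field names are assumptions:

# Sketch of scanning the suite's per-case artifacts for skipped backends.
# Only `status` is documented above; `reason` is a guessed field name.
import json, pathlib

for path in sorted(pathlib.Path("out/bayesian").glob("*.json")):
    artifact = json.loads(path.read_text())
    if artifact.get("status") == "warn":
        print(f"{path.name}: {artifact.get('reason', 'no reason recorded')}")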

Backends

  • NextStat — native NUTS sampler, always available
  • CmdStanPy — calls the CmdStan binary (refresh=1 to avoid a silent hang on short runs)
  • PyMC — PyMC NUTS via PyTensor/JAX backend

Initial public baseline set (recommended)

  • Neal's funnel — classic pathological geometry (10–100 dims; see the sketch after this list)
  • Eight schools — standard hierarchical model
  • Logistic regression — German credit or similar (moderate dims)
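
For reference, a minimal sketch of the funnel's log-density; the v-dependent scale of x is what stresses step-size and mass-matrix adaptation:

# Neal's funnel: v ~ Normal(0, 3), x_i | v ~ Normal(0, exp(v / 2)).
import numpy as np

def funnel_logp(v: float, x: np.ndarray) -> float:
    logp_v = -0.5 * (v / 3.0) ** 2 - np.log(3.0 * np.sqrt(2.0 * np.pi))
    sigma_x = np.exp(v / 2.0)
    logp_x = np.sum(-0.5 * (x / sigma_x) ** 2 - np.log(sigma_x * np.sqrt(2.0 * np.pi)))
    return float(logp_v + logp_x)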

The baseline manifest includes best-effort GPU metadata via nvidia-smi (device name, driver version, CUDA version, memory) when running on GPU-equipped runners.


Related reading