NUTS v10: Progressive Sampling, ESS/Leapfrog, and Reproducible Benchmarks
We report a NUTS implementation in NextStat and a benchmark protocol that makes Bayesian performance numbers reproducible. We introduce ESS/leapfrog, a diagnostic that isolates sampler efficiency from wall-time effects, and show how progressive sampling at the top-level tree join roughly doubles algorithmic efficiency on GLM and hierarchical posteriors, yielding up to 3.21× ESS/sec on a hierarchical non-centered benchmark.
February 2026 · ~15 min read
TL;DR: A single algorithmic change, progressive sampling at the top-level tree join, materially improved sampler efficiency (ESS/leapfrog) while leaving trajectory-level diagnostics unchanged. Result: parity on flat GLM geometry and up to 3.21× higher ESS/sec on a hierarchical non-centered benchmark, under a strict multi-seed protocol.
1. Introduction
Bayesian performance claims are only meaningful when the posterior and the inference protocol are pinned. In practice, cross-framework comparisons regularly fail because the model differs subtly, adaptation defaults differ, diagnostics are computed differently, or wall-time reflects process orchestration rather than sampler quality.
NextStat addresses these issues with:
- An explicit posterior contract (model + priors + transforms).
- A sampler implementation with documented settings.
- A benchmark harness that emits schema-valid artifacts for every run, including failures.
Contributions
- A benchmark protocol and artifact format for Bayesian ESS/sec comparisons with explicit health gates and multi-seed stability.
- An efficiency decomposition using ESS/leapfrog to isolate algorithmic sampling quality from wall-time implementation effects.
- A Stan-compatible NUTS v10 implementation using progressive sampling in the top-level tree-doubling join, materially improving ESS/leapfrog on GLM and hierarchical posteriors.
2. Background: NUTS With Multinomial Trajectory Sampling
We assume Euclidean-metric HMC with position q ∈ ℝᴰ and momentum p ∈ ℝᴰ, with Hamiltonian

H(q, p) = −log π(q) + ½ pᵀ M⁻¹ p,

where π(q) is the target density and M is the mass matrix (diagonal or dense).
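For reference, here is one leapfrog step under this Hamiltonian, as a minimal Python sketch assuming a diagonal metric (NextStat's actual integrator lives in hmc.rs; the function names here are illustrative):

```python
import numpy as np

def leapfrog(q, p, grad_logp, eps, inv_mass_diag):
    """One leapfrog step for H(q,p) = -log pi(q) + 0.5 p^T M^{-1} p.

    grad_logp(q) returns the gradient of log pi; inv_mass_diag is the
    diagonal of M^{-1} (diagonal-metric sketch).
    """
    p = p + 0.5 * eps * grad_logp(q)    # half-step momentum
    q = q + eps * inv_mass_diag * p     # full-step position
    p = p + 0.5 * eps * grad_logp(q)    # half-step momentum
    return q, p
```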
2.1 Multinomial weights
In the multinomial variant of NUTS, each leaf is assigned an unnormalized weight w_i = exp(−ΔH_i). The within-tree proposal is selected proportionally to these weights (a categorical draw over the leaves), and subtrees are merged with a numerically stable log-sum-exp accumulation.
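As a concrete illustration of the merge rule, a minimal Python sketch under the definitions above (not the NextStat internals):

```python
import numpy as np

def merge_log_weights(log_w_left: float, log_w_right: float) -> float:
    """Numerically stable log-sum-exp of two subtree log-weights."""
    m = max(log_w_left, log_w_right)
    return m + np.log(np.exp(log_w_left - m) + np.exp(log_w_right - m))

def take_right(rng: np.random.Generator, log_w_left: float, log_w_right: float) -> bool:
    """Select the right subtree's sample with probability
    w_right / (w_left + w_right). Applied recursively at every merge,
    this realizes the categorical draw over leaves."""
    return rng.random() < np.exp(log_w_right - merge_log_weights(log_w_left, log_w_right))
```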
2.2 Generalized U-turn criterion
We use the generalized no-U-turn criterion based on momentum sums (Betancourt 2017) rather than position differences. At each subtree merge, three checks (full trajectory and two cross-boundary) stop growth when the trajectory begins to reverse direction.
Three U-turn checks per subtree merge:

```
Check 1 (full tree):         is_turning(ρ_total, p_left, p_right)
Check 2 (init → junction):   is_turning(ρ_init + p_junction, p_start, p_junction)
Check 3 (junction → final):  is_turning(ρ_final + p_junction, p_junction, p_end)

If ANY check detects a U-turn → stop growing.
```
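A minimal sketch of the check itself, assuming an identity metric for simplicity (with a non-identity metric, the dot products would use M⁻¹p):

```python
import numpy as np

def is_turning(rho: np.ndarray, p_left: np.ndarray, p_right: np.ndarray) -> bool:
    """Generalized U-turn criterion over momentum sums: the span is
    turning when either boundary momentum has a non-positive projection
    onto rho, the momentum sum over the span being tested."""
    return bool(np.dot(rho, p_left) <= 0.0 or np.dot(rho, p_right) <= 0.0)
```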
Matches Stan's base_nuts.hpp ("demand satisfaction around/between merged subtrees").

3. Progressive Sampling in the Tree-Doubling Loop
NUTS builds a trajectory by repeatedly doubling the tree in a random direction. At each doubling step we have an existing tree (weight W_exist, current sample z_sample) and a new subtree (weight W_sub, candidate sample z_propose).
Within the recursive tree builder, merging two subtrees uses the standard multinomial update:

P(select subtree sample) = W_sub / (W_exist + W_sub)
At the top level (the outer doubling loop), we use progressive sampling for the join step:

P(select subtree sample) = min(1, W_sub / W_exist)
This update biases selection toward newly explored states, preventing the initial point from lingering as the proposal pool grows. The scheme is described by Betancourt (2017, Appendix A.3.2) and used in practical NUTS implementations including CmdStan.
3.1 Why this matters (equal-weight regime)
On well-conditioned posteriors with a well-adapted step size, subtree weight magnitudes are often comparable. In the equal-weight limit W_sub ≈ W_exist:

- Multinomial joining yields replace probability ≈ 0.5, so the current sample is frequently retained in the proposal pool.
- Progressive joining yields replace probability 1, aggressively moving the proposal forward as new trajectory mass is explored.
Key insight: this difference can dominate ESS/leapfrog while leaving other trajectory-level diagnostics unchanged (step sizes, accept rates, tree depths, total leapfrog counts). The change is invisible to standard HMC diagnostics; only ESS/leapfrog reveals the mismatch (see the sketch below).
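A numeric check of the two join rules, as a Python sketch under the weight definitions above (not the NextStat source):

```python
import numpy as np

def multinomial_join_prob(log_w_exist: float, log_w_sub: float) -> float:
    """P(replace) = W_sub / (W_exist + W_sub), computed in log space."""
    m = max(log_w_exist, log_w_sub)
    return np.exp(log_w_sub - m) / (np.exp(log_w_exist - m) + np.exp(log_w_sub - m))

def progressive_join_prob(log_w_exist: float, log_w_sub: float) -> float:
    """P(replace) = min(1, W_sub / W_exist), computed in log space."""
    return min(1.0, np.exp(log_w_sub - log_w_exist))

# Equal-weight limit (W_sub == W_exist): multinomial retains the current
# sample half the time; progressive always advances the proposal.
print(multinomial_join_prob(0.0, 0.0))  # 0.5
print(progressive_join_prob(0.0, 0.0))  # 1.0
```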
4. Diagnostics and the ESS/Leapfrog Decomposition
4.1 Health gates
We report performance only for runs that pass baseline health checks: divergence rate, treedepth saturation, rank-normalized R̂, minimum bulk/tail ESS, and minimum E-BFMI.
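The shape of such a gate, as an illustrative Python sketch: the divergence and R̂ thresholds match the values reported in Section 7, while the remaining thresholds here are common rules of thumb, not the exact NextStat gate values:

```python
def passes_health_gates(stats: dict) -> bool:
    """Illustrative health-gate check over per-run summary statistics.

    Field names and the ESS / E-BFMI thresholds are assumptions for this
    sketch (400 min ESS and E-BFMI > 0.2 are conventional warning levels).
    """
    return (
        stats["divergence_rate"] == 0.0
        and stats["treedepth_saturation_rate"] == 0.0
        and stats["max_r_hat"] < 1.01
        and stats["min_ess_bulk"] >= 400
        and stats["min_ess_tail"] >= 400
        and stats["min_e_bfmi"] > 0.2
    )
```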
4.2 ESS/sec

We report the minimum bulk ESS across parameters per second of sampling wall time:

ESS/sec = min bulk ESS / sampling wall time
4.3 ESS/leapfrog — the key diagnostic
Wall time conflates algorithmic sampling efficiency with per-leapfrog computational cost (autodiff overhead, process orchestration, language runtime, BLAS backend). To isolate algorithmic behavior, we normalize by integration work:

ESS/leapfrog = min bulk ESS / total post-warmup leapfrog steps
NextStat uses ESS/leapfrog as a first-line diagnostic for "algorithmic parity" before interpreting ESS/sec.
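Both quantities are straightforward to compute from an ArviZ InferenceData; a minimal sketch, assuming post-warmup leapfrog counts live in sample_stats under a field like "n_steps" (the field name varies by backend):

```python
import arviz as az

def ess_metrics(idata, sampling_seconds: float, leapfrog_field: str = "n_steps"):
    """Return (ESS/sec, ESS/leapfrog) using min bulk ESS across parameters."""
    ess = az.ess(idata, method="bulk")              # bulk ESS per parameter
    min_ess = float(ess.to_array().min())           # worst-case parameter
    total_leapfrogs = float(idata.sample_stats[leapfrog_field].sum())
    return min_ess / sampling_seconds, min_ess / total_leapfrogs
```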
5. Implementation Notes
NextStat NUTS implementation:
crates/ns-inference/src/nuts.rs — NUTS sampler, build_tree, nuts_transition
crates/ns-inference/src/adapt.rs — Windowed warmup, dual averaging, Welford estimators
crates/ns-inference/src/hmc.rs — Leapfrog integrator, Metric (Diag/DenseCholesky)
crates/ns-inference/src/chain.rs — Multi-chain orchestration (Rayon parallel)
crates/ns-inference/src/diagnostics.rs — R̂, ESS, E-BFMI computation

5.1 Code sketch (top-level join)
The key change is localized: the within-tree merge uses the standard multinomial update, while the outer join uses progressive selection:
```rust
// crates/ns-inference/src/nuts.rs
// Progressive top-level join: select the new subtree's sample with
// probability exp(logW_sub - logW_exist) clamped to [0, 1], i.e.
// min(1, W_sub / W_exist) (progressive sampling).
let p = prob_select_outer_progressive(tree.log_sum_weight, subtree.log_sum_weight);
if rng.random::<f64>() < p {
    tree.q = subtree.q;
}
```

Stan-parity details:
| Feature | NextStat | CmdStan |
|---|---|---|
| Step-size search | ε₀ = 1.0, double/halve | Same |
| Dual averaging | γ=0.05, t₀=10, κ=0.75 | Same |
| Proposal selection | Within-tree multinomial; top-level progressive | Same |
| Init strategy | Uniform(−2, 2), 100 retries | Same |
| Divergence threshold | \|ΔH\| > 1000 | Same |
| Step-size re-search | After every metric update | Same |
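For reference on the dual-averaging row, here is the textbook recursion from Hoffman & Gelman (2014) with the table's constants, as a Python sketch rather than the NextStat adapt.rs source:

```python
import numpy as np

class DualAveraging:
    """Nesterov dual averaging for step-size adaptation (Hoffman & Gelman 2014)."""

    def __init__(self, eps0: float = 1.0, delta: float = 0.8,
                 gamma: float = 0.05, t0: float = 10.0, kappa: float = 0.75):
        self.mu = np.log(10.0 * eps0)   # shrinkage target: log(10 * eps0)
        self.delta = delta              # target acceptance statistic
        self.gamma, self.t0, self.kappa = gamma, t0, kappa
        self.h_bar = 0.0                # running average of (delta - accept)
        self.log_eps_bar = 0.0          # averaged iterate, used after warmup
        self.m = 0

    def update(self, accept_stat: float) -> float:
        """One warmup iteration; returns the step size for the next draw."""
        self.m += 1
        eta = 1.0 / (self.m + self.t0)
        self.h_bar = (1.0 - eta) * self.h_bar + eta * (self.delta - accept_stat)
        log_eps = self.mu - np.sqrt(self.m) / self.gamma * self.h_bar
        w = self.m ** (-self.kappa)
        self.log_eps_bar = w * log_eps + (1.0 - w) * self.log_eps_bar
        return float(np.exp(log_eps))
```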
5.2 Python: reproduce the 3.21× case in one page
The v10 progressive sampling behavior is built into the core NUTS transition (no extra flag). The snippet below shows the shape of the Python API used by the benchmark harness.
```python
import json

import nextstat
import nextstat.bayes as bayes

# Load the benchmark dataset (x, y, group_idx, n_groups).
with open("hier_logistic_ri.json") as f:
    spec = json.load(f)

model = nextstat.ComposedGlmModel.logistic_regression(
    spec["x"],
    spec["y"],
    include_intercept=True,
    group_idx=spec["group_idx"],
    n_groups=spec["n_groups"],
    random_intercept_non_centered=True,
)

idata = bayes.sample(
    model,
    n_chains=4,
    n_warmup=1000,
    n_samples=2000,
    seed=42,
    target_accept=0.8,
)

quality = idata.attrs["nextstat_diagnostics"]["quality"]
print("min ESS (bulk):", quality["min_ess_bulk"])
print("max R-hat:", quality["max_r_hat"])
```

6. Benchmark Protocol
6.1 Cases
- GLM logistic regression (6 parameters) — well-conditioned baseline.
- Hierarchical logistic random intercept (22 parameters, non-centered) — funnel-like geometry.
- Eight Schools (10 parameters, non-centered) — classic hierarchical benchmark.
6.2 Settings
All cases:
4 chains, 1000 warmup, 2000 samples
Diagonal metric
target_accept = 0.8 (Eight Schools: 0.95)
Dataset seed: 12345 (fixed)
Chain seeds: {42, 0, 123}

6.3 Artifacts
Each run emits per-case JSON artifacts under nextstat.bayesian_benchmark_result.v1, a suite index JSON, and an aggregated multi-seed summary. All v10 results are checked into benchmarks/nextstat-public-benchmarks/suites/bayesian/results_v10/.
7. Results (February 2026)
Hardware: AMD EPYC 7502P (32C/64T), 128 GB RAM. CmdStan baseline: 2.38.0 (released 2026-01-13). All runs pass health gates: 0 divergences, max R̂ < 1.01.
7.1 ESS/sec (averaged across 3 seeds)
Bars show min bulk ESS/sec averaged across 3 seeds (Table 7.1). Gold = NextStat, gray = CmdStan.
| Model | NextStat | CmdStan | Ratio |
|---|---|---|---|
| GLM logistic (6p) | 29,895 ± 668 | 29,600 ± 1,971 | 1.01× |
| Hier logistic RI (22p) | 3,255 ± 360 | 1,015 ± 189 | 3.21× |
| Eight Schools NCP (10p) | 54,083 ± 4,902 | 26,644 ± 2,221 | 2.03× |
7.2 Health gates (worst across seeds)
| Model | Backend | Div | Max R̂ | Min E-BFMI |
|---|---|---|---|---|
| GLM logistic | nextstat | 0 | 1.001 | 1.011 |
| GLM logistic | cmdstanpy | 0 | 1.002 | — |
| Hier logistic RI | nextstat | 0 | 1.004 | 0.691 |
| Hier logistic RI | cmdstanpy | 0 | 1.002 | — |
| Eight Schools | nextstat | 0 | 1.003 | 0.883 |
| Eight Schools | cmdstanpy | 0 | 1.002 | — |
7.3 ESS/leapfrog (algorithmic efficiency)
This diagnostic isolates sampler efficiency per unit of Hamiltonian integration — the metric that revealed the v9 proposal-selection mismatch.
| Case | NextStat | CmdStan | Ratio |
|---|---|---|---|
| GLM logistic (6p) | 0.187 ± 0.006 | 0.164 ± 0.002 | 1.14× |
| Hier logistic RI (22p) | 0.044 ± 0.009 | 0.024 ± 0.005 | 1.82× |
| Eight Schools NCP (10p) | 0.052 ± 0.003 | 0.046 ± 0.004 | 1.14× |
ESS is computed with ArviZ bulk ESS, taking the minimum across parameters (reported as arviz_ess_bulk_min; supplementary, not part of the v1 public schemas). Leapfrog totals are summed over post-warmup draws.
7.4 ESS/sec decomposition (implied)
Using the identity (ESS/sec ratio) ≈ (ESS/LF ratio) × (LF/sec ratio). For the hierarchical case, 1.82 × 1.76 ≈ 3.20, consistent with the measured 3.21×.
| Case | ESS/sec ratio | ESS/LF ratio | Implied LF/sec |
|---|---|---|---|
| GLM logistic (6p) | 1.01× | 1.14× | 0.88× |
| Hier logistic RI (22p) | 3.21× | 1.82× | 1.76× |
| Eight Schools NCP (10p) | 2.03× | 1.14× | 1.78× |
The fix: in v9, ESS/leapfrog on GLM was 0.112 (57% of CmdStan). Switching the top-level join from multinomial to progressive sampling doubled it to 0.228 (117% of CmdStan). Tree depths, step sizes, and accept rates were unchanged; only ESS/leapfrog detected the issue.
8. Discussion
8.1 What changed in v10
In v9, trajectory-level diagnostics (step size, acceptance, tree depth) matched closely, but ESS/leapfrog was substantially lower on a GLM posterior. This pinpointed a proposal-selection issue rather than an integration or adaptation issue. Switching the top-level join to progressive sampling substantially improved ESS/leapfrog and improved ESS/sec across all tested geometries.
8.2 Threats to validity
Even with a pinned protocol, cross-framework comparisons remain sensitive to:
- Gradient implementation differences (analytical vs AD).
- Process orchestration overhead (single-process vs multi-process chains).
- CPU vectorization and BLAS configuration.
- Parameterization (centered vs non-centered).
We treat these as part of the experimental context and publish full artifacts so others can rerun the exact protocol.
9. Conclusion
ESS/leapfrog is an effective diagnostic for separating algorithmic sampling quality from wall-time implementation costs. Under a fixed multi-seed benchmark protocol, a single proposal-selection change (top-level progressive sampling) substantially improved sampler efficiency and yielded strong ESS/sec results on hierarchical posteriors.
For full reproducibility, the pinned protocol, JSON artifacts, and figures are checked into benchmarks/nextstat-public-benchmarks/suites/bayesian/results_v10/.
- Try NextStat now: run NUTS v10 instantly in your browser: WASM Playground.
- Use it in Python: pip install nextstat (see Installation).
- Support open-source: star NextStat on GitHub.
References
- Hoffman, M. D., & Gelman, A. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. JMLR 15(1), 1593–1623 (2014).
- Betancourt, M. A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv:1701.02434 (2017).
- Carpenter, B., et al. Stan: A Probabilistic Programming Language. Journal of Statistical Software 76(1) (2017).
- Stan Development Team. Stan Reference Manual, Version 2.38 (2026).
- Stan Development Team. Release of CmdStan 2.38 (January 13, 2026).
- Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. Rank-normalization, folding, and localization: An improved R̂ for assessing convergence of MCMC. Bayesian Analysis 16(2), 667–718 (2021).
- Langenbruch, C. MoreFit: A More Optimised, Rapid and Efficient Fit. arXiv:2505.12414 (2025).
