NUTS v10: Progressive Sampling, ESS/Leapfrog, and Reproducible Benchmarks
We report a NUTS implementation in NextStat and a benchmark protocol that makes Bayesian performance numbers reproducible. We introduce ESS/leapfrog, a diagnostic that isolates sampler efficiency from wall-time effects, and show how progressive sampling at the top-level tree join roughly doubles algorithmic efficiency on GLM and hierarchical posteriors, yielding up to 3.21× ESS/sec on a hierarchical non-centered benchmark.
February 2026 · ~15 min read
TL;DR: A single algorithmic change, progressive sampling at the top-level tree join, materially improved sampler efficiency (ESS/leapfrog) while leaving trajectory-level diagnostics unchanged. Result: parity on flat GLM geometry and up to 3.21× higher ESS/sec on a hierarchical non-centered benchmark, under a strict multi-seed protocol.
1. Introduction
Bayesian performance claims are only meaningful when the posterior and the inference protocol are pinned. In practice, cross-framework comparisons regularly fail because the model differs subtly, adaptation defaults differ, diagnostics are computed differently, or wall-time reflects process orchestration rather than sampler quality.
NextStat addresses these issues with:
- An explicit posterior contract (model + priors + transforms).
- A sampler implementation with documented settings.
- A benchmark harness that emits schema-valid artifacts for every run, including failures.
Contributions
- A benchmark protocol and artifact format for Bayesian ESS/sec comparisons with explicit health gates and multi-seed stability.
- An efficiency decomposition using ESS/leapfrog to isolate algorithmic sampling quality from wall-time implementation effects.
- A Stan-compatible NUTS v10 implementation using progressive sampling in the top-level tree-doubling join, materially improving ESS/leapfrog on GLM and hierarchical posteriors.
2. Background: NUTS With Multinomial Trajectory Sampling
We assume Euclidean-metric HMC with position q ∈ ℝᴰ and momentum p ∈ ℝᴰ, with Hamiltonian

H(q, p) = −log π(q) + ½ pᵀ M⁻¹ p,

where π(q) is the target density and M is the mass matrix (diagonal or dense).
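For reference, here is one leapfrog step under this Hamiltonian, as a minimal Python sketch assuming a diagonal metric (NextStat's actual integrator lives in hmc.rs; the function names here are illustrative):

```python
import numpy as np

def leapfrog(q, p, grad_logp, eps, inv_mass_diag):
    """One leapfrog step for H(q,p) = -log pi(q) + 0.5 p^T M^{-1} p.

    grad_logp(q) returns the gradient of log pi; inv_mass_diag is the
    diagonal of M^{-1} (diagonal-metric sketch).
    """
    p = p + 0.5 * eps * grad_logp(q)    # half-step momentum
    q = q + eps * inv_mass_diag * p     # full-step position
    p = p + 0.5 * eps * grad_logp(q)    # half-step momentum
    return q, p
```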
2.1 Multinomial weights
In the multinomial variant of NUTS, each leaf is assigned an unnormalized weight w_i = exp(−ΔH_i). The within-tree proposal is selected proportionally to these weights (a categorical draw over the leaves), and subtrees are merged with a numerically stable log-sum-exp accumulation.
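As a concrete illustration of the merge rule, a minimal Python sketch under the definitions above (not the NextStat internals):

```python
import numpy as np

def merge_log_weights(log_w_left: float, log_w_right: float) -> float:
    """Numerically stable log-sum-exp of two subtree log-weights."""
    m = max(log_w_left, log_w_right)
    return m + np.log(np.exp(log_w_left - m) + np.exp(log_w_right - m))

def take_right(rng: np.random.Generator, log_w_left: float, log_w_right: float) -> bool:
    """Select the right subtree's sample with probability
    w_right / (w_left + w_right). Applied recursively at every merge,
    this realizes the categorical draw over leaves."""
    return rng.random() < np.exp(log_w_right - merge_log_weights(log_w_left, log_w_right))
```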
2.2 Generalized U-turn criterion
We use the generalized no-U-turn criterion based on momentum sums (Betancourt 2017) rather than position differences. At each subtree merge, three checks (full trajectory and two cross-boundary) stop growth when the trajectory begins to reverse direction.
Three U-turn checks per subtree merge:

```
Check 1 (full tree):         is_turning(ρ_total, p_left, p_right)
Check 2 (init → junction):   is_turning(ρ_init + p_junction, p_start, p_junction)
Check 3 (junction → final):  is_turning(ρ_final + p_junction, p_junction, p_end)

If ANY check detects a U-turn → stop growing.
```
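A minimal sketch of the check itself, assuming an identity metric for simplicity (with a non-identity metric, the dot products would use M⁻¹p):

```python
import numpy as np

def is_turning(rho: np.ndarray, p_left: np.ndarray, p_right: np.ndarray) -> bool:
    """Generalized U-turn criterion over momentum sums: the span is
    turning when either boundary momentum has a non-positive projection
    onto rho, the momentum sum over the span being tested."""
    return bool(np.dot(rho, p_left) <= 0.0 or np.dot(rho, p_right) <= 0.0)
```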
Matches Stan's base_nuts.hpp ("demand satisfaction around/between merged subtrees").

3. Progressive Sampling in the Tree-Doubling Loop
NUTS builds a trajectory by repeatedly doubling the tree in a random direction. At each doubling step we have an existing tree (weight W_exist, current sample z_sample) and a new subtree (weight W_sub, candidate sample z_propose).
Within the recursive tree builder, merging two subtrees uses the standard multinomial update:

P(select subtree sample) = W_sub / (W_exist + W_sub)
At the top level (the outer doubling loop), we use progressive sampling for the join step:

P(select subtree sample) = min(1, W_sub / W_exist)
This update biases selection toward newly explored states, preventing the initial point from lingering as the proposal pool grows. The scheme is described by Betancourt (2017, Appendix A.3.2) and used in practical NUTS implementations including CmdStan.
3.1 Why this matters (equal-weight regime)
On well-conditioned posteriors with a well-adapted step size, subtree weight magnitudes are often comparable. In the equal-weight limit W_sub ≈ W_exist:

- Multinomial joining yields replace probability ≈ 0.5, so the current sample is frequently retained in the proposal pool.
- Progressive joining yields replace probability 1, aggressively moving the proposal forward as new trajectory mass is explored.
Key insight: this difference can dominate ESS/leapfrog while leaving other trajectory-level diagnostics unchanged (step sizes, accept rates, tree depths, total leapfrog counts). The change is invisible to standard HMC diagnostics; only ESS/leapfrog reveals the mismatch (see the sketch below).
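A numeric check of the two join rules, as a Python sketch under the weight definitions above (not the NextStat source):

```python
import numpy as np

def multinomial_join_prob(log_w_exist: float, log_w_sub: float) -> float:
    """P(replace) = W_sub / (W_exist + W_sub), computed in log space."""
    m = max(log_w_exist, log_w_sub)
    return np.exp(log_w_sub - m) / (np.exp(log_w_exist - m) + np.exp(log_w_sub - m))

def progressive_join_prob(log_w_exist: float, log_w_sub: float) -> float:
    """P(replace) = min(1, W_sub / W_exist), computed in log space."""
    return min(1.0, np.exp(log_w_sub - log_w_exist))

# Equal-weight limit (W_sub == W_exist): multinomial retains the current
# sample half the time; progressive always advances the proposal.
print(multinomial_join_prob(0.0, 0.0))  # 0.5
print(progressive_join_prob(0.0, 0.0))  # 1.0
```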
4. Diagnostics and the ESS/Leapfrog Decomposition
4.1 Health gates
We report performance only for runs that pass baseline health checks: divergence rate, treedepth saturation, rank-normalized R̂, minimum bulk/tail ESS, and minimum E-BFMI.
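The shape of such a gate, as an illustrative Python sketch: the divergence and R̂ thresholds match the values reported in Section 7, while the remaining thresholds here are common rules of thumb, not the exact NextStat gate values:

```python
def passes_health_gates(stats: dict) -> bool:
    """Illustrative health-gate check over per-run summary statistics.

    Field names and the ESS / E-BFMI thresholds are assumptions for this
    sketch (400 min ESS and E-BFMI > 0.2 are conventional warning levels).
    """
    return (
        stats["divergence_rate"] == 0.0
        and stats["treedepth_saturation_rate"] == 0.0
        and stats["max_r_hat"] < 1.01
        and stats["min_ess_bulk"] >= 400
        and stats["min_ess_tail"] >= 400
        and stats["min_e_bfmi"] > 0.2
    )
```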
4.2 ESS/sec

We report the minimum bulk ESS across parameters per second of sampling wall time:

ESS/sec = min bulk ESS / sampling wall time
4.3 ESS/leapfrog — the key diagnostic
Wall time conflates algorithmic sampling efficiency with per-leapfrog computational cost (autodiff overhead, process orchestration, language runtime, BLAS backend). To isolate algorithmic behavior, we normalize by integration work:

ESS/leapfrog = min bulk ESS / total post-warmup leapfrog steps
NextStat uses ESS/leapfrog as a first-line diagnostic for "algorithmic parity" before interpreting ESS/sec.
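Both quantities are straightforward to compute from an ArviZ InferenceData; a minimal sketch, assuming post-warmup leapfrog counts live in sample_stats under a field like "n_steps" (the field name varies by backend):

```python
import arviz as az

def ess_metrics(idata, sampling_seconds: float, leapfrog_field: str = "n_steps"):
    """Return (ESS/sec, ESS/leapfrog) using min bulk ESS across parameters."""
    ess = az.ess(idata, method="bulk")              # bulk ESS per parameter
    min_ess = float(ess.to_array().min())           # worst-case parameter
    total_leapfrogs = float(idata.sample_stats[leapfrog_field].sum())
    return min_ess / sampling_seconds, min_ess / total_leapfrogs
```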
5. Implementation Notes
NextStat NUTS implementation:
crates/ns-inference/src/nuts.rs — NUTS sampler, build_tree, nuts_transition
crates/ns-inference/src/adapt.rs — Windowed warmup, dual averaging, Welford estimators
crates/ns-inference/src/hmc.rs — Leapfrog integrator, Metric (Diag/DenseCholesky)
crates/ns-inference/src/chain.rs — Multi-chain orchestration (Rayon parallel)
crates/ns-inference/src/diagnostics.rs — R̂, ESS, E-BFMI computation

5.1 Code sketch (top-level join)
The key change is localized: the within-tree merge uses the standard multinomial update, while the outer join uses progressive selection:
```rust
// crates/ns-inference/src/nuts.rs
// Progressive top-level join: select the new subtree's sample with
// probability exp(logW_sub - logW_exist) clamped to [0, 1], i.e.
// min(1, W_sub / W_exist) (progressive sampling).
let p = prob_select_outer_progressive(tree.log_sum_weight, subtree.log_sum_weight);
if rng.random::<f64>() < p {
    tree.q = subtree.q;
}
```

Stan-parity details:
| Feature | NextStat | CmdStan |
|---|---|---|
| Step-size search | ε₀ = 1.0, double/halve | Same |
| Dual averaging | γ=0.05, t₀=10, κ=0.75 | Same |
| Proposal selection | Within-tree multinomial; top-level progressive | Same |
| Init strategy | Uniform(−2, 2), 100 retries | Same |
| Divergence threshold | \|ΔH\| > 1000 | Same |
| Step-size re-search | After every metric update | Same |
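For reference on the dual-averaging row, here is the textbook recursion from Hoffman & Gelman (2014) with the table's constants, as a Python sketch rather than the NextStat adapt.rs source:

```python
import numpy as np

class DualAveraging:
    """Nesterov dual averaging for step-size adaptation (Hoffman & Gelman 2014)."""

    def __init__(self, eps0: float = 1.0, delta: float = 0.8,
                 gamma: float = 0.05, t0: float = 10.0, kappa: float = 0.75):
        self.mu = np.log(10.0 * eps0)   # shrinkage target: log(10 * eps0)
        self.delta = delta              # target acceptance statistic
        self.gamma, self.t0, self.kappa = gamma, t0, kappa
        self.h_bar = 0.0                # running average of (delta - accept)
        self.log_eps_bar = 0.0          # averaged iterate, used after warmup
        self.m = 0

    def update(self, accept_stat: float) -> float:
        """One warmup iteration; returns the step size for the next draw."""
        self.m += 1
        eta = 1.0 / (self.m + self.t0)
        self.h_bar = (1.0 - eta) * self.h_bar + eta * (self.delta - accept_stat)
        log_eps = self.mu - np.sqrt(self.m) / self.gamma * self.h_bar
        w = self.m ** (-self.kappa)
        self.log_eps_bar = w * log_eps + (1.0 - w) * self.log_eps_bar
        return float(np.exp(log_eps))
```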
5.2 Python: reproduce the 3.21× case in one page
The v10 progressive sampling behavior is built into the core NUTS transition (no extra flag). The snippet below shows the shape of the Python API used by the benchmark harness.
```python
import json

import nextstat
import nextstat.bayes as bayes

# Load the benchmark dataset (x, y, group_idx, n_groups).
with open("hier_logistic_ri.json") as f:
    spec = json.load(f)

model = nextstat.ComposedGlmModel.logistic_regression(
    spec["x"],
    spec["y"],
    include_intercept=True,
    group_idx=spec["group_idx"],
    n_groups=spec["n_groups"],
    random_intercept_non_centered=True,
)

idata = bayes.sample(
    model,
    n_chains=4,
    n_warmup=1000,
    n_samples=2000,
    seed=42,
    target_accept=0.8,
)

quality = idata.attrs["nextstat_diagnostics"]["quality"]
print("min ESS (bulk):", quality["min_ess_bulk"])
print("max R-hat:", quality["max_r_hat"])
```

6. Benchmark Protocol
6.1 Cases
- GLM logistic regression (6 parameters) — well-conditioned baseline.
- Hierarchical logistic random intercept (22 parameters, non-centered) — funnel-like geometry.
- Eight Schools (10 parameters, non-centered) — classic hierarchical benchmark.
6.2 Settings
All cases:
4 chains, 1000 warmup, 2000 samples
Diagonal metric
target_accept = 0.8 (Eight Schools: 0.95)
Dataset seed: 12345 (fixed)
Chain seeds: {42, 0, 123}

6.3 Artifacts
Each run emits per-case JSON artifacts under nextstat.bayesian_benchmark_result.v1, a suite index JSON, and an aggregated multi-seed summary. All v10 results are checked into benchmarks/nextstat-public-benchmarks/suites/bayesian/results_v10/.
7. Results (February 2026)
Hardware: AMD EPYC 7502P (32C/64T), 128 GB RAM. CmdStan baseline: 2.38.0 (released 2026-01-13). All runs pass health gates: 0 divergences, max R̂ < 1.01.
7.1 ESS/sec (averaged across 3 seeds)
Bars show min bulk ESS/sec averaged across 3 seeds (Table 7.1). Gold = NextStat, gray = CmdStan.
| Model | NextStat | CmdStan | Ratio |
|---|---|---|---|
| GLM logistic (6p) | 29,895 ± 668 | 29,600 ± 1,971 | 1.01× |
| Hier logistic RI (22p) | 3,255 ± 360 | 1,015 ± 189 | 3.21× |
| Eight Schools NCP (10p) | 54,083 ± 4,902 | 26,644 ± 2,221 | 2.03× |
7.2 Health gates (worst across seeds)
| Model | Backend | Div | Max R̂ | Min E-BFMI |
|---|---|---|---|---|
| GLM logistic | nextstat | 0 | 1.001 | 1.011 |
| GLM logistic | cmdstanpy | 0 | 1.002 | — |
| Hier logistic RI | nextstat | 0 | 1.004 | 0.691 |
| Hier logistic RI | cmdstanpy | 0 | 1.002 | — |
| Eight Schools | nextstat | 0 | 1.003 | 0.883 |
| Eight Schools | cmdstanpy | 0 | 1.002 | — |
7.3 ESS/leapfrog (algorithmic efficiency)
This diagnostic isolates sampler efficiency per unit of Hamiltonian integration — the metric that revealed the v9 proposal-selection mismatch.
| Case | NextStat | CmdStan | Ratio |
|---|---|---|---|
| GLM logistic (6p) | 0.187 ± 0.006 | 0.164 ± 0.002 | 1.14× |
| Hier logistic RI (22p) | 0.044 ± 0.009 | 0.024 ± 0.005 | 1.82× |
| Eight Schools NCP (10p) | 0.052 ± 0.003 | 0.046 ± 0.004 | 1.14× |
ESS is computed with ArviZ bulk ESS, taking the minimum across parameters (reported as arviz_ess_bulk_min; supplementary, not part of the v1 public schemas). Leapfrog totals are summed over post-warmup draws.
7.4 ESS/sec decomposition (implied)
Using the identity (ESS/sec ratio) ≈ (ESS/LF ratio) × (LF/sec ratio). For the hierarchical case, 1.82 × 1.76 ≈ 3.20, consistent with the measured 3.21×.
| Case | ESS/sec ratio | ESS/LF ratio | Implied LF/sec |
|---|---|---|---|
| GLM logistic (6p) | 1.01× | 1.14× | 0.88× |
| Hier logistic RI (22p) | 3.21× | 1.82× | 1.76× |
| Eight Schools NCP (10p) | 2.03× | 1.14× | 1.78× |
The fix: in v9, ESS/leapfrog on GLM was 0.112 (57% of CmdStan). Switching the top-level join from multinomial to progressive sampling doubled it to 0.228 (117% of CmdStan). Tree depths, step sizes, and accept rates were unchanged; only ESS/leapfrog detected the issue.
8. Discussion
8.1 What changed in v10
In v9, trajectory-level diagnostics (step size, acceptance, tree depth) matched closely, but ESS/leapfrog was substantially lower on a GLM posterior. This pinpointed a proposal-selection issue rather than an integration or adaptation issue. Switching the top-level join to progressive sampling substantially improved ESS/leapfrog and improved ESS/sec across all tested geometries.
8.2 Threats to validity
Even with a pinned protocol, cross-framework comparisons remain sensitive to:
- Gradient implementation differences (analytical vs AD).
- Process orchestration overhead (single-process vs multi-process chains).
- CPU vectorization and BLAS configuration.
- Parameterization (centered vs non-centered).
We treat these as part of the experimental context and publish full artifacts so others can rerun the exact protocol.
9. Conclusion
ESS/leapfrog is an effective diagnostic for separating algorithmic sampling quality from wall-time implementation costs. Under a fixed multi-seed benchmark protocol, a single proposal-selection change (top-level progressive sampling) substantially improved sampler efficiency and yielded strong ESS/sec results on hierarchical posteriors.
For full reproducibility, the pinned protocol, JSON artifacts, and figures are checked into benchmarks/nextstat-public-benchmarks/suites/bayesian/results_v10/.
- Try NextStat now: run NUTS v10 instantly in your browser: WASM Playground.
- Use it in Python: pip install nextstat (see Installation).
- Support open-source: star NextStat on GitHub.
References
- Hoffman, M. D., & Gelman, A. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. JMLR 15(1), 1593–1623 (2014).
- Betancourt, M. A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv:1701.02434 (2017).
- Carpenter, B., et al. Stan: A Probabilistic Programming Language. Journal of Statistical Software 76(1) (2017).
- Stan Development Team. Stan Reference Manual, Version 2.38 (2026).
- Stan Development Team. Release of CmdStan 2.38 (January 13, 2026).
- Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. Rank-normalization, folding, and localization: An improved R̂ for assessing convergence of MCMC. Bayesian Analysis 16(2), 667–718 (2021).
- Langenbruch, C. MoreFit: A More Optimised, Rapid and Efficient Fit. arXiv:2505.12414 (2025).
