
# pyhf Parity Contract

NextStat validates numerical correctness against pyhf (NumPy backend, f64) as the canonical oracle. A 7-tier tolerance hierarchy governs the expected agreement from per-bin arithmetic (1e-12) to toy ensemble statistics (0.05).

## Evaluation Modes

| Mode | Summation | Backend | Threads | Use Case |
|---|---|---|---|---|
| Parity | Kahan compensated | SIMD (Accelerate disabled) | 1 (forced) | Validation vs pyhf |
| Fast | Naive | SIMD / Accelerate / CUDA | Rayon (auto) | Production inference |

## Activating Parity Mode

```bash
# CLI
nextstat fit --input workspace.json --parity
```

```python
# Python
import nextstat
nextstat.set_eval_mode("parity")
```

```rust
// Rust
ns_compute::set_eval_mode(ns_compute::EvalMode::Parity);
```

Parity mode sets a process-wide atomic flag: Apple Accelerate is disabled, the thread count is forced to 1, and Kahan compensated summation is used for the Poisson NLL.
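
For reference, Kahan summation carries a compensation term that recovers the low-order bits lost at each addition. A minimal Python sketch of the technique (illustrative only; the production reduction lives in ns_compute, in Rust):

```python
def kahan_sum(values):
    """Kahan compensated summation.

    A plain left-to-right sum of ~1000 NLL terms drops low-order
    bits at each step; the compensation term recovers them, keeping
    the result within a few ULPs of the correctly rounded sum.
    """
    total = 0.0
    comp = 0.0  # running compensation (lost low-order bits)
    for x in values:
        y = x - comp            # re-apply what was lost last step
        t = total + y           # big + small: low bits of y drop here
        comp = (t - total) - y  # algebraically 0; numerically the loss
        total = t
    return total
```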

## 7-Tier Tolerance Hierarchy

### Tier 1 — Per-Bin Expected Data (Pure Arithmetic)

| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| abs(ns_bin − pyhf_bin) | 1e-12 | 1e-10 |

Identical f64 arithmetic with no reduction over bins. Same operations, same order → identical results.

### Tier 2 — Expected Data Vector (Accumulated)

| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| max(abs(ns − pyhf)) over all bins | 1e-8 | 1e-8 |
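
Both tiers can be spot-checked against pyhf's public API. A sketch on a toy workspace; the NextStat side is a stand-in here, since only the pyhf calls are pinned down by this page:

```python
import numpy as np
import pyhf

pyhf.set_backend("numpy")  # the oracle backend: NumPy, f64

model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0], bkg=[50.0], bkg_uncertainty=[7.0]
)
pars = model.config.suggested_init()

# Oracle expectation, main-channel bins only (auxiliary data stripped).
pyhf_bins = np.asarray(model.expected_data(pars, include_auxdata=False))

ns_bins = pyhf_bins.copy()  # placeholder for the NextStat expectation

diff = np.abs(ns_bins - pyhf_bins)
assert (diff < 1e-12).all()  # Tier 1 (Parity): every bin individually
assert diff.max() < 1e-8     # Tier 2: max over the whole vector
```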

### Tier 3 — NLL Value (Scalar Reduction)

| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| abs(2·ns_nll − pyhf_twice_nll) | atol 1e-8 | rtol 1e-6 |

Summing λ − n·ln(λ) + ln(n!) over ~1000 bins accumulates rounding at ~1e-10. Kahan summation in Parity mode reduces this to ~1e-14.
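
The pyhf side of this metric is exactly what pyhf.infer.mle.twice_nll reports, so the check reduces to a one-line comparison. A sketch on the same toy model (the NextStat NLL is a placeholder):

```python
import pyhf

pyhf.set_backend("numpy")
model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0], bkg=[50.0], bkg_uncertainty=[7.0]
)
data = [53.0] + list(model.config.auxdata)
pars = model.config.suggested_init()

# Oracle: pyhf reports -2 ln L directly.
pyhf_twice_nll = pyhf.infer.mle.twice_nll(pars, data, model).item()

ns_nll = 0.5 * pyhf_twice_nll  # placeholder for the NextStat NLL

assert abs(2.0 * ns_nll - pyhf_twice_nll) < 1e-8  # Tier 3, Parity atol
```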

### Tier 4 — Gradient (AD vs Finite-Difference)

| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| abs(ns_grad_i − pyhf_fd_i) | atol 1e-6, rtol 1e-4 | same |

NextStat uses reverse-mode AD (exact). pyhf uses central finite-difference (h=1e-5). The tolerance reflects FD noise, not NextStat error.
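
The finite-difference reference is easy to reproduce; a sketch with the step quoted above (h = 1e-5), differentiating twice_nll so it matches the Tier 3 convention (the AD gradient is a placeholder):

```python
import numpy as np
import pyhf

pyhf.set_backend("numpy")
model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0], bkg=[50.0], bkg_uncertainty=[7.0]
)
data = [53.0] + list(model.config.auxdata)
pars = np.asarray(model.config.suggested_init())

def objective(p):
    return pyhf.infer.mle.twice_nll(p, data, model).item()

def fd_gradient(f, x, h=1e-5):
    """Central difference: O(h^2) truncation error plus roundoff of
    order eps/h per component -- the 'FD noise' in the tolerance."""
    g = np.empty_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[i] += h
        xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2.0 * h)
    return g

pyhf_fd = fd_gradient(objective, pars)
ns_grad = pyhf_fd.copy()  # placeholder for the reverse-mode AD gradient
assert np.allclose(ns_grad, pyhf_fd, atol=1e-6, rtol=1e-4)
```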

### Tier 5 — Best-Fit Parameters (Optimizer Surface)

| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| abs(ns_hat_i − pyhf_hat_i) | 2e-4 | 2e-4 |

Near the NLL minimum the surface is flat. Small differences in gradient accuracy or stopping criteria shift the minimum by O(1e-4) in parameter space while changing NLL by O(1e-10). On large models (>100 params), L-BFGS-B often finds a lower NLL than SLSQP.
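
A sketch of the corresponding check, using pyhf's MLE fit as the reference point (the NextStat fit is again a placeholder):

```python
import numpy as np
import pyhf

pyhf.set_backend("numpy")
model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0], bkg=[50.0], bkg_uncertainty=[7.0]
)
data = [53.0] + list(model.config.auxdata)

pyhf_hat = np.asarray(pyhf.infer.mle.fit(data, model))  # scipy-based fit
ns_hat = pyhf_hat.copy()  # placeholder for the NextStat best fit

# Tier 5 gates agreement in parameter space, not in NLL.
assert np.max(np.abs(ns_hat - pyhf_hat)) < 2e-4
```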

### Tier 6 — Parameter Uncertainties (Hessian Sensitivity)

| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| abs(ns_unc_i − pyhf_unc_i) | 5e-4 | 5e-4 |

Uncertainties derive from the inverse Hessian at the best fit, so they inherit the Tier 5 shift in the minimum plus extra sensitivity to the local curvature estimate; hence the looser 5e-4.

### Tier 7 — Toy Ensemble (Statistical)

| Metric | Tolerance | Rationale |
|---|---|---|
| Δmean(pull_mu) | 0.05 | Statistical noise from finite N_toys |
| Δstd(pull_mu) | 0.05 | Same |
| Δcoverage_1σ | 0.03 | Binomial noise |
| Δcoverage | 0.05 | General coverage |
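
The Tier 7 quantities are plain ensemble summaries. A numpy sketch of the definitions assumed here, with synthetic per-toy fit outputs standing in for the real NextStat and pyhf toy ensembles:

```python
import numpy as np

rng = np.random.default_rng(0)
n_toys, mu_true = 1000, 1.0

def pull_summary(mu_hat, sigma_hat):
    """Mean/width of the pull distribution and 1-sigma coverage."""
    pull = (mu_hat - mu_true) / sigma_hat
    coverage = np.mean(np.abs(mu_hat - mu_true) < sigma_hat)
    return pull.mean(), pull.std(ddof=1), coverage

# Placeholders for the two ensembles' per-toy fit results.
ns = pull_summary(rng.normal(mu_true, 0.3, n_toys), np.full(n_toys, 0.3))
ph = pull_summary(rng.normal(mu_true, 0.3, n_toys), np.full(n_toys, 0.3))

d_mean = abs(ns[0] - ph[0])  # Δmean(pull_mu), gated at 0.05
d_std = abs(ns[1] - ph[1])   # Δstd(pull_mu), gated at 0.05
d_cov = abs(ns[2] - ph[2])   # Δcoverage_1σ, gated at 0.03
```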

## Visual Summary

```text
Tighter ◄──────────────────────────────────────── Looser

1e-12     1e-10     1e-8      1e-6      1e-4      1e-2
  │         │         │         │         │         │
  ├─ Tier 1: per-bin expected_data (1e-12)
  │         │         │         │
  │         ├─ Tier 2: expected_data vector (1e-8)
  │         │         │         │
  │         │         ├─ Tier 3: NLL value (atol 1e-8, rtol 1e-6)
  │         │         │         │
  │         │         │         ├─ Tier 4: gradient (atol 1e-6, rtol 1e-4)
  │         │         │         │         │
  │         │         │         │         ├─ Tier 5: best-fit params (2e-4)
  │         │         │         │         ├─ Tier 6: uncertainties (5e-4)
  │         │         │         │         │         │
  │         │         │         │         │         ├─ Tier 7: toys (0.03–0.05)
```

## Architecture: Spec vs Speed

```text
┌──────────────────────────────────────────────────────┐
│  pyhf NumPy = SPEC (oracle for CI/regressions)       │
│  expected_data, nll, grad_nll at fixed points        │
└───────────────────┬──────────────────────────────────┘
                    │ tolerance contract (7 tiers)
┌───────────────────┴──────────────────────────────────┐
│  PreparedModel = COMPILED MODEL                      │
│  workspace JSON → flat layout: observed_flat,        │
│  ln_factorials, obs_mask, modifier CSR indices       │
├──────────────────┬───────────────────────────────────┤
│  Parity mode     │  Fast mode                        │
│  Kahan, 1 thread │  Accelerate/CUDA, Rayon           │
│  CI-gated        │  production                       │
└──────────────────┴───────────────────────────────────┘
```
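
For intuition, the data-dependent constants in the diagram can be precomputed in a few lines. A hedged numpy sketch: the names follow the diagram, the reading of obs_mask (bins with n > 0, where the n·ln λ term exists) is an assumption, and the CSR modifier indexing is omitted entirely:

```python
import numpy as np
from scipy.special import gammaln

def prepare_observed(channels):
    """Flatten per-channel observed counts into the constant arrays
    a compiled model can carry across every NLL evaluation."""
    observed_flat = np.concatenate(
        [np.asarray(c, dtype=np.float64) for c in channels]
    )
    # ln(n!) = gammaln(n + 1): fixed per dataset, so computed once.
    ln_factorials = gammaln(observed_flat + 1.0)
    # Assumed meaning: only bins with n > 0 contribute n*ln(lambda).
    obs_mask = observed_flat > 0.0
    return observed_flat, ln_factorials, obs_mask

obs, lnf, mask = prepare_observed([[53.0, 47.0], [12.0, 0.0]])
```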

## Performance

| Operation | Throughput / Speedup |
|---|---|
| PreparedModel NLL eval (simple) | ~200K evals/sec |
| Kahan overhead | <5% |
| Batch toys (1000, CPU Rayon) | ~50× vs serial |
| Batch toys (1000, CUDA) | ~200× vs serial |
| TTree parse + histogram fill | ~8.5× vs uproot+numpy |
| Ranking (16 NPs, autodiff) | ~4× vs pyhf FD |

## ROOT Cross-Validation

| Fixture | NS vs pyhf max \|dq(μ)\| | NS vs ROOT max \|dq(μ)\| | ROOT status |
|---|---|---|---|
| xmlimport | 1e-7 | 0.051 | 0 (ok) |
| multichannel | 4e-7 | 3.4e-8 | 0 (ok) |
| coupled_histosys | 5e-6 | 22.5 | -1 (FAILED) |

pyhf is the specification reference. ROOT deviations are informational, not gating. Full analysis: ROOT/HistFactory Comparison.

## CI Integration

```bash
# Full parity suite (deterministic, single-thread)
pytest tests/python/test_pyhf_validation.py \
       tests/python/test_expected_data_parity.py -v

# Golden report generation
NEXTSTAT_GOLDEN_REPORT_DIR=reports/ \
  pytest tests/python/test_expected_data_parity.py -v
```