pyhf Parity Contract
NextStat validates numerical correctness against pyhf (NumPy backend, f64) as the canonical oracle. A 7-tier tolerance hierarchy governs the expected agreement, ranging from per-bin arithmetic (1e-12) to toy-ensemble statistics (0.05).
Evaluation Modes
| Mode | Summation | Backend | Threads | Use Case |
|---|---|---|---|---|
| Parity | Kahan compensated | SIMD (Accelerate disabled) | 1 (forced) | Validation vs pyhf |
| Fast | Naive | SIMD / Accelerate / CUDA | Rayon (auto) | Production inference |
Activating Parity Mode
```bash
# CLI
nextstat fit --input workspace.json --parity
```

```python
# Python
import nextstat
nextstat.set_eval_mode("parity")
```

```rust
// Rust
ns_compute::set_eval_mode(ns_compute::EvalMode::Parity);
```
Parity mode sets a process-wide atomic flag: Apple Accelerate is disabled, the thread count is forced to 1, and Kahan compensated summation is used for the Poisson NLL.
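A minimal Python sketch of the same pattern, a single mode flag consulted by every kernel call; the names below are illustrative, not NextStat's internals (the real flag is a Rust atomic in ns_compute):

```python
import math
import threading

# Illustrative sketch of a process-wide eval-mode flag.
# NOT NextStat's implementation (the real flag is a Rust atomic in ns_compute).
_EVAL_MODE = "fast"
_LOCK = threading.Lock()

def set_eval_mode(mode: str) -> None:
    """Flip the flag for every subsequent kernel call in this process."""
    global _EVAL_MODE
    if mode not in ("parity", "fast"):
        raise ValueError(f"unknown eval mode: {mode!r}")
    with _LOCK:
        _EVAL_MODE = mode

def nll_kernel(rates, counts):
    """Poisson NLL terms λ − n·ln(λ); the summation strategy depends on the flag."""
    terms = [lam - n * math.log(lam) for lam, n in zip(rates, counts)]
    if _EVAL_MODE == "parity":
        return math.fsum(terms)  # correctly-rounded stand-in for Kahan summation
    return sum(terms)            # naive left-to-right summation (fast mode)
```

Here `math.fsum` stands in for Kahan compensated summation; the point is that one process-wide flag, not a per-call argument, selects the deterministic path.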
7-Tier Tolerance Hierarchy
Tier 1 — Per-Bin Expected Data (Pure Arithmetic)
| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| abs(ns_bin − pyhf_bin) | 1e-12 | 1e-10 |
Identical f64 arithmetic with no reduction over bins. Same operations, same order → identical results.
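The "same order" caveat is load-bearing: f64 addition is not associative, so any reduction that regroups terms can change the low bits. A two-line illustration:

```python
# Floating-point addition is not associative: regrouping changes the rounding.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
```

Tier 1 can sit at 1e-12 only because both implementations perform the identical per-bin operations in the identical order.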
Tier 2 — Expected Data Vector (Accumulated)
| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| max(abs(ns − pyhf)) over all bins | 1e-8 | 1e-8 |
Tier 3 — NLL Value (Scalar Reduction)
| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| abs(2·ns_nll − pyhf_twice_nll) | atol 1e-8 | rtol 1e-6 |
Summing λ − n·ln(λ) + ln(n!) over ~1000 bins accumulates rounding at ~1e-10. Kahan summation in Parity mode reduces this to ~1e-14.
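The compensated loop can be sketched as follows (textbook Kahan summation, not the ns_compute code); summing one million copies of 0.1 shows the effect:

```python
import math

def kahan_sum(values):
    """Textbook Kahan compensated summation (sketch, not ns_compute's code)."""
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for x in values:
        y = x - comp
        t = total + y
        comp = (t - total) - y  # what was rounded away this step
        total = t
    return total

terms = [0.1] * 1_000_000
ref = math.fsum(terms)                     # correctly rounded reference
naive_err = abs(sum(terms) - ref)          # drifts by roughly 1e-6 here
kahan_err = abs(kahan_sum(terms) - ref)    # stays within ~1e-10 of ref
```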
Tier 4 — Gradient (AD vs Finite-Difference)
| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| abs(ns_grad_i − pyhf_fd_i) | atol 1e-6, rtol 1e-4 | same |
NextStat uses reverse-mode AD (exact). pyhf uses central finite-difference (h=1e-5). The tolerance reflects FD noise, not NextStat error.
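The asymmetry is easy to demonstrate on a single Poisson bin (illustrative numbers; the nll below drops the ln(n!) constant):

```python
import math

def nll(lam, n=7):
    # Single Poisson bin: λ − n·ln(λ), with the ln(n!) constant dropped.
    return lam - n * math.log(lam)

def grad_exact(lam, n=7):
    # d/dλ [λ − n·ln(λ)] = 1 − n/λ, what reverse-mode AD evaluates exactly.
    return 1.0 - n / lam

def grad_fd(lam, h=1e-5):
    # Central finite difference with h = 1e-5, as in the pyhf-side check.
    return (nll(lam + h) - nll(lam - h)) / (2.0 * h)

lam = 5.0
diff = abs(grad_fd(lam) - grad_exact(lam))  # well inside atol 1e-6
```

The FD truncation error scales as h², so the comparison lands comfortably inside atol 1e-6; the tolerance budget is spent on the FD side.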
Tier 5 — Best-Fit Parameters (Optimizer Surface)
| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| abs(ns_hat_i − pyhf_hat_i) | 2e-4 | 2e-4 |
Near the NLL minimum the surface is flat. Small differences in gradient accuracy or stopping criteria shift the minimum by O(1e-4) in parameter space while changing NLL by O(1e-10). On large models (>100 params), L-BFGS-B often finds a lower NLL than SLSQP.
Tier 6 — Parameter Uncertainties (Hessian Sensitivity)
| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| abs(ns_unc_i − pyhf_unc_i) | 5e-4 | 5e-4 |
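Tier 6 compares σ_i = sqrt((H⁻¹)_ii) of the NLL Hessian at the best fit. A one-parameter Poisson sketch (illustrative, not NextStat's code path) recovers the textbook σ = sqrt(n):

```python
import math

def nll(lam, n=100):
    # Single Poisson bin: λ − n·ln(λ), constant dropped.
    return lam - n * math.log(lam)

lam_hat = 100.0  # MLE of a single Poisson bin is λ̂ = n
h = 1e-3

# Central second difference: the 1x1 Hessian of the NLL at the minimum.
hess = (nll(lam_hat + h) - 2.0 * nll(lam_hat) + nll(lam_hat - h)) / h**2

sigma = math.sqrt(1.0 / hess)  # σ = sqrt(H⁻¹) ≈ sqrt(n) = 10
```

Because σ depends on second derivatives of an already-flat surface, it inherits the Tier 5 minimizer jitter, hence the slightly looser 5e-4.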
Tier 7 — Toy Ensemble (Statistical)
| Metric | Tolerance | Rationale |
|---|---|---|
| Δmean(pull_mu) | 0.05 | Statistical noise from finite N_toys |
| Δstd(pull_mu) | 0.05 | Same |
| Δcoverage_1σ | 0.03 | Binomial noise |
| Δcoverage | 0.05 | General coverage |
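How these quantities are tallied can be sketched with a synthetic ensemble in which the per-toy fits are replaced by draws from the asymptotic Gaussian (an assumption made purely for illustration):

```python
import math
import random

random.seed(42)
mu_true, sigma_hat, n_toys = 1.0, 0.3, 10_000

# Fake toy fits: draw each mu_hat from the asymptotic Gaussian around mu_true.
pulls = [(random.gauss(mu_true, sigma_hat) - mu_true) / sigma_hat
         for _ in range(n_toys)]

mean = sum(pulls) / n_toys                                   # expect ~0
std = math.sqrt(sum((p - mean) ** 2 for p in pulls) / (n_toys - 1))  # ~1
cov_1sigma = sum(abs(p) < 1.0 for p in pulls) / n_toys       # expect ~0.6827
```

With N_toys = 10,000 the standard error on the pull mean is ~0.01 and the binomial error on 1σ coverage is ~0.005, so the 0.05/0.03 tolerances sit several standard errors out.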
Visual Summary
```
Tighter ◄──────────────────────────────────────── Looser

1e-12   1e-10   1e-8    1e-6    1e-4    1e-2
│       │       │       │       │       │
├─ Tier 1: per-bin expected_data (1e-12)
│       │       │       │       │       │
│       │       ├─ Tier 2: expected_data vector (1e-8)
│       │       ├─ Tier 3: NLL value (atol 1e-8, rtol 1e-6)
│       │       │       │       │       │
│       │       │       ├─ Tier 4: gradient (atol 1e-6, rtol 1e-4)
│       │       │       │       │       │
│       │       │       │       ├─ Tier 5: best-fit params (2e-4)
│       │       │       │       ├─ Tier 6: uncertainties (5e-4)
│       │       │       │       │       │
│       │       │       │       │       ├─ Tier 7: toys (0.03–0.05)
```
Architecture: Spec vs Speed
```
┌──────────────────────────────────────────────────────┐
│ pyhf NumPy = SPEC (oracle for CI/regressions)        │
│ expected_data, nll, grad_nll at fixed points         │
└───────────────────┬──────────────────────────────────┘
                    │ tolerance contract (7 tiers)
┌───────────────────┴──────────────────────────────────┐
│ PreparedModel = COMPILED MODEL                       │
│ workspace JSON → flat layout: observed_flat,         │
│ ln_factorials, obs_mask, modifier CSR indices        │
├──────────────────┬───────────────────────────────────┤
│ Parity mode      │ Fast mode                         │
│ Kahan, 1 thread  │ Accelerate/CUDA, Rayon            │
│ CI-gated         │ production                        │
└──────────────────┴───────────────────────────────────┘
```
Performance
| Operation | Throughput |
|---|---|
| PreparedModel NLL eval (simple) | ~200K evals/sec |
| Kahan overhead | <5% |
| Batch toys (1000, CPU Rayon) | ~50× vs serial |
| Batch toys (1000, CUDA) | ~200× vs serial |
| TTree parse + histogram fill | ~8.5× vs uproot+numpy |
| Ranking (16 NPs, autodiff) | ~4× vs pyhf FD |
ROOT Cross-Validation
| Fixture | NS vs pyhf max \|dq(μ)\| | NS vs ROOT max \|dq(μ)\| | ROOT status |
|---|---|---|---|
| xmlimport | 1e-7 | 0.051 | 0 (ok) |
| multichannel | 4e-7 | 3.4e-8 | 0 (ok) |
| coupled_histosys | 5e-6 | 22.5 | -1 (FAILED) |
pyhf is the specification reference. ROOT deviations are informational, not gating. Full analysis: ROOT/HistFactory Comparison.
CI Integration
```bash
# Full parity suite (deterministic, single-thread)
pytest tests/python/test_pyhf_validation.py \
  tests/python/test_expected_data_parity.py -v

# Golden report generation
NEXTSTAT_GOLDEN_REPORT_DIR=reports/ \
  pytest tests/python/test_expected_data_parity.py -v
```