pyhf Parity Contract
NextStat validates numerical correctness against pyhf (NumPy backend, f64) as the canonical oracle. A 7-tier tolerance hierarchy governs the expected agreement, ranging from per-bin arithmetic (1e-12) to toy-ensemble statistics (0.05).
Evaluation Modes
| Mode | Summation | Backend | Threads | Use Case |
|---|---|---|---|---|
| Parity | Kahan compensated | SIMD (Accelerate disabled) | 1 (forced) | Validation vs pyhf |
| Fast | Naive | SIMD / Accelerate / CUDA | Rayon (auto) | Production inference |
Activating Parity Mode
```bash
# CLI
nextstat fit --input workspace.json --parity
```

```python
# Python
import nextstat
nextstat.set_eval_mode("parity")
```

```rust
// Rust
ns_compute::set_eval_mode(ns_compute::EvalMode::Parity);
```
Parity mode sets a process-wide atomic flag: Apple Accelerate is disabled, the thread count is forced to 1, and Kahan compensated summation is used for the Poisson NLL.
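A minimal Python sketch of the same pattern, a single mode flag consulted by every kernel call; the names below are illustrative, not NextStat's internals (the real flag is a Rust atomic in ns_compute):

```python
import math
import threading

# Illustrative sketch of a process-wide eval-mode flag.
# NOT NextStat's implementation (the real flag is a Rust atomic in ns_compute).
_EVAL_MODE = "fast"
_LOCK = threading.Lock()

def set_eval_mode(mode: str) -> None:
    """Flip the flag for every subsequent kernel call in this process."""
    global _EVAL_MODE
    if mode not in ("parity", "fast"):
        raise ValueError(f"unknown eval mode: {mode!r}")
    with _LOCK:
        _EVAL_MODE = mode

def nll_kernel(rates, counts):
    """Poisson NLL terms λ − n·ln(λ); the summation strategy depends on the flag."""
    terms = [lam - n * math.log(lam) for lam, n in zip(rates, counts)]
    if _EVAL_MODE == "parity":
        return math.fsum(terms)  # correctly-rounded stand-in for Kahan summation
    return sum(terms)            # naive left-to-right summation (fast mode)
```

Here `math.fsum` stands in for Kahan compensated summation; the point is that one process-wide flag, not a per-call argument, selects the deterministic path.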
7-Tier Tolerance Hierarchy
Tier 1 — Per-Bin Expected Data (Pure Arithmetic)
| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| abs(ns_bin − pyhf_bin) | 1e-12 | 1e-10 |
Identical f64 arithmetic with no reduction over bins. Same operations, same order → identical results.
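The "same order" caveat is load-bearing: f64 addition is not associative, so any reduction that regroups terms can change the low bits. A two-line illustration:

```python
# Floating-point addition is not associative: regrouping changes the rounding.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
```

Tier 1 can sit at 1e-12 only because both implementations perform the identical per-bin operations in the identical order.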
Tier 2 — Expected Data Vector (Accumulated)
| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| max(abs(ns − pyhf)) over all bins | 1e-8 | 1e-8 |
Tier 3 — NLL Value (Scalar Reduction)
| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| abs(2·ns_nll − pyhf_twice_nll) | atol 1e-8 | rtol 1e-6 |
Summing λ − n·ln(λ) + ln(n!) over ~1000 bins accumulates rounding at ~1e-10. Kahan summation in Parity mode reduces this to ~1e-14.
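The compensated loop can be sketched as follows (textbook Kahan summation, not the ns_compute code); summing one million copies of 0.1 shows the effect:

```python
import math

def kahan_sum(values):
    """Textbook Kahan compensated summation (sketch, not ns_compute's code)."""
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for x in values:
        y = x - comp
        t = total + y
        comp = (t - total) - y  # what was rounded away this step
        total = t
    return total

terms = [0.1] * 1_000_000
ref = math.fsum(terms)                     # correctly rounded reference
naive_err = abs(sum(terms) - ref)          # drifts by roughly 1e-6 here
kahan_err = abs(kahan_sum(terms) - ref)    # stays within ~1e-10 of ref
```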
Tier 4 — Gradient (AD vs Finite-Difference)
| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| abs(ns_grad_i − pyhf_fd_i) | atol 1e-6, rtol 1e-4 | same |
NextStat uses reverse-mode AD (exact). pyhf uses central finite-difference (h=1e-5). The tolerance reflects FD noise, not NextStat error.
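The asymmetry is easy to demonstrate on a single Poisson bin (illustrative numbers; the nll below drops the ln(n!) constant):

```python
import math

def nll(lam, n=7):
    # Single Poisson bin: λ − n·ln(λ), with the ln(n!) constant dropped.
    return lam - n * math.log(lam)

def grad_exact(lam, n=7):
    # d/dλ [λ − n·ln(λ)] = 1 − n/λ, what reverse-mode AD evaluates exactly.
    return 1.0 - n / lam

def grad_fd(lam, h=1e-5):
    # Central finite difference with h = 1e-5, as in the pyhf-side check.
    return (nll(lam + h) - nll(lam - h)) / (2.0 * h)

lam = 5.0
diff = abs(grad_fd(lam) - grad_exact(lam))  # well inside atol 1e-6
```

The FD truncation error scales as h², so the comparison lands comfortably inside atol 1e-6; the tolerance budget is spent on the FD side.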
Tier 5 — Best-Fit Parameters (Optimizer Surface)
| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| abs(ns_hat_i − pyhf_hat_i) | 2e-4 | 2e-4 |
Near the NLL minimum the surface is flat. Small differences in gradient accuracy or stopping criteria shift the minimum by O(1e-4) in parameter space while changing NLL by O(1e-10). On large models (>100 params), L-BFGS-B often finds a lower NLL than SLSQP.
Tier 6 — Parameter Uncertainties (Hessian Sensitivity)
| Metric | Parity Tolerance | Fast Tolerance |
|---|---|---|
| abs(ns_unc_i − pyhf_unc_i) | 5e-4 | 5e-4 |
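Tier 6 compares σ_i = sqrt((H⁻¹)_ii) of the NLL Hessian at the best fit. A one-parameter Poisson sketch (illustrative, not NextStat's code path) recovers the textbook σ = sqrt(n):

```python
import math

def nll(lam, n=100):
    # Single Poisson bin: λ − n·ln(λ), constant dropped.
    return lam - n * math.log(lam)

lam_hat = 100.0  # MLE of a single Poisson bin is λ̂ = n
h = 1e-3

# Central second difference: the 1x1 Hessian of the NLL at the minimum.
hess = (nll(lam_hat + h) - 2.0 * nll(lam_hat) + nll(lam_hat - h)) / h**2

sigma = math.sqrt(1.0 / hess)  # σ = sqrt(H⁻¹) ≈ sqrt(n) = 10
```

Because σ depends on second derivatives of an already-flat surface, it inherits the Tier 5 minimizer jitter, hence the slightly looser 5e-4.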
Tier 7 — Toy Ensemble (Statistical)
| Metric | Tolerance | Rationale |
|---|---|---|
| Δmean(pull_mu) | 0.05 | Statistical noise from finite N_toys |
| Δstd(pull_mu) | 0.05 | Same |
| Δcoverage_1σ | 0.03 | Binomial noise |
| Δcoverage | 0.05 | General coverage |
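How these quantities are tallied can be sketched with a synthetic ensemble in which the per-toy fits are replaced by draws from the asymptotic Gaussian (an assumption made purely for illustration):

```python
import math
import random

random.seed(42)
mu_true, sigma_hat, n_toys = 1.0, 0.3, 10_000

# Fake toy fits: draw each mu_hat from the asymptotic Gaussian around mu_true.
pulls = [(random.gauss(mu_true, sigma_hat) - mu_true) / sigma_hat
         for _ in range(n_toys)]

mean = sum(pulls) / n_toys                                   # expect ~0
std = math.sqrt(sum((p - mean) ** 2 for p in pulls) / (n_toys - 1))  # ~1
cov_1sigma = sum(abs(p) < 1.0 for p in pulls) / n_toys       # expect ~0.6827
```

With N_toys = 10,000 the standard error on the pull mean is ~0.01 and the binomial error on 1σ coverage is ~0.005, so the 0.05/0.03 tolerances sit several standard errors out.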
Visual Summary
```
Tighter ◄──────────────────────────────────────── Looser

1e-12   1e-10   1e-8    1e-6    1e-4    1e-2
│       │       │       │       │       │
├─ Tier 1: per-bin expected_data (1e-12)
│       │       │       │       │       │
│       │       ├─ Tier 2: expected_data vector (1e-8)
│       │       ├─ Tier 3: NLL value (atol 1e-8, rtol 1e-6)
│       │       │       │       │       │
│       │       │       ├─ Tier 4: gradient (atol 1e-6, rtol 1e-4)
│       │       │       │       │       │
│       │       │       │       ├─ Tier 5: best-fit params (2e-4)
│       │       │       │       ├─ Tier 6: uncertainties (5e-4)
│       │       │       │       │       │
│       │       │       │       │       ├─ Tier 7: toys (0.03–0.05)
```
Architecture: Spec vs Speed
```
┌──────────────────────────────────────────────────────┐
│ pyhf NumPy = SPEC (oracle for CI/regressions)        │
│ expected_data, nll, grad_nll at fixed points         │
└───────────────────┬──────────────────────────────────┘
                    │ tolerance contract (7 tiers)
┌───────────────────┴──────────────────────────────────┐
│ PreparedModel = COMPILED MODEL                       │
│ workspace JSON → flat layout: observed_flat,         │
│ ln_factorials, obs_mask, modifier CSR indices        │
├──────────────────┬───────────────────────────────────┤
│ Parity mode      │ Fast mode                         │
│ Kahan, 1 thread  │ Accelerate/CUDA, Rayon            │
│ CI-gated         │ production                        │
└──────────────────┴───────────────────────────────────┘
```
Performance
| Operation | Throughput |
|---|---|
| PreparedModel NLL eval (simple) | ~200K evals/sec |
| Kahan overhead | <5% |
| Batch toys (1000, CPU Rayon) | ~50× vs serial |
| Batch toys (1000, CUDA) | ~200× vs serial |
| TTree parse + histogram fill | ~8.5× vs uproot+numpy |
| Ranking (16 NPs, autodiff) | ~4× vs pyhf FD |
ROOT Cross-Validation
| Fixture | NS vs pyhf max \|dq(μ)\| | NS vs ROOT max \|dq(μ)\| | ROOT status |
|---|---|---|---|
| xmlimport | 1e-7 | 0.051 | 0 (ok) |
| multichannel | 4e-7 | 3.4e-8 | 0 (ok) |
| coupled_histosys | 5e-6 | 22.5 | -1 (FAILED) |
pyhf is the specification reference. ROOT deviations are informational, not gating. Full analysis: ROOT/HistFactory Comparison.
CI Integration
```bash
# Full parity suite (deterministic, single-thread)
pytest tests/python/test_pyhf_validation.py \
  tests/python/test_expected_data_parity.py -v

# Golden report generation
NEXTSTAT_GOLDEN_REPORT_DIR=reports/ \
  pytest tests/python/test_expected_data_parity.py -v
```