ROOT/HistFactory 3-Way Comparison

Comprehensive validation of NextStat against both pyhf (specification reference) and ROOT/RooFit (legacy implementation) on canonical HistFactory fixtures. NextStat agrees with pyhf to better than 1e-5 in q(μ) on every fixture.

Executive Summary

Fixture	Modifiers	NS vs pyhf |Δq(μ)|	NS vs ROOT |Δq(μ)|	ROOT Status
xmlimport	OverallSys + StatError	1e-7	0.051	0 (converged)
multichannel	ShapeSys	4e-7	3.4e-8	0 (converged)
coupled_histosys	HistoSys (coupled NP)	5e-6	22.5	-1 (FAILED)

Real-World TREx Exports

Case	NS vs ROOT |Δq(μ)|	ROOT Issue
simple_fixture	1.6e-10	None (perfect)
histfactory_fixture	1.89	Optimizer divergence
hepdata EWK	0.0	Free fit blowup (μ̂ = 4.9e23)
tttt-prod (249 params)	0.04	Tail optimizer convergence

Methodology

Each fixture is processed through three independent pipelines reading the same XML + ROOT histograms:

HistFactory XML + ROOT histograms
        │
        ├──► hist2workspace → RooFit → ROOT profile scan (C++ via PyROOT)
        ├──► pyhf.readxml   → pyhf   → pyhf profile scan  (Python)
        └──► NextStat import → PreparedModel → NextStat profile scan (Rust)

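The profile scan computes q̃(μ) = 2·[NLL(μ) − NLL(μ̂)] at 31 evenly spaced points on μ ∈ [0, 3] (step 0.1), where NLL(μ) is minimized over the nuisance parameters at fixed μ and NLL(μ̂) is the free-fit minimum. The test statistic is the standard q_mu_tilde (Cowan et al., arXiv:1007.1727).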
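The scan grid and the 2·ΔNLL formula can be sketched on a toy model. This is a minimal illustration, not any of the three pipelines: it assumes a single-bin Poisson counting model with hypothetical yields (n = 12, s = 5, b = 8) and no nuisance parameters, so the conditional minimization is trivial and μ̂ has a closed form. It applies the formula exactly as written above and does not add the one-sided truncation that q_mu_tilde uses elsewhere.

```python
import math

def nll(mu, n=12.0, s=5.0, b=8.0):
    """Poisson NLL for one bin (hypothetical yields), dropping the constant ln(n!)."""
    lam = mu * s + b
    return lam - n * math.log(lam)

def profile_scan(nll_fn, mu_hat, start=0.0, stop=3.0, points=31):
    """Return [(mu, q)] with q = 2 * [NLL(mu) - NLL(mu_hat)] on an even grid."""
    step = (stop - start) / (points - 1)
    nll_min = nll_fn(mu_hat)
    scan = []
    for i in range(points):
        mu = start + i * step
        scan.append((mu, 2.0 * (nll_fn(mu) - nll_min)))
    return scan

# Closed-form best fit for this toy model: mu_hat = (n - b) / s = 0.8
mu_hat = (12.0 - 8.0) / 5.0
scan = profile_scan(nll, mu_hat)

for mu, q in scan[::10]:
    print(f"mu = {mu:.1f}  q = {q:.4f}")
```

In a real HistFactory model, each call to the NLL at fixed μ would itself be a fit over the nuisance parameters, which is where the three implementations can disagree.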

Detailed q(μ) Comparison

xmlimport — ROOT vs NextStat vs pyhf

μ	ROOT q(μ)	pyhf q(μ)	NS q(μ)	NS − pyhf	ROOT − NS
1.2	0.01957	0.01956	0.01956	+1e-8	+1e-5
2.0	2.07272	2.06669	2.06669	−4e-7	+6e-3
3.0	9.05788	9.00676	9.00676	+1e-7	+5.1e-2

NextStat and pyhf agree to better than 1e-6. ROOT systematically overshoots at high μ, which is consistent with Minuit2's conditional minimizer converging to a slightly higher NLL at extreme values.
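As a quick cross-check, the ROOT − NS column can be reproduced from the rounded values in the table. The sketch below assumes the summary figure is the largest pointwise difference over the scan; the helper name `max_abs_diff` is illustrative and not part of any of the three tools.

```python
# Values copied from the xmlimport table above (rounded as published).
mu_points = [1.2, 2.0, 3.0]
root_q = [0.01957, 2.07272, 9.05788]
ns_q = [0.01956, 2.06669, 9.00676]

def max_abs_diff(a, b):
    """Largest pointwise |a_i - b_i| and the index where it occurs."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    worst = max(range(len(diffs)), key=diffs.__getitem__)
    return diffs[worst], worst

delta, idx = max_abs_diff(root_q, ns_q)
print(f"max |ROOT - NS| = {delta:.3f} at mu = {mu_points[idx]}")
# prints: max |ROOT - NS| = 0.051 at mu = 3.0
```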

coupled_histosys — ROOT divergence

μ	ROOT q(μ)	pyhf q(μ)	NS q(μ)	NS − pyhf	ROOT − NS
1.0	0.991	0.445	0.445	+4e-6	+0.545
2.0	15.526	6.543	6.543	+5e-6	+8.98
3.0	41.566	19.042	19.042	+4e-6	+22.52

NextStat and pyhf agree to better than 1e-5. ROOT's results differ substantially from μ = 1.0 onward, and the discrepancy grows with μ. ROOT reports status_free = -1 (Minuit2 could not determine a positive-definite covariance matrix).

Root Cause: Why ROOT Diverges

The NLL offset between ROOT and NextStat should be constant across all μ values (it represents the parameter-independent constraint constant). For coupled_histosys:

Point	ROOT NLL	NS NLL	Offset
Free fit	434.754	14.017	420.737
μ = 0.0	434.841	14.103	420.738
μ = 2.0	442.517	17.288	425.229
μ = 3.0	455.537	23.537	432.000

The offset grows from 420.74 to 432.00, an increase of about 11.26. This rules out a pure optimizer difference and indicates that ROOT evaluates the coupled HistoSys likelihood differently at large alpha values.
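The constant-offset check is simple arithmetic on the table above. A minimal sketch, using the published NLL values; the tolerance `OFFSET_TOL` is a hypothetical choice for illustration, not a threshold used by the validation script.

```python
# (ROOT NLL, NextStat NLL) pairs copied from the coupled_histosys table above.
nll_points = {
    "free fit": (434.754, 14.017),
    "mu = 0.0": (434.841, 14.103),
    "mu = 2.0": (442.517, 17.288),
    "mu = 3.0": (455.537, 23.537),
}

OFFSET_TOL = 1e-2  # hypothetical tolerance for calling the offset "constant"

offsets = {label: root - ns for label, (root, ns) in nll_points.items()}
spread = max(offsets.values()) - min(offsets.values())
offset_is_constant = spread < OFFSET_TOL

for label, off in offsets.items():
    print(f"{label:>8}: offset = {off:.3f}")
print(f"spread = {spread:.3f}, constant = {offset_is_constant}")
```

The free fit and μ = 0.0 offsets differ by only 0.001, so the drift is confined to the higher μ points.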

Timing Comparison

Fixture	ROOT	pyhf	NextStat	NS speedup vs ROOT	NS speedup vs pyhf
xmlimport	0.91 s	0.23 s	0.003 s	303×	73×
multichannel	1.98 s	0.26 s	0.007 s	283×	37×
coupled_histosys	1.76 s	0.15 s	0.002 s	880×	75×

Validation Hierarchy

SPECIFICATION (mathematical definition, arXiv:1007.1727)
    │
    ├── pyhf (ATLAS reference implementation)
    │       │
    │       ├── NextStat  ✓  < 1e-5 on q(μ), CI-gated
    │       │
    │       └── ROOT/RooFit
    │           - ShapeSys: < 1e-6 (excellent)
    │           - OverallSys: ~0.05 (optimizer)
    │           - Coupled HistoSys: DIVERGES (status=-1)
    │           NOT CI-gated (informational only)
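
The gating policy in the hierarchy above could be expressed roughly as follows. This is a sketch under assumptions: the function name `gate`, the dictionary layout, and the reuse of the 1e-5 threshold as a warning level for ROOT are all illustrative; only the numbers come from the executive summary table.

```python
PYHF_TOL = 1e-5  # gate threshold quoted in the hierarchy above

# |dq(mu)| values copied from the executive summary table.
summary = {
    "xmlimport": {"vs_pyhf": 1e-7, "vs_root": 0.051},
    "multichannel": {"vs_pyhf": 4e-7, "vs_root": 3.4e-8},
    "coupled_histosys": {"vs_pyhf": 5e-6, "vs_root": 22.5},
}

def gate(results, tol=PYHF_TOL):
    """Return (failures, warnings): pyhf mismatches fail, ROOT mismatches only warn."""
    failures = [name for name, d in results.items() if d["vs_pyhf"] >= tol]
    warnings = [name for name, d in results.items() if d["vs_root"] >= tol]
    return failures, warnings

failures, warnings = gate(summary)
print("CI failures (pyhf):", failures)
print("Informational (ROOT):", warnings)
```

Keeping the ROOT comparison out of the failure list means a Minuit2 convergence problem cannot block a build, while the discrepancy is still reported.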

Reproducing These Results

python tests/validate_root_profile_scan.py \
  --histfactory-xml tests/fixtures/pyhf_xmlimport/config/example.xml \
  --rootdir tests/fixtures/pyhf_xmlimport \
  --include-pyhf --keep