ROOT/HistFactory 3-Way Comparison
Comprehensive validation of NextStat against both pyhf (specification reference) and ROOT/RooFit (legacy implementation) on canonical HistFactory fixtures. Agreement with pyhf is sub-1e-5 on q(μ) across all fixtures.
Executive Summary
| Fixture | Modifiers | NS vs pyhf \|dq(μ)\| | NS vs ROOT \|dq(μ)\| | ROOT Status |
|---|---|---|---|---|
| xmlimport | OverallSys + StatError | 1e-7 | 0.051 | 0 (converged) |
| multichannel | ShapeSys | 4e-7 | 3.4e-8 | 0 (converged) |
| coupled_histosys | HistoSys (coupled NP) | 5e-6 | 22.5 | -1 (FAILED) |
Real-World TREx Exports
| Case | NS vs ROOT \|dq(μ)\| | ROOT Issue |
|---|---|---|
| simple_fixture | 1.6e-10 | None (perfect) |
| histfactory_fixture | 1.89 | Optimizer divergence |
| hepdata EWK | 0.0 | Free fit blowup (μ̂ = 4.9e23) |
| tttt-prod (249 params) | 0.04 | Tail optimizer convergence |
Methodology
Each fixture is processed through three independent pipelines reading the same XML + ROOT histograms:
HistFactory XML + ROOT histograms
│
├──► hist2workspace → RooFit → ROOT profile scan (C++ via PyROOT)
├──► pyhf.readxml → pyhf → pyhf profile scan (Python)
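└──► NextStat import → PreparedModel → NextStat profile scan (Rust)

The profile scan computes q̃(μ) = 2·[NLL(μ) − NLL(μ̂)] at 31 evenly spaced points on μ ∈ [0, 3], i.e. in steps of 0.1. Here NLL(μ) is the conditional minimum with the nuisance parameters profiled at fixed μ, and NLL(μ̂) is the free-fit minimum. The test statistic is the standard q_mu_tilde (Cowan et al., arXiv:1007.1727).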
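The scan definition above can be sketched in a few lines of plain Python. This is only an illustration of the formula as written on this page: the helper names `mu_grid` and `q_from_nll` are hypothetical, not part of NextStat, pyhf, or ROOT, and the sketch leaves out the boundary handling that the full q_mu_tilde definition adds (for example q = 0 when μ̂ > μ). The NLL inputs are the rounded NextStat values quoted for coupled_histosys elsewhere on this page.

```python
def mu_grid(lo=0.0, hi=3.0, n_points=31):
    """Evenly spaced scan points with both endpoints included."""
    step = (hi - lo) / (n_points - 1)
    return [lo + i * step for i in range(n_points)]


def q_from_nll(nll_conditional, nll_free):
    """q(mu) = 2 * [NLL(mu) - NLL(mu_hat)] for a single scan point."""
    return 2.0 * (nll_conditional - nll_free)


grid = mu_grid()  # 0.0, 0.1, ..., 3.0

# NextStat NLL values for coupled_histosys, copied from this page
# (rounded to 3 decimals, so q is only reproduced to about 1e-3).
nll_free = 14.017
nll_at_mu = {0.0: 14.103, 2.0: 17.288, 3.0: 23.537}

q_scan = {mu: q_from_nll(nll, nll_free) for mu, nll in nll_at_mu.items()}
for mu, q in sorted(q_scan.items()):
    print(f"mu = {mu:.1f}  q = {q:.3f}")
```

With these rounded inputs the sketch gives q(2.0) ≈ 6.542 and q(3.0) ≈ 19.040, which agrees with the tabulated NextStat values 6.543 and 19.042 to within the rounding of the NLL inputs.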
Detailed q(μ) Comparison
xmlimport — ROOT vs NextStat vs pyhf
| μ | ROOT q(μ) | pyhf q(μ) | NS q(μ) | NS − pyhf | ROOT − NS |
|---|---|---|---|---|---|
| 1.2 | 0.01957 | 0.01956 | 0.01956 | +1e-8 | +1e-5 |
| 2.0 | 2.07272 | 2.06669 | 2.06669 | −4e-7 | +6e-3 |
| 3.0 | 9.05788 | 9.00676 | 9.00676 | +1e-7 | +5.1e-2 |
NextStat and pyhf agree to better than 1e-6 at every point shown. ROOT systematically overshoots at high μ, which is consistent with Minuit2's conditional minimizer converging to a slightly higher NLL at extreme values.
coupled_histosys — ROOT divergence
| μ | ROOT q(μ) | pyhf q(μ) | NS q(μ) | NS − pyhf | ROOT − NS |
|---|---|---|---|---|---|
| 1.0 | 0.991 | 0.445 | 0.445 | +4e-6 | +0.545 |
| 2.0 | 15.526 | 6.543 | 6.543 | +5e-6 | +8.98 |
| 3.0 | 41.566 | 19.042 | 19.042 | +4e-6 | +22.52 |
NextStat and pyhf agree to < 1e-5. ROOT gives completely different results starting from μ = 1.0, with divergence growing with μ. ROOT reports status_free = -1 (Minuit2 could not determine a positive-definite covariance matrix).
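A comparison of this kind reduces to one number per pair of scans. The sketch below assumes the |dq(μ)| figures quoted on this page are the largest absolute difference over the shared scan points; that reading, the helper name `max_abs_dq`, and the `TOLERANCE` constant are illustrative assumptions rather than the actual validation code. The inputs are the three coupled_histosys rows from the table above.

```python
def max_abs_dq(scan_a, scan_b):
    """Largest |q_a(mu) - q_b(mu)| over the mu points both scans share."""
    shared = sorted(set(scan_a) & set(scan_b))
    if not shared:
        raise ValueError("scans share no mu points")
    return max(abs(scan_a[mu] - scan_b[mu]) for mu in shared)


# coupled_histosys rows from the table above (3-decimal rounding).
root_q = {1.0: 0.991, 2.0: 15.526, 3.0: 41.566}
ns_q = {1.0: 0.445, 2.0: 6.543, 3.0: 19.042}

# Hypothetical gate that mirrors the "< 1e-5" agreement quoted for pyhf.
TOLERANCE = 1e-5

worst = max_abs_dq(root_q, ns_q)
passes = worst < TOLERANCE
print(f"max |dq| = {worst:.3f}, within tolerance: {passes}")
```

On these rows the largest gap is at μ = 3.0, about 22.52, so a gate at 1e-5 would fail for ROOT on this fixture by more than six orders of magnitude.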
Root Cause: Why ROOT Diverges
The NLL offset between ROOT and NextStat should be constant across all μ values (it represents the parameter-independent constraint constant). For coupled_histosys:
| Point | ROOT NLL | NS NLL | Offset |
|---|---|---|---|
| Free fit | 434.754 | 14.017 | 420.737 |
| μ = 0.0 | 434.841 | 14.103 | 420.738 |
| μ = 2.0 | 442.517 | 17.288 | 425.229 |
| μ = 3.0 | 455.537 | 23.537 | 432.000 |
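The offset grows from 420.74 at the free fit to 432.00 at μ = 3.0, a drift of about 11.26 NLL units. Twice that drift (about 22.5) accounts for the ROOT − NS gap in q(μ) at μ = 3.0. This rules out a pure optimizer difference and indicates ROOT evaluates the coupled HistoSys likelihood differently at large alpha values.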
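The offset check is simple arithmetic on the table rows and can be reproduced as follows. This is a minimal sketch: the `offset_is_constant` helper and its 0.01 tolerance are hypothetical choices for illustration, and the numbers are the rounded NLL values from the table.

```python
# (label, ROOT NLL, NextStat NLL) for coupled_histosys, from the table.
rows = [
    ("Free fit", 434.754, 14.017),
    ("mu = 0.0", 434.841, 14.103),
    ("mu = 2.0", 442.517, 17.288),
    ("mu = 3.0", 455.537, 23.537),
]

# Per-point offset between the two implementations.
offsets = {label: root_nll - ns_nll for label, root_nll, ns_nll in rows}

# Spread of the offset across the scan; zero if it were a pure constant.
drift = max(offsets.values()) - min(offsets.values())


def offset_is_constant(values, tol=0.01):
    """True if all offsets agree within tol (tol is an illustrative choice)."""
    values = list(values)
    return max(values) - min(values) <= tol


for label, offset in offsets.items():
    print(f"{label:9s} offset = {offset:.3f}")
print(f"drift = {drift:.3f}, constant: {offset_is_constant(offsets.values())}")
```

The free fit and μ = 0.0 rows alone would pass such a check (they differ by 0.001), which is why the problem only becomes visible once the scan reaches larger μ.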
Timing Comparison
| Fixture | ROOT | pyhf | NextStat | ROOT/NS | pyhf/NS |
|---|---|---|---|---|---|
| xmlimport | 0.91 s | 0.23 s | 0.003 s | 303× | 73× |
| multichannel | 1.98 s | 0.26 s | 0.007 s | 283× | 37× |
| coupled_histosys | 1.76 s | 0.15 s | 0.002 s | 880× | 75× |
Validation Hierarchy
SPECIFICATION (mathematical definition, arXiv:1007.1727)
│
├── pyhf (ATLAS reference implementation)
│   │
│   ├── NextStat ✓ < 1e-5 on q(μ), CI-gated
│   │
│   └── ROOT/RooFit
│       - ShapeSys: < 1e-6 (excellent)
│       - OverallSys: ≈ 0.05 (optimizer)
│       - Coupled HistoSys: DIVERGES (status=-1)
│       NOT CI-gated (informational only)

Reproducing These Results
python tests/validate_root_profile_scan.py \
--histfactory-xml tests/fixtures/pyhf_xmlimport/config/example.xml \
--rootdir tests/fixtures/pyhf_xmlimport \
--include-pyhf --keep