Where ROOT Gets It Wrong
A rigorous numerical comparison of HistFactory implementations: ROOT/RooFit, pyhf, and NextStat on profile likelihood scans.
Abstract. We present a systematic 3-way numerical comparison of HistFactory implementations: ROOT/RooFit (Minuit2), pyhf (SLSQP), and NextStat (L-BFGS-B). NextStat and pyhf agree on q(μ) to < 1e-5 across all fixtures. ROOT exhibits three failure modes: optimizer non-convergence, catastrophic free-fit divergence (μ̂ = 4.9e23), and systematic q(μ) overestimation. On large models (>100 params), L-BFGS-B consistently finds deeper minima. All results are fully reproducible.
Key Takeaways
- ShapeSys: ROOT, pyhf, and NextStat agree to < 1e-6 when fits converge
- OverallSys: ROOT inflates q(μ) by up to 0.6% at high μ (Minuit2 conditional fits)
- Coupled HistoSys: ROOT diverges by Δq(μ) = 22.5 at μ = 3.0; the NLL offset is non-constant → a model-level discrepancy
- Large models: L-BFGS-B finds a deeper stationary point (‖grad‖ = 0.020 vs 4.63 on a 184-param workspace)
- Performance: NextStat is 37×–880× faster, plus an additional 6.4× with GPU for toys on NVIDIA hardware
1. Introduction
The HistFactory model is the standard statistical model for binned analyses at the LHC. Its profile likelihood ratio test statistic q̃(μ) is the foundation of exclusion limits and discovery significances at ATLAS and CMS.
Three independent implementations are widely used:
- ROOT/RooFit — C++, Minuit2. De facto standard for I/O and visualization at CERN.
- pyhf — Pure Python, SLSQP. Formal reference for the HistFactory specification (ATLAS ATL-PHYS-PUB-2019-029).
- NextStat — Rust, L-BFGS-B. Reverse-mode AD, optional CUDA/Metal GPU acceleration.
Do these three implementations produce the same physics results? If not, which are correct? We answer with a rigorous 3-way comparison covering 31-point profile scans on canonical fixtures exercising all principal modifier types.
2. Methodology
2.1 Validation Pipeline
HistFactory XML + ROOT histograms
|
+---> hist2workspace -> RooFit -> ROOT profile scan (C++/PyROOT)
+---> pyhf.readxml -> pyhf -> pyhf profile scan (Python)
+---> NextStat import -> Rust -> NextStat profile scan (Rust)

The profile scan computes q̃(μ) at 31 evenly spaced points in μ ∈ [0, 3]. All three pipelines use bounded optimization with the POI constrained to μ ≥ 0.
2.2 Test Statistic
        ⎧ 2 · [ NLL(μ, θ̂_μ) − NLL(μ̂, θ̂) ]   if μ̂ ≤ μ
q̃(μ) = ⎨
        ⎩ 0                                   if μ̂ > μ

As defined in Cowan et al. [2], implemented identically in all three tools.
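The clipping rule is easy to see on a toy single-bin counting model (a sketch with made-up numbers n, s, b and no nuisance parameters; the real scans also profile θ):

```python
import numpy as np
from scipy.optimize import minimize

# Toy single-bin counting model: n ~ Poisson(mu*s + b), b known.
# Illustrates only the q~(mu) clipping rule, not the full HistFactory
# likelihood; n, s, b are invented for the example.
n, s, b = 12.0, 5.0, 7.0

def nll(mu):
    lam = mu * s + b
    return lam - n * np.log(lam)  # Poisson NLL up to a mu-independent constant

# Free fit with the POI bounded at mu >= 0, as in the scans above
res = minimize(lambda p: nll(p[0]), x0=[1.5], bounds=[(0.0, 10.0)],
               method="L-BFGS-B")
mu_hat = res.x[0]

def q_tilde(mu):
    if mu_hat > mu:          # clip to zero when the free fit exceeds mu
        return 0.0
    return 2.0 * (nll(mu) - nll(mu_hat))
```

Here μ̂ lands at (n − b)/s = 1.0, so q̃ vanishes below μ = 1 and grows above it.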
2.3 Fixtures
| Fixture | Channels | Key Modifiers | Params |
|---|---|---|---|
| xmlimport | 1 | OverallSys, StatError | ~5 |
| multichannel | 2 | ShapeSys (Poisson) | ~10 |
| coupled_histosys | 2 | HistoSys (coupled) | ~5 |
We additionally test real-world TRExFitter exports with up to 249 nuisance parameters.
2.4 Interpolation Codes
- NormSys (OverallSys): Code 4 (polynomial, 6th order)
- HistoSys: Code 4p (piecewise polynomial)
This eliminates interpolation choice as a source of discrepancy.
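For reference, the code-4 NormSys behavior can be sketched directly from its definition: exponential tails for |α| ≥ 1 and a degree-6 polynomial matched in value, first, and second derivative at α = ±1. This is a from-scratch illustration of the published interpolation rule, not code taken from any of the three tools:

```python
import numpy as np

def normsys_code4(alpha, lo, hi):
    """Interpolation code 4 for a NormSys factor with down/up variations
    (lo, hi). Exponential extrapolation outside |alpha| >= 1; inside,
    f(a) = 1 + sum_{i=1..6} c_i a^i with C2 matching at a = +/-1."""
    if alpha >= 1.0:
        return hi ** alpha
    if alpha <= -1.0:
        return lo ** (-alpha)
    lh, ll = np.log(hi), np.log(lo)
    # Boundary conditions: value, 1st and 2nd derivative at a = +1 and -1
    b = np.array([
        hi - 1.0,       # f(+1)  = hi
        lo - 1.0,       # f(-1)  = lo
        hi * lh,        # f'(+1) = d/da hi^a  at a = +1
        -lo * ll,       # f'(-1) = d/da lo^-a at a = -1
        hi * lh * lh,   # f''(+1)
        lo * ll * ll,   # f''(-1)
    ])
    A = np.array([
        [1, 1, 1, 1, 1, 1],        # a^i          at a = +1
        [-1, 1, -1, 1, -1, 1],     # a^i          at a = -1
        [1, 2, 3, 4, 5, 6],        # i a^(i-1)    at a = +1
        [1, -2, 3, -4, 5, -6],     # i a^(i-1)    at a = -1
        [0, 2, 6, 12, 20, 30],     # i(i-1)a^(i-2) at a = +1
        [0, 2, -6, 12, -20, 30],   # i(i-1)a^(i-2) at a = -1
    ], dtype=float)
    c = np.linalg.solve(A, b)
    return 1.0 + sum(ci * alpha ** (i + 1) for i, ci in enumerate(c))
```

The C2 matching at ±1 is what makes code 4 well-behaved for gradient-based optimizers, which is why pinning all three tools to it isolates the optimizer as the remaining variable.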
2.5 Versions and Environment
| Component | Version |
|---|---|
| ROOT | 6.38.00 (hist2workspace + RooFit/Minuit2) |
| pyhf | 0.7.6 (SciPy 1.17.0, NumPy 2.4.2) |
| NextStat | 0.9.0 (git c186c29) |
| Host | Apple M5 (arm64), macOS 26.2 |
Raw scan JSON is snapshotted under docs/blog/assets/numerical-accuracy/data/ and rendered into SVGs by scripts/blog/generate_numerical_accuracy_plots.py.
3. Results
3.1 The Good: ShapeSys — Near-Perfect 3-Way Agreement
The multichannel fixture with ShapeSys modifiers achieves the gold standard: all three implementations agree to better than 1e-6 on q(μ):
| μ | ROOT q(μ) | pyhf q(μ) | NextStat q(μ) | NS−pyhf | ROOT−NS |
|---|---|---|---|---|---|
| 1.0 | 3.10348 | 3.10348 | 3.10348 | -8e-7 | ~0 |
| 2.0 | 15.5334 | 15.5334 | 15.5334 | -4e-7 | -3e-8 |
| 3.0 | 36.2352 | 36.2352 | 36.2352 | -4e-7 | -3e-8 |
When all three optimizers converge, the implementations produce identical physics results.
3.2 The Bad: OverallSys — ROOT Overestimates at High μ
The xmlimport fixture reveals a systematic bias in ROOT at high signal strengths:
| μ | ROOT q(μ) | pyhf q(μ) | NextStat q(μ) | ROOT−NS |
|---|---|---|---|---|
| 1.2 | 0.01957 | 0.01956 | 0.01956 | +1e-5 |
| 2.0 | 2.07272 | 2.06669 | 2.06669 | +6e-3 |
| 3.0 | 9.05788 | 9.00676 | 9.00676 | +5.1e-2 |
NextStat and pyhf agree to 1e-7. ROOT overestimates q(μ) by up to 0.051 at μ = 3.0 (0.6% relative), with the excess growing as μ moves away from μ̂.
Diagnosis: The NLL offset between ROOT and NextStat is constant (11.06 ± 0.001), confirming identical model evaluation. The growing q(μ) delta is purely an optimizer effect: Minuit2's conditional minimizer converges to a slightly higher NLL at extreme μ values. ROOT would report a tighter exclusion limit than warranted by the data.
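The constant-offset check generalizes into a one-line diagnostic: two tools that evaluate the same model may differ by a constant in NLL (different dropped constants), so any drift in the offset across the scan flags a model-level discrepancy. A sketch with illustrative numbers:

```python
import numpy as np

def offset_is_constant(nll_a, nll_b, atol=1e-2):
    """True when two NLL curves over the same scan points differ only
    by a constant, i.e. both tools evaluate the same model up to
    dropped constants; a drifting offset means the models differ."""
    offsets = np.asarray(nll_a, float) - np.asarray(nll_b, float)
    return float(np.ptp(offsets)) < atol  # peak-to-peak spread of offset

# Illustrative curves: the first pair shares a constant 11.06 offset
# (the OverallSys situation); the second drifts (the HistoSys situation).
same_model = offset_is_constant([20.06, 22.13, 29.12], [9.00, 11.07, 18.06])
different = offset_is_constant([420.7, 425.2, 432.0], [0.0, 4.0, 10.5])
```

The same test is what separates §3.2 (constant offset, optimizer effect) from §3.3 (drifting offset, model effect).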
3.3 The Ugly: Coupled HistoSys — ROOT Fails Completely
The coupled_histosys fixture produces the most dramatic divergence:
| μ | ROOT q(μ) | pyhf q(μ) | NextStat q(μ) | ROOT−NS |
|---|---|---|---|---|
| 0.5 | 0.057 | 0.057 | 0.057 | ~0 |
| 1.0 | 0.991 | 0.445 | 0.445 | +0.545 |
| 2.0 | 15.526 | 6.543 | 6.543 | +8.98 |
| 3.0 | 41.566 | 19.042 | 19.042 | +22.52 |
ROOT's free fit reports status = -1 (Minuit2 could not determine a positive-definite covariance matrix). The conditional fits at μ ≥ 1.0 diverge catastrophically.
Critical evidence: the NLL offset is not constant. If this were a pure optimizer issue, the offset between ROOT and NextStat would be constant across all μ values. We observe it growing from 420.74 to 432.00 — ruling out a pure optimizer difference and indicating ROOT evaluates the coupled HistoSys likelihood differently at large nuisance parameter values.
| Eval point | ROOT NLL | NextStat NLL | Offset |
|---|---|---|---|
| Free fit (μ̂) | 434.754 | 14.017 | 420.74 |
| μ = 0 | 434.841 | 14.103 | 420.74 |
| μ = 1 | 435.250 | 14.239 | 421.01 |
| μ = 2 | 442.517 | 17.288 | 425.23 |
| μ = 3 | 455.537 | 23.537 | 432.00 |
3.4 Real-World Analyses
| Analysis | Params | max Δq NS vs pyhf | max Δq NS vs ROOT | ROOT issue |
|---|---|---|---|---|
| simple fixture | ~5 | < 1e-8 | 1.6e-10 | None |
| EWK (HEPData) | medium | < 1e-5 | 0.0 | μ̂ = 4.9e23 |
| tttt-prod | 249 | < 1e-5 | 0.04 | Tail convergence |
The EWK analysis is striking: ROOT's unconditional fit diverged so severely that Minuit2 returned μ̂ = 4.9e23 with NLL = 1.8e23. NextStat converges normally to μ̂ = 6.57. Since μ̂ > all scan points, all q̃ values happen to be zero — masking the catastrophic failure.
4. Optimizer Quality on Large Models
Beyond the 3-way comparison, we investigate optimizer behavior on models with >100 nuisance parameters. NextStat (L-BFGS-B) and pyhf (SLSQP) are compared via cross-evaluation: each tool evaluates its objective at the other's best-fit parameters.
4.1 Cross-Evaluation Results
| Model | Params | ΔNLL (NS−pyhf) | ‖grad‖ pyhf | ‖grad‖ NS |
|---|---|---|---|---|
| simple | 3 | 0.0 | < 1e-6 | < 1e-6 |
| complex | 8 | 0.0 | < 1e-6 | < 1e-6 |
| tHu | 184 | -0.081 | 4.63 | 0.020 |
| tttt | 249 | -0.010 | 1.44 | 0.008 |
On 184+ parameter models, NextStat finds NLL values 0.01–0.08 lower. Cross-evaluation confirms objective parity at ~1e-13: the difference is purely optimizer quality.
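The cross-evaluation protocol can be expressed generically. A sketch with stand-in quadratic objectives (the real check swaps the 184- and 249-parameter best-fit vectors between NextStat and pyhf):

```python
import numpy as np

def cross_evaluate(nll_a, nll_b, x_a, x_b):
    """Swap best-fit points between two implementations of the same NLL.
    'parity' checks that both tools agree on the objective at identical
    points; 'gap' measures how much deeper b's minimum is, judged by a."""
    parity = max(abs(nll_a(x_a) - nll_b(x_a)),
                 abs(nll_a(x_b) - nll_b(x_b)))
    gap = nll_a(x_a) - nll_a(x_b)  # > 0: x_b is the deeper minimum
    return parity, gap

# Stand-in: both "tools" implement the same quadratic bowl; optimizer A
# stalled at x_a while optimizer B reached the true minimum at x_b.
nll = lambda x: float(np.dot(x, x))
x_a, x_b = np.full(10, 0.1), np.zeros(10)
parity, gap = cross_evaluate(nll, nll, x_a, x_b)
```

A near-zero parity with a non-zero gap is exactly the signature reported in the table: same landscape, different stopping points.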
4.2 Gradient Norm Analysis
The projected gradient norm at the best-fit point is a direct measure of optimizer quality. At pyhf's optimum on the tHu workspace: ‖grad‖ = 4.63 (not stationary). At NextStat's optimum: ‖grad‖ = 0.020 (stationary to numerical precision).
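The projected gradient norm used here accounts for active bounds; a minimal sketch of the standard definition for box constraints:

```python
import numpy as np

def projected_grad_norm(x, grad, lo, hi):
    """Gradient norm with components zeroed where the optimizer cannot
    move: at an active lower bound, a positive gradient (pushing further
    down) is inactive, and symmetrically at an upper bound."""
    x, g = np.asarray(x, float), np.array(grad, dtype=float)
    g[(x <= lo) & (g > 0)] = 0.0  # would step below the lower bound
    g[(x >= hi) & (g < 0)] = 0.0  # would step above the upper bound
    return float(np.linalg.norm(g))

# Interior point: full norm. Bound-pinned point: that component drops out.
interior = projected_grad_norm([0.5, 0.5], [3.0, 4.0], 0.0, 1.0)
pinned = projected_grad_norm([0.0, 0.5], [3.0, 4.0], 0.0, 1.0)
```

Without the projection, a parameter legitimately pinned at a bound would be misread as non-convergence.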
4.3 Can pyhf Recover the Better Minimum?
When warm-started at NextStat's parameters, pyhf recovers NLL = 179.404 — confirming the minimum is genuine. However, pyhf's multi-start search (20 random inits) only converges in 5/20 runs, with the best result still 0.004 above NextStat's minimum.
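Warm-starting is easy to demonstrate on a stand-in objective (a sketch on the Rosenbrock function, since rerunning the actual 184-parameter workspace is out of scope here; the all-ones minimum plays the role of NextStat's best fit):

```python
import numpy as np
from scipy.optimize import minimize, rosen

# Cold start: SLSQP from a random point with a tight iteration budget,
# mimicking a stall. Warm start: SLSQP from a better optimizer's answer.
rng = np.random.default_rng(0)
cold = minimize(rosen, rng.uniform(-2.0, 2.0, size=8), method="SLSQP",
                options={"maxiter": 30})
warm = minimize(rosen, np.ones(8), method="SLSQP")
```

Started at the deeper minimum, SLSQP confirms it essentially for free; recovering it from a cold start within the same budget is not guaranteed, which mirrors the 5/20 multi-start result above.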
4.4 Technical Explanation
L-BFGS-B maintains a limited-memory inverse-Hessian approximation built from the most recent m = 10 (s, y) pairs, costing O(mn) per iteration while retaining effective curvature information. SLSQP can struggle on high-dimensional HistFactory likelihoods with tight bounds and correlated nuisance parameters, where the NLL surface contains narrow valleys.
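The limited-memory update can be sketched via the standard two-loop recursion (a textbook illustration of how the (s, y) pairs become a search direction, not NextStat's Rust implementation):

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: turn the last m (s, y) = (step, gradient
    change) pairs into an approximate Newton direction in O(mn).
    Uses the standard H0 = (s.y / y.y) * I initial scaling."""
    q = np.array(grad, dtype=float)
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):  # newest first
        rho = 1.0 / np.dot(y, s)
        a = rho * np.dot(s, q)
        q -= a * y
        alphas.append((rho, a, s, y))
    s, y = s_list[-1], y_list[-1]
    q *= np.dot(s, y) / np.dot(y, y)  # initial Hessian scaling
    for rho, a, s, y in reversed(alphas):  # oldest first
        b = rho * np.dot(y, q)
        q += (a - b) * s
    return -q  # quasi-Newton descent direction

# Demo on a 2-d quadratic with curvatures [1, 10]: one gradient step
# supplies a single (s, y) pair, then the recursion proposes a direction.
D = np.array([1.0, 10.0])
x0 = np.array([1.0, 1.0]); g0 = D * x0
x1 = x0 - 0.05 * g0;       g1 = D * x1
d = lbfgs_direction(g1, [x1 - x0], [g1 - g0])
```

Even one curvature pair rescales the ill-conditioned direction, which is the mechanism behind the deeper minima on the 184- and 249-parameter workspaces.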
5. Performance
Profile scan timing (31 μ-points, including free fit):
| Fixture | ROOT | pyhf | NextStat | vs ROOT | vs pyhf |
|---|---|---|---|---|---|
| xmlimport | 0.91 s | 0.23 s | 0.003 s | 303× | 73× |
| multichannel | 1.98 s | 0.26 s | 0.007 s | 283× | 37× |
| coupled_histosys | 1.76 s | 0.15 s | 0.002 s | 880× | 75× |
Speed comes from: compiled Rust (vs Python overhead), reverse-mode AD (vs finite-difference gradients), and zero-allocation hot path with pre-compiled modifier evaluation. CPU: Apple M5 (arm64). GPU toy benchmark: NVIDIA RTX 4000, additional 6.4× over CPU.
6. Validation Methodology
6.1 Seven-Tier Tolerance Contract
| Tier | Metric | Tolerance | Meaning |
|---|---|---|---|
| 1 | Per-bin expected data | 1e-12 | Pure f64 arithmetic |
| 2 | Full expected vector | 1e-8 | Summation order |
| 3 | NLL value | atol=1e-8, rtol=1e-6 | Reduction noise |
| 4 | Gradient | atol=1e-6, rtol=1e-4 | AD vs FD |
| 5 | Best-fit parameters | 2e-4 | Flat near minimum |
| 6 | Uncertainties | 5e-4 | Hessian sensitivity |
| 7 | Toy ensemble stats | 0.03–0.05 | MC noise |
All tiers are CI-gated against pyhf's NumPy backend.
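A tier gate of this kind reduces to per-tier `allclose` calls (a minimal sketch; the tier keys here are illustrative names, with tolerances copied from the table above):

```python
import numpy as np

# Each tier pairs a quantity with its own atol/rtol: tightest for raw
# f64 arithmetic, loosest where MC noise dominates. Illustrative subset.
TIERS = {
    "expected_per_bin": dict(atol=1e-12, rtol=0.0),
    "nll":              dict(atol=1e-8,  rtol=1e-6),
    "gradient":         dict(atol=1e-6,  rtol=1e-4),
}

def gate(tier, ours, reference):
    """CI-style check: does our value match the reference within the
    tier's tolerance contract?"""
    return np.allclose(ours, reference, **TIERS[tier])
```

In CI, a failure at a tight tier with a pass at looser tiers localizes the discrepancy (arithmetic vs reduction vs optimizer) in one run.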
6.2 Why pyhf is the Reference, Not ROOT
- Published in JOSS [5] with peer review
- Used for published ATLAS results
- Maintained by ATLAS physicists (Heinrich, Feickert, Stark)
- Origin of the JSON workspace interchange format
- Tested against ROOT's output in its own CI
The HistFactory specification is defined by the mathematical model [1], not by any software implementation. Validating against pyhf IS validating against the specification.
6.3 Reproducibility
# 3-way profile scan comparison (ROOT + NextStat + optional pyhf scan)
PYTHONPATH=bindings/ns-py/python ./.venv/bin/python tests/validate_root_profile_scan.py \
--histfactory-xml tests/fixtures/pyhf_xmlimport/config/example.xml \
--rootdir tests/fixtures/pyhf_xmlimport \
--include-pyhf --keep
# Cross-evaluation / optimizer diagnostics
PYTHONPATH=bindings/ns-py/python ./.venv/bin/python tests/diagnose_optimizer.py \
--workspace tests/fixtures/workspace_tHu.json \
--workspace tests/fixtures/tttt-prod_workspace.json \
--multi-start 20
# Regenerate figures (reads snapshot data)
python3 scripts/blog/generate_numerical_accuracy_plots.py

7. Discussion
7.1 Implications for Published Results
- OverallSys bias: the 0.6% overestimation at μ = 3.0 translates into a slightly tighter exclusion limit. Negligible for most analyses, but it is a systematic bias that grows with distance from μ̂.
- Coupled HistoSys failure: q(μ) = 41.6 vs 19.0 at μ = 3.0. In the asymptotic approximation: √41.6 − √19.0 = 6.4σ − 4.4σ, a 2.0σ overestimate.
This affects only analyses using coupled HistoSys modifiers at extreme parameter values. The majority of analyses use OverallSys, where ROOT's bias is small.
7.2 The Optimizer Matters
Cross-evaluation (§4) demonstrates unambiguously that the NLL landscape is identical — all differences arise from the optimizer. For <50 parameters, all three optimizers converge to the same minimum. For larger models, the choice of optimizer is a physics-relevant decision.
7.3 What NextStat Does Differently
- Exact gradients: Reverse-mode AD provides machine-precision gradients, eliminating the O(h²) truncation error of finite differences.
- Zero-allocation optimization: The NLL hot path allocates zero heap memory, reducing cache pressure.
- Specification-first validation: Every CI run gates against pyhf, not ROOT. If ROOT disagrees, we investigate — but we don't break CI to match ROOT's bugs.
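The truncation-error point is easy to make concrete (a sketch on f(x) = eˣ, where the exact derivative is known in closed form):

```python
import numpy as np

# Central finite differences carry O(h^2) truncation error; an AD
# gradient has none. Shown on exp(x), whose derivative at x is exp(x).
x = 1.0
exact = np.exp(x)

def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2.0 * h)

err_h2 = abs(central_diff(np.exp, x, 1e-2) - exact)
err_h3 = abs(central_diff(np.exp, x, 1e-3) - exact)
# O(h^2): shrinking h by 10x shrinks the error by ~100x, until roundoff
# takes over at very small h; AD sidesteps the trade-off entirely.
```

Shrinking h further eventually makes things worse (roundoff in the subtraction grows as 1/h), which is why finite-difference step tuning is a hidden fit parameter that exact gradients remove.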
7.4 Practical Safeguards
- Treat a non-zero fit status or a failed covariance matrix as a hard failure
- Cross-check with an independent implementation (pyhf), or cross-evaluate NLL at the best-fit points
- Track a convergence metric (projected gradient norm / KKT residuals) on high-dimensional models
- Warm-start conditional fits along the μ scan from neighboring points
- Record tool versions and fit tolerances with every published limit
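The first safeguard is a one-liner in any SciPy-based pipeline. A sketch; `fit_or_raise` is a hypothetical helper, not an API of any of the three tools:

```python
import numpy as np
from scipy.optimize import minimize, rosen

def fit_or_raise(objective, x0, **kwargs):
    """Hypothetical helper: refuse to hand back parameters from a fit
    whose status is not a clean success, instead of silently using them."""
    res = minimize(objective, x0, **kwargs)
    if not res.success:
        raise RuntimeError(f"fit failed: status={res.status}: {res.message}")
    return res

ok = fit_or_raise(rosen, np.zeros(4), method="L-BFGS-B")

# An iteration-starved fit now fails loudly rather than passing bad numbers
try:
    fit_or_raise(rosen, np.zeros(4), method="L-BFGS-B",
                 options={"maxiter": 1})
    failed_loudly = False
except RuntimeError:
    failed_loudly = True
```

Had ROOT's status = -1 in §3.3 been treated this way, the divergent scan would have been rejected rather than plotted.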
8. Conclusion
We have demonstrated through systematic comparison that:
- NextStat and pyhf agree to < 1e-5 on q(μ) across all canonical fixtures and real-world analyses with up to 249 parameters.
- ROOT/RooFit exhibits three failure modes: optimizer non-convergence (coupled HistoSys), catastrophic divergence (μ̂ = 4.9e23), and systematic q(μ) overestimation (OverallSys, up to 0.6%).
- On large models (>100 params), L-BFGS-B finds NLL values 0.01–0.08 lower than SLSQP, with objective parity verified at ~1e-13.
- NextStat is 283×–880× faster than ROOT and 37×–75× faster than pyhf, with an additional 6.4× from GPU acceleration for toys.
ROOT is the standard for I/O and visualization, but for statistical inference, independent implementations with modern optimizers and rigorous validation against the mathematical specification provide more reliable results.
Versions used for this post
| Component | Version |
|---|---|
| ROOT | 6.38.00 (hist2workspace + RooFit/Minuit2) |
| pyhf | 0.7.6 (SciPy 1.17.0, NumPy 2.4.2) |
| NextStat | 0.9.0 (git c186c29, 2026-02-10) |
| CPU | Apple M5 (arm64), macOS 26.2 |
| GPU (toys) | NVIDIA RTX 4000 |
Timing figures measured on M5 CPU. GPU toy speedup on RTX 4000. Raw data snapshotted under docs/blog/assets/numerical-accuracy/data/.
References
1. K. Cranmer et al., "HistFactory: A tool for creating statistical models for use with RooFit and RooStats," CERN-OPEN-2012-016, 2012.
2. G. Cowan et al., "Asymptotic formulae for likelihood-based tests of new physics," Eur. Phys. J. C 71 (2011) 1554, arXiv:1007.1727.
3. R. Brun and F. Rademakers, "ROOT — An object oriented data analysis framework," Nucl. Instrum. Meth. A 389 (1997) 81–86.
4. F. James and M. Roos, "Minuit — a system for function minimization," Comput. Phys. Commun. 10 (1975) 343–367.
5. L. Heinrich et al., "pyhf: pure-Python implementation of HistFactory statistical models," JOSS 6 (2021) 2823, doi:10.21105/joss.02823.
Validation Scripts
Profile scan: tests/validate_root_profile_scan.py
Optimizer diagnostics: tests/diagnose_optimizer.py
Plot generator: scripts/blog/generate_numerical_accuracy_plots.py
Snapshot data: docs/blog/assets/numerical-accuracy/data/
