
Where ROOT Gets It Wrong

A rigorous numerical comparison of HistFactory implementations: ROOT/RooFit, pyhf, and NextStat on profile likelihood scans.

22 min read
Tags: ROOT · pyhf · HistFactory · Validation · Minuit2 · L-BFGS-B
Trust Offensive series: Index · Prev: HEP Benchmark Harness · Next: Differentiable HistFactory

Abstract. We present a systematic 3-way numerical comparison of HistFactory implementations: ROOT/RooFit (Minuit2), pyhf (SLSQP), and NextStat (L-BFGS-B). NextStat and pyhf agree on q(μ) to < 1e-5 across all fixtures. ROOT exhibits three failure modes: optimizer non-convergence, catastrophic free-fit divergence (μ̂ = 4.9e23), and systematic q(μ) overestimation. On large models (>100 params), L-BFGS-B consistently finds deeper minima. All results are fully reproducible.

Key Takeaways

  • ShapeSys: ROOT, pyhf, NextStat agree to < 1e-6 when fits converge
  • OverallSys: ROOT inflates q(μ) by up to 0.6% at high μ (Minuit2 conditional fits)
  • Coupled HistoSys: ROOT diverges by Δq(μ) = 22.5 at μ = 3.0; NLL offset is non-constant → model-level discrepancy
  • Large models: L-BFGS-B finds deeper stationary point (‖grad‖ = 0.020 vs 4.63 on 184-param workspace)
  • Performance: NextStat is 283×–880× faster than ROOT and 37×–75× faster than pyhf; a further 6.4× with GPU for toys on NVIDIA hardware

1. Introduction

The HistFactory model is the standard statistical model for binned analyses at the LHC. Its profile likelihood ratio test statistic q̃(μ) is the foundation of exclusion limits and discovery significances at ATLAS and CMS.

Three independent implementations are widely used:

  • ROOT/RooFit — C++, Minuit2. De facto standard for I/O and visualization at CERN.
  • pyhf — Pure Python, SLSQP. Formal reference for the HistFactory specification (ATLAS ATL-PHYS-PUB-2019-029).
  • NextStat — Rust, L-BFGS-B. Reverse-mode AD, optional CUDA/Metal GPU acceleration.

Do these three implementations produce the same physics results? If not, which are correct? We answer with a rigorous 3-way comparison covering 31-point profile scans on canonical fixtures exercising all principal modifier types.

2. Methodology

2.1 Validation Pipeline

text
HistFactory XML + ROOT histograms
        |
        +---> hist2workspace -> RooFit    -> ROOT profile scan   (C++/PyROOT)
        +---> pyhf.readxml   -> pyhf      -> pyhf profile scan   (Python)
        +---> NextStat import -> Rust     -> NextStat profile scan (Rust)

The profile scan computes q̃(μ) at 31 evenly spaced points in μ ∈ [0, 3]. All three pipelines use bounded optimization with the POI constrained to μ ≥ 0.
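
For concreteness, the same scan is a few lines against pyhf's public API. This is a sketch on a stand-in simplemodels workspace, not the XML fixtures used below:

python
# Minimal scan sketch using pyhf's public API (pyhf 0.7.x). The simplemodels
# workspace is a stand-in for the post's XML fixtures; numbers are illustrative.
import numpy as np
import pyhf

model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0, 10.0], bkg=[50.0, 60.0], bkg_uncertainty=[5.0, 6.0]
)
data = [55.0, 65.0] + model.config.auxdata

init = model.config.suggested_init()
bounds = model.config.suggested_bounds()   # POI lower bound 0, as q-tilde requires
fixed = model.config.suggested_fixed()

scan = np.linspace(0.0, 3.0, 31)           # 31 evenly spaced points in [0, 3]
q_tilde = [
    float(pyhf.infer.test_statistics.qmu_tilde(mu, data, model, init, bounds, fixed))
    for mu in scan
]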

2.2 Test Statistic

              ⎧ 2 · [ NLL(μ, θ̂_μ) − NLL(μ̂, θ̂) ]   if μ̂ ≤ μ
q̃(μ)   =     ⎨
              ⎩ 0                                      if μ̂ > μ

As defined in Cowan et al. [2], implemented identically in all three tools.

2.3 Fixtures

Fixture            Channels   Key Modifiers           Params
xmlimport          1          OverallSys, StatError   ~5
multichannel       2          ShapeSys (Poisson)      ~10
coupled_histosys   2          HistoSys (coupled)      ~5

We additionally test real-world TRExFitter exports with up to 249 nuisance parameters.

2.4 Interpolation Codes

  • NormSys (OverallSys): Code 4 (polynomial, 6th order)
  • HistoSys: Code 4p (piecewise polynomial)

This eliminates interpolation choice as a source of discrepancy.
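
For readers unfamiliar with code 4, here is a self-contained sketch of the NormSys variant (our illustration, not any tool's actual source): exponential extrapolation outside |α| = 1, and a 6th-order polynomial inside whose coefficients are fixed by matching value, first, and second derivatives to the exponentials at α = ±1:

python
# Sketch of NormSys interpolation "code 4": exponential extrapolation for
# |alpha| >= 1, 6th-order polynomial on (-1, 1) matched through second
# derivatives at the boundaries (6 conditions, 6 unknowns).
import numpy as np

def normsys_code4(alpha, lo, hi, nom=1.0):
    up, dn = hi / nom, lo / nom
    if alpha >= 1.0:
        return up ** alpha
    if alpha <= -1.0:
        return dn ** (-alpha)
    lu, ld = np.log(up), np.log(dn)
    # Boundary data: value - 1, f', f'' at alpha = +1 (up^alpha) and -1 (dn^-alpha).
    b = np.array([up - 1.0, up * lu, up * lu**2,
                  dn - 1.0, -dn * ld, dn * ld**2])
    A = np.zeros((6, 6))
    for j in range(1, 7):                       # f(a) = 1 + sum_j c_j a^j
        A[0, j - 1] = 1.0                       # a^j at a = +1
        A[1, j - 1] = j                         # d/da at +1
        A[2, j - 1] = j * (j - 1)               # d2/da2 at +1
        A[3, j - 1] = (-1.0) ** j               # a^j at a = -1
        A[4, j - 1] = j * (-1.0) ** (j - 1)
        A[5, j - 1] = j * (j - 1) * (-1.0) ** (j - 2)
    c = np.linalg.solve(A, b)
    return 1.0 + sum(c[j - 1] * alpha**j for j in range(1, 7))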

2.5 Versions and Environment

ROOT       6.38.00 (hist2workspace + RooFit/Minuit2)
pyhf       0.7.6 (SciPy 1.17.0, NumPy 2.4.2)
NextStat   0.9.0 (git c186c29)
Host       Apple M5 (arm64), macOS 26.2

Raw scan JSON is snapshotted under docs/blog/assets/numerical-accuracy/data/ and rendered into SVGs by scripts/blog/generate_numerical_accuracy_plots.py.

3. Results

3.1 The Good: ShapeSys — Near-Perfect 3-Way Agreement

The multichannel fixture with ShapeSys modifiers achieves the gold standard: all three implementations agree to better than 1e-6 on q(μ):

μ     ROOT q(μ)   pyhf q(μ)   NextStat q(μ)   NS − pyhf   ROOT − NS
1.0   3.10348     3.10348     3.10348         -8e-7       ~0
2.0   15.5334     15.5334     15.5334         -4e-7       -3e-8
3.0   36.2352     36.2352     36.2352         -4e-7       -3e-8

When all three optimizers converge, the implementations produce identical physics results.

Fig. 1 — multichannel (ShapeSys): Δq(μ) residuals across 31 μ-points. All three tools agree to < 1e-6.

3.2 The Bad: OverallSys — ROOT Overestimates at High μ

The xmlimport fixture reveals a systematic bias in ROOT at high signal strengths:

μ     ROOT q(μ)   pyhf q(μ)   NextStat q(μ)   ROOT − NS
1.2   0.01957     0.01956     0.01956         +1e-5
2.0   2.07272     2.06669     2.06669         +6e-3
3.0   9.05788     9.00676     9.00676         +5.1e-2

NextStat and pyhf agree to 1e-7. ROOT overestimates q(μ) by up to 0.051 at μ = 3.0 (0.6% relative), growing linearly with distance from μ̂.

Diagnosis: The NLL offset between ROOT and NextStat is constant (11.06 ± 0.001), confirming identical model evaluation. The growing q(μ) delta is purely an optimizer effect: Minuit2's conditional minimizer converges to a slightly higher NLL at extreme μ values. ROOT would report a tighter exclusion limit than warranted by the data.

Fig. 2 — xmlimport (OverallSys): ROOT's q(μ) overestimation grows with distance from μ̂.

3.3 The Ugly: Coupled HistoSys — ROOT Fails Completely

The coupled_histosys fixture produces the most dramatic divergence:

μ     ROOT q(μ)   pyhf q(μ)   NextStat q(μ)   ROOT − NS
0.5   0.057       0.057       0.057           ~0
1.0   0.991       0.445       0.445           +0.545
2.0   15.526      6.543       6.543           +8.98
3.0   41.566      19.042      19.042          +22.52
Fig. 3 — coupled_histosys: ROOT diverges catastrophically at μ ≥ 1. pyhf and NextStat agree.

ROOT's free fit reports status = -1 (Minuit2 could not determine a positive-definite covariance matrix). The conditional fits at μ ≥ 1.0 diverge catastrophically.

Critical evidence: the NLL offset is not constant. If this were a pure optimizer issue, the offset between ROOT and NextStat would be constant across all μ values. We observe it growing from 420.74 to 432.00 — ruling out a pure optimizer difference and indicating ROOT evaluates the coupled HistoSys likelihood differently at large nuisance parameter values.

Eval point     ROOT NLL   NextStat NLL   Offset
Free fit (μ̂)   434.754    14.017         420.74
μ = 0          434.841    14.103         420.74
μ = 1          435.250    14.239         421.01
μ = 2          442.517    17.288         425.23
μ = 3          455.537    23.537         432.00
Fig. 4 — NLL offset grows from 420.74 to 432.00: model-level discrepancy, not optimizer.

3.4 Real-World Analyses

Analysis         Params   max Δq NS vs pyhf   max Δq NS vs ROOT   ROOT issue
simple fixture   ~5       < 1e-8              1.6e-10             None
EWK (HEPData)    medium   < 1e-5              0.0                 μ̂ = 4.9e23
tttt-prod        249      < 1e-5              0.04                Tail convergence

The EWK analysis is striking: ROOT's unconditional fit diverged so severely that Minuit2 returned μ̂ = 4.9e23 with NLL = 1.8e23. NextStat converges normally to μ̂ = 6.57. Since μ̂ lies above every scan point, q̃(μ) is zero at all of them by construction — masking the catastrophic failure.

4. Optimizer Quality on Large Models

Beyond the 3-way comparison, we investigate optimizer behavior on models with >100 nuisance parameters. NextStat (L-BFGS-B) and pyhf (SLSQP) are compared via cross-evaluation: each tool evaluates its objective at the other's best-fit parameters.

4.1 Cross-Evaluation Results

Model     Params   ΔNLL (NS − pyhf)   ‖grad‖ pyhf   ‖grad‖ NS
simple    3        0.0                < 1e-6        < 1e-6
complex   8        0.0                < 1e-6        < 1e-6
tHu       184      -0.081             4.63          0.020
tttt      249      -0.010             1.44          0.008

On 184+ parameter models, NextStat finds NLL values 0.01–0.08 lower. Cross-evaluation confirms objective parity at ~1e-13: the difference is purely optimizer quality.
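
The cross-evaluation itself takes only a few lines on the pyhf side. In this sketch the model is a stand-in, and nextstat_pars is a hypothetical export of NextStat's best-fit parameters:

python
# Cross-evaluation sketch: evaluate pyhf's objective at the other tool's
# best fit. A lower value there proves the objectives agree and only the
# optimizer differs. `nextstat_pars` stands in for the real export.
import pyhf

model = pyhf.simplemodels.uncorrelated_background([5.0], [50.0], [5.0])
data = [53.0] + model.config.auxdata

pars_pyhf, twice_nll_pyhf = pyhf.infer.mle.fit(data, model, return_fitted_val=True)
nextstat_pars = [float(p) for p in pars_pyhf]   # placeholder for NextStat's output

own   = 0.5 * float(twice_nll_pyhf)
cross = 0.5 * float(pyhf.infer.mle.twice_nll(nextstat_pars, data, model))
# cross < own  =>  the other optimizer found a deeper minimum of the SAME objective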

4.2 Gradient Norm Analysis

The projected gradient norm at the best-fit point is a direct measure of optimizer quality. At pyhf's optimum on the tHu workspace: ‖grad‖ = 4.63 (not stationary). At NextStat's optimum: ‖grad‖ = 0.020 (stationary to numerical precision).
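
By "projected gradient" we mean the bound-aware gradient that L-BFGS-B itself monitors for convergence; a minimal sketch of the projection:

python
# Projected gradient norm for box constraints: a component is zeroed when its
# descent direction points out of the box at an active bound, since that move
# is infeasible. The remainder must vanish at a constrained stationary point.
import numpy as np

def projected_grad_norm(x, grad, lo, hi, eps=1e-12):
    x, g = np.asarray(x, float), np.asarray(grad, float).copy()
    g[(x <= np.asarray(lo) + eps) & (g > 0)] = 0.0  # descent blocked at lower bound
    g[(x >= np.asarray(hi) - eps) & (g < 0)] = 0.0  # descent blocked at upper bound
    return float(np.linalg.norm(g))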

4.3 Can pyhf Recover the Better Minimum?

When warm-started at NextStat's parameters, pyhf recovers NLL = 179.404 — confirming the minimum is genuine. However, pyhf's multi-start search (20 random inits) only converges in 5/20 runs, with the best result still 0.004 above NextStat's minimum.
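
The multi-start protocol is equally simple to sketch (stand-in model; 20 starts as in the text; the convergence bookkeeping is our illustration):

python
# Multi-start sketch: refit from random inits drawn inside the bounds, keep
# the best NLL, and count converged runs.
import numpy as np
import pyhf

model = pyhf.simplemodels.uncorrelated_background([5.0], [50.0], [5.0])
data = [53.0] + model.config.auxdata
bounds = model.config.suggested_bounds()

rng = np.random.default_rng(0)
best, converged = np.inf, 0
for _ in range(20):
    init = [float(rng.uniform(lo, hi)) for lo, hi in bounds]
    try:
        _, twice_nll = pyhf.infer.mle.fit(
            data, model, init_pars=init, par_bounds=bounds, return_fitted_val=True
        )
        best, converged = min(best, 0.5 * float(twice_nll)), converged + 1
    except Exception:
        pass  # pyhf raises on failed minimization; count as non-convergent
print(best, f"{converged}/20 converged")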

4.4 Technical Explanation

L-BFGS-B maintains a limited-memory inverse-Hessian approximation built from the most recent m = 10 (s, y) pairs, at O(mn) cost per iteration while retaining effective curvature information. SLSQP can struggle on high-dimensional HistFactory likelihoods with tight bounds and correlated nuisance parameters, where the NLL surface contains narrow valleys.
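
For reference, the limited-memory update is the classic two-loop recursion (textbook form, unconstrained; L-BFGS-B layers the bound handling on top):

python
# L-BFGS two-loop recursion (Nocedal & Wright form): applies the implicit
# inverse-Hessian approximation to the gradient using only the stored
# (s, y) pairs, never forming the n x n matrix.
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """s_list[i] = x_{k+1} - x_k, y_list[i] = g_{k+1} - g_k, oldest first."""
    q = np.asarray(grad, float).copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):       # newest -> oldest
        rho = 1.0 / np.dot(y, s)
        a = rho * np.dot(s, q)
        q -= a * y
        alphas.append(a)
    if s_list:                                                 # gamma_k scaling
        s, y = s_list[-1], y_list[-1]
        q *= np.dot(s, y) / np.dot(y, y)
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):  # oldest -> newest
        rho = 1.0 / np.dot(y, s)
        b = rho * np.dot(y, q)
        q += (a - b) * s
    return -q   # quasi-Newton descent direction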

5. Performance

Profile scan timing (31 μ-points, including free fit):

Fixture            ROOT     pyhf     NextStat   vs ROOT   vs pyhf
xmlimport          0.91 s   0.23 s   0.003 s    303×      73×
multichannel       1.98 s   0.26 s   0.007 s    283×      37×
coupled_histosys   1.76 s   0.15 s   0.002 s    880×      75×

Speed comes from: compiled Rust (vs Python overhead), reverse-mode AD (vs finite-difference gradients), and zero-allocation hot path with pre-compiled modifier evaluation. CPU: Apple M5 (arm64). GPU toy benchmark: NVIDIA RTX 4000, additional 6.4× over CPU.

6. Validation Methodology

6.1 Seven-Tier Tolerance Contract

Tier   Metric                  Tolerance              Meaning
1      Per-bin expected data   1e-12                  Pure f64 arithmetic
2      Full expected vector    1e-8                   Summation order
3      NLL value               atol=1e-8, rtol=1e-6   Reduction noise
4      Gradient                atol=1e-6, rtol=1e-4   AD vs FD
5      Best-fit parameters     2e-4                   Flat near minimum
6      Uncertainties           5e-4                   Hessian sensitivity
7      Toy ensemble stats      0.03–0.05              MC noise

All tiers are CI-gated against pyhf's NumPy backend.
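
Each tier reduces to one pinned-tolerance comparison; a sketch of the gating pattern (helper name ours):

python
# Tolerance-gate pattern: each tier is one comparison against the pyhf
# reference with pinned atol/rtol; any violation fails CI.
import numpy as np

def gate(tier, ours, reference, atol, rtol=0.0):
    np.testing.assert_allclose(
        ours, reference, atol=atol, rtol=rtol,
        err_msg=f"tier {tier} exceeded its tolerance contract",
    )

gate(3, ours=123.45678901, reference=123.45678902, atol=1e-8, rtol=1e-6)  # passes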

6.2 Why pyhf is the Reference, Not ROOT

  • Published in JOSS [5] with peer review
  • Used for published ATLAS results
  • Maintained by ATLAS physicists (Heinrich, Feickert, Stark)
  • Origin of the JSON workspace interchange format
  • Tested against ROOT's output in its own CI

The HistFactory specification is defined by the mathematical model [1], not by any software implementation. Validating against pyhf IS validating against the specification.

6.3 Reproducibility

bash
# 3-way profile scan comparison (ROOT + NextStat + optional pyhf scan)
PYTHONPATH=bindings/ns-py/python ./.venv/bin/python tests/validate_root_profile_scan.py \
  --histfactory-xml tests/fixtures/pyhf_xmlimport/config/example.xml \
  --rootdir tests/fixtures/pyhf_xmlimport \
  --include-pyhf --keep

# Cross-evaluation / optimizer diagnostics
PYTHONPATH=bindings/ns-py/python ./.venv/bin/python tests/diagnose_optimizer.py \
  --workspace tests/fixtures/workspace_tHu.json \
  --workspace tests/fixtures/tttt-prod_workspace.json \
  --multi-start 20

# Regenerate figures (reads snapshot data)
python3 scripts/blog/generate_numerical_accuracy_plots.py

7. Discussion

7.1 Implications for Published Results

  • OverallSys bias: 0.6% overestimation at μ = 3.0 translates to a slightly tighter exclusion limit. For most analyses negligible, but it is a systematic bias growing with distance from μ̂.
  • Coupled HistoSys failure: q(μ) = 41.6 vs 19.0 at μ = 3.0. In asymptotic approximation: √41.6 − √19.0 = 6.4σ − 4.4σ = 2.0σ overestimate.

This affects only analyses using coupled HistoSys modifiers at extreme parameter values. The majority of analyses use OverallSys, where ROOT's bias is small.

7.2 The Optimizer Matters

Cross-evaluation (§4) demonstrates unambiguously that the NLL landscape is identical — all differences arise from the optimizer. For <50 parameters, all three optimizers converge to the same minimum. For larger models, the choice of optimizer is a physics-relevant decision.

7.3 What NextStat Does Differently

  • Exact gradients: Reverse-mode AD provides machine-precision gradients, eliminating O(h²) truncation error of finite differences.
  • Zero-allocation optimization: NLL hot path allocates zero heap memory, reducing cache pressure.
  • Specification-first validation: Every CI run gates against pyhf, not ROOT. If ROOT disagrees, we investigate — but we don't break CI to match ROOT's bugs.

7.4 Practical Safeguards

  • Treat non-zero fit status / failed covariance as a hard failure
  • Cross-check with an independent implementation (pyhf) or cross-evaluate NLL at best-fit points
  • Track convergence metric (projected gradient norm / KKT residuals) on high-dimensional models
  • Warm-start conditional fits along the μ scan from neighboring points (see the sketch after this list)
  • Record tool versions and fit tolerances with every published limit
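
As a concrete example of the warm-start safeguard, a μ scan where each conditional fit is seeded from its neighbor's solution (sketch on a stand-in model, pyhf 0.7.x API):

python
# Warm-start sketch: seed each fixed-POI fit from the previous scan point's
# best-fit parameters instead of the nominal init.
import numpy as np
import pyhf

model = pyhf.simplemodels.uncorrelated_background([5.0, 10.0], [50.0, 60.0], [5.0, 6.0])
data = [55.0, 65.0] + model.config.auxdata
init = model.config.suggested_init()
bounds = model.config.suggested_bounds()

cond_nll = []
for mu in np.linspace(0.0, 3.0, 31):
    pars, twice_nll = pyhf.infer.mle.fixed_poi_fit(
        mu, data, model, init_pars=init, par_bounds=bounds, return_fitted_val=True
    )
    cond_nll.append(0.5 * float(twice_nll))
    init = [float(p) for p in pars]   # neighboring point seeds the next fit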

8. Conclusion

We have demonstrated through systematic comparison that:

  • NextStat and pyhf agree to < 1e-5 on q(μ) across all canonical fixtures and real-world analyses with up to 249 parameters.
  • ROOT/RooFit exhibits three failure modes: optimizer non-convergence (coupled HistoSys), catastrophic divergence (μ̂ = 4.9e23), and systematic q(μ) overestimation (OverallSys, up to 0.6%).
  • On large models (>100 params), L-BFGS-B finds NLL values 0.01–0.08 lower than SLSQP, verified at ~1e-13 objective parity.
  • NextStat is 283×–880× faster than ROOT and 37×–75× faster than pyhf, with an additional 6.4× GPU acceleration for toys.

ROOT is the standard for I/O and visualization, but for statistical inference, independent implementations with modern optimizers and rigorous validation against the mathematical specification provide more reliable results.

Versions used for this post

ROOT         6.38.00 (hist2workspace + RooFit/Minuit2)
pyhf         0.7.6 (SciPy 1.17.0, NumPy 2.4.2)
NextStat     0.9.0 (git c186c29, 2026-02-10)
CPU          Apple M5 (arm64), macOS 26.2
GPU (toys)   NVIDIA RTX 4000

Timing figures measured on M5 CPU. GPU toy speedup on RTX 4000. Raw data snapshotted under docs/blog/assets/numerical-accuracy/data/.

References

  1. K. Cranmer et al. "HistFactory: A tool for creating statistical models for use with RooFit and RooStats." CERN-OPEN-2012-016, 2012.
  2. G. Cowan et al. "Asymptotic formulae for likelihood-based tests of new physics." Eur. Phys. J. C 71 (2011) 1554. arXiv:1007.1727.
  3. R. Brun, F. Rademakers. "ROOT — An object oriented data analysis framework." Nucl. Instrum. Meth. A 389 (1997) 81-86.
  4. F. James, M. Roos. "Minuit — a system for function minimization." Comput. Phys. Commun. 10 (1975) 343-367.
  5. L. Heinrich et al. "pyhf: pure-Python implementation of HistFactory statistical models." JOSS 6 (2021) 2823. doi:10.21105/joss.02823.

Validation Scripts

Profile scan: tests/validate_root_profile_scan.py

Optimizer diagnostics: tests/diagnose_optimizer.py

Plot generator: scripts/blog/generate_numerical_accuracy_plots.py

Snapshot data: docs/blog/assets/numerical-accuracy/data/