Building a Trustworthy HEP Benchmark Harness
pyhf + ROOT/RooFit — correctness gates, warm-start policies, pinned environments, and auditable artifacts.
2026-02-05 · 10 min read
Benchmarks are easy to get wrong even when nobody is trying to cheat. In HEP, the most dangerous failure mode is simple:
You can "win" by benchmarking the wrong inference.
If two implementations disagree on the likelihood model (interpolation codes, constraints, masking, parameterization), then timing comparisons are meaningless. You are measuring different computations.
This post explains how we build a HEP benchmark harness that treats performance like a scientific claim: correctness gates first, explicit protocols, pinned environments, and artifacts that other people can rerun.
- Public Benchmarks Specification — global benchmark program contract
- Validation Report Artifacts — unified JSON+PDF evidence pack for published snapshots
Abstract. We benchmark workflow-level HistFactory inference tasks (NLL, gradients, fits, scans, toys) across pyhf as the primary reference implementation and (selectively) ROOT/RooFit/RooStats when the workflow can be automated reproducibly. Speed numbers are only emitted after correctness gates pass, and each run produces auditable artifacts.
1. Threat model: how HEP benchmarks lie
The point of the harness is not to produce impressive numbers. It is to prevent these common failure modes.
1.1 Model mismatch (not the same likelihood)
Two implementations can disagree "silently" if you don't pin model conventions:
- Interpolation codes (code4 vs alternatives)
- Constraints and priors
- Masking conventions when nᵢ = 0 or νᵢ → 0 (sketched below)
- Parameter naming/order (especially per-bin modifiers)
- One-sided vs two-sided conventions for test statistics
If the model is not identical, timing comparisons do not mean anything.
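To make the masking bullet concrete: the nᵢ = 0 and νᵢ → 0 edge cases need an explicit rule before two NLLs are even comparable. The clip below is only one common choice (not a claim about what any particular implementation does); the harness only requires that both sides use the same rule.

```python
import numpy as np
from scipy.special import gammaln

def poisson_loglike(n, lam, eps=1e-10):
    """Per-bin Poisson log-likelihood with an explicit nu_i -> 0 convention.

    Clipping lam away from zero avoids -inf for empty predictions; whether you
    clip, mask, or drop such bins is a modeling convention that both
    implementations under test must share.
    """
    n = np.asarray(n, dtype=float)
    lam = np.clip(np.asarray(lam, dtype=float), eps, None)
    return n * np.log(lam) - lam - gammaln(n + 1.0)
```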
1.2 Optimizer mismatch ("faster" because it stopped early)
For fits and scans, you can appear faster by using a looser tolerance, hitting bounds and calling it convergence, or returning a suboptimal point. So the harness treats convergence metadata and cross-evaluation as first-class outputs.
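As a sketch of what "first-class" means here, assuming pyhf with the Minuit optimizer: keep the optimizer's own convergence record next to every timing instead of discarding it.

```python
import pyhf

# Minuit exposes richer convergence metadata (EDM, validity flags) than scipy.
pyhf.set_backend("numpy", "minuit")

model = pyhf.simplemodels.uncorrelated_background(
    signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
)
data = [62.0, 63.0] + model.config.auxdata

bestfit, result = pyhf.infer.mle.fit(data, model, return_result_obj=True)

# Published alongside any timing number; backend-specific fields are guarded
# because their availability depends on the optimizer in use.
minuit_obj = getattr(result, "minuit", None)
convergence = {
    "success": bool(result.success),
    "objective_at_min": float(result.fun),  # twice the NLL for pyhf's default objective
    "edm": float(minuit_obj.fmin.edm) if minuit_obj is not None else None,
}
print(convergence)
```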
1.3 Warm-start mismatch (especially for profile scans)
A profile scan benchmark is mostly a benchmark of your warm-start policy: cold-starting each μ point from a fixed init versus warm-starting from the previous scan point (the standard analyst workflow). If you don't publish the policy, you don't have a benchmark.
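A minimal sketch of the two policies in pyhf (toy model and grid are illustrative); whichever loop you time is part of the published protocol.

```python
import numpy as np
import pyhf

model = pyhf.simplemodels.uncorrelated_background(
    signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
)
data = [62.0, 63.0] + model.config.auxdata
mu_grid = np.linspace(0.0, 3.0, 16)

# Cold start: every scan point begins from the model's suggested init.
cold_scan = [
    float(pyhf.infer.mle.fixed_poi_fit(mu, data, model, return_fitted_val=True)[1])
    for mu in mu_grid
]

# Warm start: each point is seeded with the previous point's best-fit parameters
# (the standard analyst workflow, and usually the faster one).
init = model.config.suggested_init()
warm_scan = []
for mu in mu_grid:
    pars, twice_nll = pyhf.infer.mle.fixed_poi_fit(
        mu, data, model, init_pars=init, return_fitted_val=True
    )
    warm_scan.append(float(twice_nll))
    init = [float(p) for p in pars]
```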
1.4 Environment drift ("works on my machine" performance)
Benchmark numbers move when the compiler/toolchain, the BLAS/GPU stack, Python dependency versions, or the CPU/GPU model changes. So every publishable run captures an environment manifest and dataset hashes.
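The real manifest is broader than this (compiler, BLAS, GPU, full lockfiles), but a minimal sketch of the idea looks like the following; the field names are illustrative, not a schema.

```python
import hashlib
import platform
import sys
from importlib import metadata

def environment_manifest(workspace_path: str) -> dict:
    """Capture a minimal environment + dataset fingerprint for a benchmark run."""
    with open(workspace_path, "rb") as f:
        workspace_sha256 = hashlib.sha256(f.read()).hexdigest()

    versions = {}
    for pkg in ("pyhf", "numpy", "scipy", "iminuit"):
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None

    return {
        "python": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
        "workspace_sha256": workspace_sha256,
    }

# Usage (assuming a local workspace.json): environment_manifest("workspace.json")
```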
1.5 Reporting bias (single-number theater)
Single numbers hide variance and measurement choices. A trustworthy run publishes raw per-repeat timings, an explicit aggregation policy, and the inputs and settings used.
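A sketch of the reporting shape we mean: the raw repeats stay in the artifact, and the aggregation policy is itself a recorded field rather than an implicit choice.

```python
import statistics
import time

def time_repeats(fn, repeats=20, warmup=3):
    """Return raw per-repeat wall-clock times; aggregation is a separate, explicit step."""
    for _ in range(warmup):
        fn()
    raw = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        raw.append(time.perf_counter() - start)
    return raw

raw = time_repeats(lambda: sum(i * i for i in range(100_000)))
report = {
    "raw_seconds": raw,                        # every repeat, not one headline number
    "aggregation": "median",                   # the policy travels with the result
    "median_seconds": statistics.median(raw),
    "repeats": len(raw),
    "warmup": 3,
}
```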
2. What we benchmark (workflow-first)
We benchmark end-to-end HEP workflows that dominate wall-time:
- NLL evaluation at a fixed parameter point (core building block; see the sketch after this list)
- Gradients (required for optimizers and HMC/NUTS-style methods)
- MLE fits (unconditional and conditional)
- Profile likelihood scans (warm-start policy must be explicit)
- Toy ensembles (toys/sec, scaling with parameter count)
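For the first two items, here is roughly what the core kernel looks like in pyhf with a JAX backend, so the same likelihood also yields gradients by autodiff; the toy model is illustrative.

```python
import jax
import jax.numpy as jnp
import pyhf

pyhf.set_backend("jax")  # autodiff backend: NLL and its gradient from one model

model = pyhf.simplemodels.uncorrelated_background(
    signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
)
data = jnp.asarray([62.0, 63.0] + list(model.config.auxdata))
pars = jnp.asarray(model.config.suggested_init())

# NLL evaluation at a fixed parameter point: the core timed kernel.
def twice_nll(p):
    return pyhf.infer.mle.twice_nll(p, data, model)[0]

value = twice_nll(pars)
grad = jax.grad(twice_nll)(pars)  # needed by optimizers and HMC/NUTS-style samplers
```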
3. Correctness gates: fail fast before timing
Every suite run must include correctness gating before it prints timings. Our pyhf-vs-NextStat harness (in-repo):
- Loads a pyhf workspace
- Builds the pyhf model with explicit interpolation settings (code4, code4p)
- Maps parameters by name, not by index
- Evaluates NLL in both implementations
- Fails fast if the NLLs disagree beyond tolerance
Only after that does it print performance numbers.
Reference script: tests/benchmark_pyhf_vs_nextstat.py.
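A condensed sketch of that gating logic follows; the pyhf calls are real, while nextstat_nll stands in as a hypothetical adapter for the second implementation (name-keyed parameters in, NLL out).

```python
import json

import numpy as np
import pyhf

def load_pinned_model(workspace_path: str):
    """Build the pyhf model with the interpolation codes pinned explicitly."""
    with open(workspace_path) as f:
        ws = pyhf.Workspace(json.load(f))
    model = ws.model(
        modifier_settings={
            "normsys": {"interpcode": "code4"},
            "histosys": {"interpcode": "code4p"},
        }
    )
    return model, ws.data(model)

def correctness_gate(workspace_path, nextstat_nll, rtol=1e-9, atol=1e-6):
    """Fail fast (before any timing) if the two implementations' NLLs disagree."""
    model, data = load_pinned_model(workspace_path)
    pars = model.config.suggested_init()

    # Map parameter values by name, never by index (handles per-bin gammas).
    named_pars = {
        name: [float(v) for v in pars[model.config.par_slice(name)]]
        for name in model.config.par_order
    }

    pyhf_nll = -float(model.logpdf(pars, data)[0])
    other_nll = float(nextstat_nll(workspace_path, named_pars))  # hypothetical adapter

    if not np.isclose(pyhf_nll, other_nll, rtol=rtol, atol=atol):
        raise SystemExit(
            f"Correctness gate failed: pyhf NLL {pyhf_nll} vs other NLL {other_nll}"
        )
    return pyhf_nll  # only now is it meaningful to time anything
```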
For evidence-grade publication, a snapshot also ships a validation pack and a machine-readable inventory:
- validation_report.json (schema validation_report_v1)
- snapshot_index.json (schema nextstat.snapshot_index.v1)
Validation pack entry point: /docs/validation-report.
nextstat validation-report \
--apex2 tmp/apex2_master_report.json \
--workspace workspace.json \
--out validation_report.json \
--pdf validation_report.pdf \
--deterministic

4. Parameter mapping and model settings
Parameter mapping by name
Different implementations may order parameters differently (especially with per-bin modifiers like ShapeSys gammas). Our harness maps parameter vectors by parameter name, not by index.
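A sketch of what "by name" means in practice with pyhf's config; the reassembly side assumes you can query the other implementation's layout, which is hypothetical here.

```python
def pack_by_name(model, pars):
    """Flat pyhf parameter vector -> {parameter name: list of values}."""
    return {
        name: [float(v) for v in pars[model.config.par_slice(name)]]
        for name in model.config.par_order
    }

def unpack_for_layout(named_pars, other_order, other_slices):
    """Rebuild a flat vector in another implementation's ordering.

    other_order (a list of names) and other_slices ({name: slice}) describe the
    target layout; how you obtain them is implementation-specific.
    """
    length = max(s.stop for s in other_slices.values())
    flat = [0.0] * length
    for name in other_order:
        flat[other_slices[name]] = named_pars[name]
    return flat
```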
Explicit interpolation codes
HistFactory has multiple interpolation conventions. Benchmarks must pin NormSys and HistoSys interpolation codes. Otherwise you're not benchmarking the same statistical model.
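A small demonstration of why, assuming pyhf's modifier_settings hook: the same spec evaluated under two normsys interpolation codes gives (slightly) different NLL values, i.e. different statistical models.

```python
import pyhf

spec = {
    "channels": [
        {
            "name": "ch",
            "samples": [
                {
                    "name": "bkg",
                    "data": [50.0, 52.0],
                    "modifiers": [
                        {"name": "mu", "type": "normfactor", "data": None},
                        {"name": "sys", "type": "normsys", "data": {"hi": 1.1, "lo": 0.9}},
                    ],
                }
            ],
        }
    ]
}
observed = [55.0, 49.0]

for interpcode in ("code1", "code4"):
    model = pyhf.Model(spec, modifier_settings={"normsys": {"interpcode": interpcode}})
    pars = model.config.suggested_init()
    # Pull the normsys nuisance parameter off zero so the interpolation choice is visible.
    pars[model.config.par_slice("sys").start] = 0.5
    data = observed + model.config.auxdata
    print(interpcode, float(model.logpdf(pars, data)[0]))
```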
5. Profile scans: cold-start vs warm-start is the whole story
Profile scans are a classic place where naive benchmarks lie. The harness must publish: the POI grid, the warm-start policy, bounds and tolerances, and any clipping conventions (one-sided qμ/q₀).
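To make "publish the protocol" concrete, here is a sketch of the one-sided clipping convention together with the minimum scan metadata we would expect in an artifact; the field names are illustrative, not a schema.

```python
def q_mu(twice_nll_at_mu, twice_nll_at_min, mu, mu_hat):
    """One-sided q_mu: zero when mu_hat exceeds mu, and never negative."""
    if mu_hat > mu:
        return 0.0
    return max(twice_nll_at_mu - twice_nll_at_min, 0.0)

scan_protocol = {
    "poi_grid": [round(0.2 * i, 1) for i in range(16)],
    "warm_start": "previous-scan-point",   # or "cold:suggested_init"
    "bounds": "pyhf model.config.suggested_bounds()",
    "optimizer": {"name": "minuit", "tolerance": 0.01, "strategy": 1},
    "test_statistic": "one-sided q_mu (clipped as above)",
}
```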
6. ROOT/RooFit comparisons: what we will (and won't) claim
For ROOT comparisons, we will:
- Publish failure modes and fit-status rates, not only averages
- Publish cross-evaluation checks (evaluate NLL from implementation A at params from B; sketched after this list)
- Keep "did it converge?" as a first-class metric
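A sketch of that cross-evaluation check; fit_a, fit_b, nll_a, and nll_b are hypothetical adapters (workspace in, name-keyed best-fit parameters or an NLL value out), not real pyhf or ROOT APIs.

```python
def cross_evaluate(workspace, fit_a, fit_b, nll_a, nll_b, tol=1e-6):
    """Evaluate each implementation's NLL at the other's best-fit point."""
    pars_a = fit_a(workspace)  # {parameter name: value(s)} from implementation A
    pars_b = fit_b(workspace)  # {parameter name: value(s)} from implementation B

    report = {
        "A_at_A": nll_a(workspace, pars_a),
        "A_at_B": nll_a(workspace, pars_b),
        "B_at_B": nll_b(workspace, pars_b),
        "B_at_A": nll_b(workspace, pars_a),
    }
    # If A's own "minimum" is higher than A evaluated at B's point (beyond tol),
    # A stopped early or settled on a worse point: flag it instead of averaging it away.
    report["A_suboptimal"] = report["A_at_A"] - report["A_at_B"] > tol
    report["B_suboptimal"] = report["B_at_B"] - report["B_at_A"] > tol
    return report
```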
See also: Numerical Accuracy — ROOT vs pyhf vs NextStat, with reproducible evidence.
7. The punchline: rerun me, don't trust me
Our end state is not "we have good numbers". Our end state is: you can rerun the harness on your machine, see the same correctness gates, and compare results with full context. That's how performance becomes evidence.
Related reading
- Trust Offensive: Public Benchmarks
- Numerical Accuracy — ROOT vs pyhf vs NextStat
- Benchmark Snapshots as Products — CI artifacts and baselines
- Validation Report Artifacts
