NextStat

How NextStat Makes the HistFactory Pipeline Differentiable in PyTorch

Four-layer architecture: SoftHistogram → fused CUDA kernel → Rust GPU sessions → envelope theorem. Train neural networks directly on discovery significance Z₀ with full systematic uncertainties. For the usage/API surface (not the derivations), see the canonical spec.

18 min read · Technical deep-dive
Tags: PyTorch · CUDA · Differentiable · HistFactory · GPU · Envelope Theorem · Zero-Copy
Trust Offensive series: Index · Prev: Numerical Accuracy · Next: Bayesian Benchmarks

Abstract. HEP analyses care about discovery significance Z₀, not classifier accuracy. The standard pipeline (NN → histogram → HistFactory → profiled fit → q₀ → Z₀) is non-differentiable: binning is discrete and profiling introduces an argmin. NextStat solves both with a differentiable histogram, a fused GPU kernel for NLL + analytical gradients, Rust GPU sessions, and the envelope theorem for exact profiled gradients — yielding a PyTorch-native loss that trains networks directly on physics significance. For the usage/API surface (without derivations), see the canonical docs.

1. The problem: the metric lives after inference

A neural network produces a score per event, binned into a histogram that becomes the sufficient statistic for a HistFactory likelihood. The discovery test statistic is:

q₀  =  2 · [ NLL(θ̂₀, μ=0) − NLL(θ̂, μ̂) ]₊
Z₀  =  √q₀

where μ is the signal strength (POI), θ are nuisance parameters, hats denote MLEs, and [·]₊ is one-sided discovery clipping.

This is what physics cares about. But the computational graph breaks differentiability in two places:

  • Histogramming — hard bin edges yield zero gradients almost everywhere
  • Profiling — θ̂(s) = argmin NLL(θ; s) is an optimization subproblem inside the forward pass
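A toy check makes the first breakage concrete (plain Python, illustrative numbers): a hard bin count is piecewise constant in each event score, so a finite-difference probe returns zero away from the bin edges, while a Gaussian-kernel bin responds smoothly.

```python
import math

def hard_bin_count(xs, lo, hi):
    """Hard histogram bin: an indicator sum, piecewise constant in each x."""
    return sum(1.0 for x in xs if lo <= x < hi)

def soft_bin_count(xs, center, sigma):
    """Gaussian-kernel bin: every event contributes smoothly."""
    return sum(math.exp(-0.5 * ((x - center) / sigma) ** 2) for x in xs)

def fd(f, xs, k=0, eps=1e-6):
    """Central finite difference w.r.t. the k-th event's score."""
    up = list(xs); up[k] += eps
    dn = list(xs); dn[k] -= eps
    return (f(up) - f(dn)) / (2 * eps)

xs = [0.12, 0.48, 0.53, 0.91]
g_hard = fd(lambda v: hard_bin_count(v, 0.4, 0.6), xs)  # 0.0: x_0 never crosses an edge
g_soft = fd(lambda v: soft_bin_count(v, 0.5, 0.1), xs)  # nonzero: the kernel responds
```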

2. Notation and objective

Standard HistFactory binned likelihood notation: main bins indexed by i, observed counts n_i, model expected counts ν_i(μ,θ; s) as a sum over samples with modifiers. The signal histogram s ∈ ℝ₊ᴮ is the vector we differentiate with respect to.

NLL(μ, θ; s)  =  Σᵢ [ νᵢ − nᵢ·log(νᵢ) + log(nᵢ!) ]  +  constraints(θ)

q₀(s)  =  2 · [ NLL(μ=0, θ̂₀(s); s) − NLL(μ̂(s), θ̂(s); s) ]₊

Goal:  ∂Z₀/∂s  →  chain rule  →  ∂Z₀/∂w   (network weights)

One-sided discovery convention. The [·]₊ clipping makes q₀ non-differentiable at the boundary where it transitions to zero. NextStat follows the standard convention: if μ̂ < 0 or q₀ clamps to zero, the returned value is q₀ = 0 and the gradient is set to zero (a valid subgradient choice). For Z₀ = √q₀ a small ε is added inside the square root in the PyTorch wrapper to avoid division-by-zero when forming ∂Z₀/∂q₀.
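A minimal scalar sketch of this convention (illustrative, not the library code): clamp q₀ at zero, guard the square root with ε, and return the zero subgradient once the statistic clamps.

```python
import math

def z0_from_q0(q0, eps=1e-12):
    """One-sided convention: clamp q0 at zero; eps keeps dZ0/dq0 finite
    at the boundary. Illustrative scalar version of the clipping above."""
    q_clamped = max(q0, 0.0)
    z0 = math.sqrt(q_clamped + eps)
    # Subgradient choice: zero gradient once the statistic clamps to zero.
    dz_dq = 0.0 if q0 <= 0.0 else 0.5 / math.sqrt(q_clamped + eps)
    return z0, dz_dq

z_pos, g_pos = z0_from_q0(4.0)     # Z0 ≈ 2 with a finite slope
z_zero, g_zero = z0_from_q0(-0.3)  # clamped: Z0 ≈ 0, zero gradient
```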

3. Architecture at a glance

The end-to-end stack that turns a network output into a differentiable Z₀ objective:

  • Layer 1 (PyTorch): SoftHistogram, differentiable binning (KDE / sigmoid)
  • Layer 2 (CUDA / Metal): fused GPU kernel, NLL + analytical ∂NLL/∂s in one launch
  • Layer 3 (ns-compute): Rust GPU session, model buffers resident on device
  • Layer 4 (ns-inference): envelope theorem, exact profiled ∂q₀/∂s without AD through the argmin

PyTorch-native loss → optimizer.step()

What happens when you call loss.backward()

Phase 1 — Fixed-parameter NLL (zero-copy)

  1. nextstat.torch.nll_loss(...) [Python]
  2. NextStatNLL.apply(signal, session, params) [autograd]
  3. _core.DifferentiableSession.nll_grad_signal(...) [PyO3]
  4. ns_inference::DifferentiableSession::nll_grad_signal() [Rust]
  5. DifferentiableAccelerator::nll_grad_wrt_signal() [Rust]
  6. differentiable_nll_grad ← kernel launch [CUDA]

Phase 2 — Profiled q₀ (envelope gradient)

  1. nextstat.torch.profiled_q0_loss(...) [Python]
  2. ProfiledQ0Loss.apply(signal, session) [autograd]
  3. _core.ProfiledDifferentiableSession.profiled_q0_and_grad() [PyO3]
  4. 2× profiled L-BFGS-B fits (CPU-orchestrated) [Rust]
  5. nll_and_grad_params → fused kernel (per iteration) [CUDA]
  6. 2× nll_grad_all at optima → ∂NLL/∂signal [CUDA]
  7. Envelope: ∂q₀/∂s = 2·(∂NLL/∂s|₀ − ∂NLL/∂s|free) [Rust]
  8. Build CUDA tensor from gradient, chain rule [Python]

4. Layer 1 — SoftHistogram: making binning differentiable

HistFactory consumes binned yields, but a neural network emits continuous scores. A hard histogram h_j = Σ_k 𝟙{e_j ≤ x_k < e_{j+1}} is piecewise constant in x_k, so its gradient is zero almost everywhere and undefined at the edges.

Gaussian KDE mode (default)

Each event contributes softly to all bins via a Gaussian kernel centered at the bin center:

w_{kj}  ∝  exp( −½ · ((x_k − c_j) / σ)² )
h_j    =  Σ_k  w̃_{kj}

Per-event normalization: Σ_j w̃_{kj} = 1

Sigmoid-edge mode (faster)

Pr(x ∈ [e_j, e_{j+1}))  ≈  σ((x−e_j)/τ) − σ((x−e_{j+1})/τ)

Cheaper and still differentiable, but typically less smooth than KDE. Both modes produce a differentiable tensor of bin counts → the signal yield vector s.

5. Layer 2 — the fused GPU kernel: NLL + analytical gradients

The core primitive: a fused kernel evaluating the constrained Poisson NLL and analytical gradients w.r.t. both nuisance parameters θ (for L-BFGS-B) and signal yields s (for backpropagation).

HistFactory expected yields

νᵢ(θ; s)  =  Σₐ (nom_{a,i} + Δ_{a,i}(θ)) · F_{a,i}(θ)

F = product of multiplicative modifiers (NormFactor/Lumi/NormSys/ShapeSys/…)
Δ = additive term (HistoSys)

The key gradient identity

For one bin i, define the Poisson NLL term and its derivative:

ℓᵢ(νᵢ)  =  νᵢ − nᵢ·log(νᵢ) + log(nᵢ!)

∂ℓᵢ/∂νᵢ  =  1 − nᵢ/νᵢ   ≡  wᵢ   ("weight")

∂NLL/∂sᵢ  =  wᵢ · F_{signal,i}  =  (1 − nᵢ/νᵢ) · F_{signal,i}

This is exactly what the kernel writes into the signal gradient buffer.
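The identity is easy to verify on a one-bin toy model (hypothetical numbers; here ν = background + F_sig·s), comparing the analytical form against a central finite difference.

```python
import math

def nll_bin(s_i, nom_i, F_sig, n_i):
    """Poisson NLL term for one bin with nu = background + F_sig * s_i (toy model)."""
    nu = nom_i + F_sig * s_i
    return nu - n_i * math.log(nu) + math.lgamma(n_i + 1)  # lgamma(n+1) = log(n!)

nom, F_sig, n, s = 50.0, 1.2, 70.0, 10.0
nu = nom + F_sig * s

analytic = (1.0 - n / nu) * F_sig  # (1 - n_i/nu_i) * F_signal,i
eps = 1e-6
fd = (nll_bin(s + eps, nom, F_sig, n) - nll_bin(s - eps, nom, F_sig, n)) / (2 * eps)
```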

Zero-copy for PyTorch CUDA tensors

  • Python passes signal.data_ptr() (raw CUDA device address) to Rust as a u64
  • Rust passes the pointer into the kernel as g_external_signal
  • Kernel reads signal bins directly from PyTorch GPU memory
  • Kernel writes ∂NLL/∂signal via atomicAdd into an external gradient buffer

What is "zero-copy": The large, per-bin signal vector stays entirely on device. There are still small H↔D transfers for the nuisance parameter vector and the returned scalar NLL — tiny compared to moving the histogram.

CUDA
// Inner loop of differentiable_nll_grad.cu
double w = 1.0 - obs / expected_bin;
atomicAdd(&g_grad_signal_out[sig_local_bin], w * signal_factor);

Modifier coverage

The fused kernel implements all 7 HistFactory modifier types: NormFactor, Lumi, NormSys (code-4 polynomial/exponential), HistoSys (per-bin additive, code-4p), ShapeSys, ShapeFactor, StatError — plus gradients for auxiliary Poisson constraints (Barlow–Beeston γ parameters) and Gaussian constraints.

6. Layer 3 — Rust GPU sessions: keep the model on device

The kernel is called hundreds of times per forward pass (profiled objectives). NextStat serializes the HistFactory model once into flat GPU buffers:

  • Nominal sample yields
  • Sample descriptors (bin ranges, offsets)
  • Modifier descriptors + offset tables + data (NormSys coefficients, HistoSys deltas)
  • Constraint tables

DifferentiableAccelerator (CUDA)

crates/ns-compute/src/differentiable.rs loads the PTX at build time, uploads static model buffers once, allocates reusable device buffers, and launches the kernel with shared memory sized as params[n_params] + scratch[block_size].

The build script compiles the differentiable kernel without --use_fast_math to avoid gradient noise that can harm NN training stability.

Multi-channel signal support

s = [ s⁽⁰⁾ ‖ s⁽¹⁾ ‖ ⋯ ]

Each entry: (sample_idx, first_bin, n_bins)
Kernel maps main-bin index → correct offset in g_external_signal
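The descriptor-driven lookup can be sketched in Python (an illustrative stand-in for the kernel-side mapping; build_signal_bin_map is a hypothetical helper, not the ns-compute API):

```python
def build_signal_bin_map(entries, n_main_bins):
    """Map each global main-bin index to its offset in the concatenated
    signal vector s = [s(0) || s(1) || ...], or None if no signal there.
    `entries` mirrors the (sample_idx, first_bin, n_bins) descriptors."""
    mapping = [None] * n_main_bins
    offset = 0
    for sample_idx, first_bin, n_bins in entries:
        for k in range(n_bins):
            mapping[first_bin + k] = offset + k
        offset += n_bins
    return mapping

# Two channels: signal occupies main bins 0-2 and 5-6.
m = build_signal_bin_map([(0, 0, 3), (3, 5, 2)], n_main_bins=8)
```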

7. Layer 4 — profiled significance + envelope theorem

Profiling introduces a nested optimization. Naively one might unroll optimizer iterations and backprop through them — expensive, memory-heavy, and dependent on optimizer choice. NextStat uses the envelope theorem.

Discovery statistic as two profiled optima

The ProfiledDifferentiableSession computes the free optimum (μ̂, θ̂) and the constrained optimum at μ=0 (θ̂₀), both via bounded L-BFGS-B on CPU with GPU kernel launches for NLL + ∂NLL/∂θ.

Envelope gradient for q₀

For V(s) = min_θ f(θ, s):

  dV/ds = ∂f/∂s |_{θ = θ̂(s)}

Proof sketch:  dV/ds = ∂f/∂s + (∂f/∂θ)·(dθ̂/ds)
                                    ↑
                                  = 0 at optimum

Applying to both terms in q₀:

  ∂q₀/∂s = 2 · ( ∂NLL/∂s |_{μ=0, θ=θ̂₀}  −  ∂NLL/∂s |_{μ=μ̂, θ=θ̂} )

In code, this is: two profile fits → evaluate ∂NLL/∂signal at each optimum → take the difference. The gradient is exact at convergence. If the optimizer fails to converge, NextStat returns an error rather than producing an incorrect gradient.
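The envelope identity can be sanity-checked on a toy quadratic with a closed-form minimizer (illustrative choice: f(θ, s) = (θ − s)² + 0.1θ² + 3s, so θ̂(s) = s/1.1): the partial derivative at the optimum matches the finite difference of the value function.

```python
def f(theta, s):
    """Toy objective with closed-form minimizer theta_hat(s) = s / 1.1."""
    return (theta - s) ** 2 + 0.1 * theta ** 2 + 3.0 * s

def V(s):
    """Value function V(s) = min_theta f(theta, s)."""
    return f(s / 1.1, s)

def df_ds(theta, s):
    """Partial derivative of f w.r.t. s, with theta held fixed."""
    return -2.0 * (theta - s) + 3.0

s = 0.7
# Envelope theorem: plug in theta_hat and ignore d(theta_hat)/ds entirely.
envelope = df_ds(s / 1.1, s)
eps = 1e-6
fd = (V(s + eps) - V(s - eps)) / (2 * eps)
```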

Bounded case. With box constraints (L-BFGS-B), the envelope conclusion holds under standard KKT constraint qualifications. The fitted parameters can be treated as constants when differentiating w.r.t. the external signal histogram.

8. PyTorch autograd: wiring gradients into backprop

Fixed-parameter NLL (Phase 1)

Forward: allocate zeroed grad_signal (atomicAdd requires zeros) → synchronize CUDA streams → call native session with raw device pointers → synchronize → store grad_signal. Backward is the chain rule: ∂L/∂s = (∂L/∂NLL) · (∂NLL/∂s).
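The pattern can be sketched with a toy torch.autograd.Function; ToySession and ToyNLL are hypothetical stand-ins for the native session and the NextStatNLL wrapper, with ∂NLL/∂s computed analytically outside autograd and replayed in backward.

```python
import torch

class ToySession:
    """Stand-in for the native DifferentiableSession: one bin,
    nu = bkg + s; returns NLL and dNLL/ds analytically, as the kernel would."""
    def __init__(self, bkg, obs):
        self.bkg, self.obs = bkg, obs

    def nll_grad_signal(self, s):
        nu = self.bkg + s
        nll = nu - self.obs * torch.log(nu)
        grad = 1.0 - self.obs / nu  # analytic dNLL/ds, computed outside autograd
        return nll.sum(), grad

class ToyNLL(torch.autograd.Function):
    @staticmethod
    def forward(ctx, signal, session):
        nll, grad = session.nll_grad_signal(signal.detach())
        ctx.save_for_backward(grad)       # stash the externally computed gradient
        return nll

    @staticmethod
    def backward(ctx, grad_output):
        (grad,) = ctx.saved_tensors
        return grad_output * grad, None   # chain rule; no grad for the session

session = ToySession(bkg=torch.tensor([50.0]), obs=torch.tensor([70.0]))
s = torch.tensor([10.0], requires_grad=True)
loss = ToyNLL.apply(s, session)
loss.backward()  # s.grad == 1 - 70/60
```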

SignificanceLoss — the user-facing API

python
from nextstat.torch import SoftHistogram, SignificanceLoss

loss_fn = SignificanceLoss(model, signal_sample_name="signal")
soft_hist = SoftHistogram(bin_edges, bandwidth="auto", mode="kde")

scores = classifier(batch)                 # [N]
signal_hist = soft_hist(scores).double()   # [B] float64 for CUDA

loss = loss_fn(signal_hist)               # returns −Z₀
loss.backward()                           # gradients flow to NN weights
optimizer.step()

Why the explicit torch.cuda.synchronize()?

PyTorch and NextStat launch kernels on different CUDA streams. Without a barrier, the NextStat kernel might read an incomplete signal tensor or PyTorch might read an incomplete gradient. Current implementation uses full synchronize() for correctness; a future optimization is stream-local event waits.

9. Experimental validation

For a differentiable inference layer, the gradient is the product. NextStat validates gradients in two ways:

  • Unit/integration tests compare analytical gradients against central finite differences on GPU
  • Benchmark fixtures report max absolute gradient error vs finite differences
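The comparison those tests perform reduces to a generic checker (toy objective; max_grad_error is an illustrative helper, not the test-suite API): perturb one coordinate at a time and report the worst deviation from the analytical gradient.

```python
def max_grad_error(f, grad_f, x, eps=1e-5):
    """Central finite differences vs analytical gradient: returns
    max_i |grad_f(x)[i] - (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)|."""
    g = grad_f(x)
    worst = 0.0
    for i in range(len(x)):
        up = list(x); up[i] += eps
        dn = list(x); dn[i] -= eps
        fd = (f(up) - f(dn)) / (2 * eps)
        worst = max(worst, abs(g[i] - fd))
    return worst

# Toy objective with a known gradient:
f = lambda x: sum(xi ** 2 for xi in x)
grad_f = lambda x: [2 * xi for xi in x]
err = max_grad_error(f, grad_f, [0.3, -1.2, 4.0])
```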

Evidence pack entry point: Benchmark Results.

Gradient accuracy (NLL layer): max |analytical − finite diff| = 2.07 × 10⁻⁹

10. Metal backend (Apple Silicon)

Same algorithm, different constraints:

  • Fused kernel runs in f32 (Apple GPU precision)
  • Inputs/outputs converted at API boundary (PyTorch tensors remain f64)
  • L-BFGS-B tolerance relaxed to 1e−3 (vs 1e−5 on CUDA)
  • Signal histogram uploaded via CPU (no raw pointer interop with MPS tensors)

NLL parity vs CPU f64 reference: ~1e−6 relative level for typical workspaces.

11. Performance notes

The differentiable layer is engineered so that per-iteration NLL + gradients are computed in one fused GPU kernel launch, and CPU work is limited to L-BFGS-B state updates. Profiled objectives require two profile fits per forward pass, but operate on small histograms (O(10–1000) bins), and warm-started fits converge quickly.

12. Limitations and future work

  • Derived inference quantities (μ_up, CLs bands) require differentiating through root-finders — envelope theorem covers q₀ and q_μ only
  • CUDA stream interop can reduce synchronization overhead
  • More aggressive kernel fusion — compute both profiled gradients inside Rust and write directly into PyTorch grad buffer

13. Conclusion

NextStat turns a historically non-differentiable HEP inference pipeline into a PyTorch-friendly loss function by:

  • Replacing hard binning with a differentiable histogram (SoftHistogram)
  • Implementing a fused GPU kernel for HistFactory NLL + analytical gradients
  • Keeping serialized model state resident on device via Rust sessions
  • Using the envelope theorem for exact profiled gradients without AD through the optimizer

This enables training neural networks directly on physics significance — optimizing what the analysis ultimately cares about, not a surrogate proxy.

Versions used for this post

  • NextStat 0.9.0 (git c186c29, 2026-02-10)
  • PyTorch 2.6.0 (CUDA 12.6)
  • ROOT 6.34 (HS3 v0.2 reference)
  • pyhf 0.7.6 (parity baseline)
  • SciPy 1.15.1 (L-BFGS-B reference)
  • Rust 1.93.0
  • CUDA toolkit 12.6 / driver 565.57
  • GPU (CUDA): NVIDIA A100 80 GB
  • GPU (Metal): Apple M3 Max (40-core GPU, 48 GB unified)

Gradient accuracy figures (2.07e−9) were measured on the A100 with the benchmark workspace from Benchmark Results. Metal parity figures (~1e−6 relative) were measured on the M3 Max.

References

  1. G. Cowan, K. Cranmer, E. Gross, O. Vitells. Asymptotic formulae for likelihood-based tests of new physics. Eur. Phys. J. C 71, 1554 (2011).
  2. L. Heinrich et al. pyhf: a pure-Python implementation of HistFactory statistical models. Journal of Open Source Software (JOSS).
  3. R. H. Byrd, P. Lu, J. Nocedal, C. Zhu. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM J. Sci. Comput. 16(5), 1190–1208 (1995). (L-BFGS-B.)
  4. J. M. Danskin. The theory of max-min, with applications. SIAM J. Appl. Math. 14(4) (1966). (Value-function differentiation / Danskin's theorem.)
  5. R. T. Rockafellar, R. J.-B. Wets. Variational Analysis. Springer (1998). (General theory: envelope/value-function differentiation, constrained optima.)

Implementation

CUDA kernel: crates/ns-compute/kernels/differentiable_nll_grad.cu

CUDA session: crates/ns-compute/src/differentiable.rs

Envelope gradient: crates/ns-inference/src/differentiable.rs

PyTorch layer: bindings/ns-py/python/nextstat/torch.py