Compiler-Symbolic vs Hybrid-Neural Approach:
GPU-Accelerated Unbinned Fits in HEP
MoreFit (Langenbruch, EPJC 86, 2026) showed that symbolic differentiation plus JIT-compiled computation graphs yields a ~10× speedup over RooFit CUDA for "clean" unbinned fits. NextStat takes a different path: ONNX-backed normalizing flows (optional, behind the neural feature flag), fused CUDA kernels with analytical gradients, and yield systematics (rate modifiers) from day one. This article dissects both approaches mathematically, compares their architectures, and demonstrates why production physics analyses demand a hybrid pipeline.
February 2026 · ~20 min read
1. Extended Unbinned Likelihood: Mathematical Formulation
Unbinned (event-level) analysis constructs the likelihood directly from the observed events {xᵢ}, i = 1…N, preserving all information that binning would discard. The extended likelihood for a model with P processes:
ℒ(θ) = Pois(N | ν_tot(θ)) · ∏ᵢ₌₁ᴺ f(xᵢ | θ)

where:
  ν_tot(θ) = Σₚ νₚ(θ) — total expected yield
  f(xᵢ | θ) = Σₚ νₚ(θ) · pₚ(xᵢ | θ) / ν_tot(θ) — weighted mixture PDF

Negative log-likelihood:
  NLL(θ) = ν_tot(θ) − Σᵢ log[ Σₚ νₚ(θ) · pₚ(xᵢ | θ) ] + C_constr(θ)
  C_constr(θ) = ½ Σⱼ ((θⱼ − μⱼ) / σⱼ)² — Gaussian constraints
The computational bottleneck is the inner sum over processes for each of the N events. For N = 50,000–500,000 (typical unbinned analyses at LHCb/Belle II) this dominates all other costs. GPUs are ideal here: each event is an independent unit of work.
Minimizing NLL(θ) requires the gradient ∇_θ NLL. This is where the two approaches fundamentally diverge.
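To make the structure concrete, here is a minimal CPU reference of this NLL in Rust. It is an illustrative sketch, not the NextStat API: PDFs are passed as closures returning log p(x|θ) that are assumed already normalized, and the θ-independent constant log N! is dropped.

// Minimal extended-NLL reference (illustrative, not the NextStat API).
fn extended_nll(events: &[f64], yields: &[f64], logps: &[Box<dyn Fn(f64) -> f64>]) -> f64 {
    let nu_tot: f64 = yields.iter().sum();
    let mut nll = nu_tot; // extended Poisson term (log N! dropped, θ-independent)
    for &x in events {
        // stable log Σₚ νₚ·pₚ(x) = logsumexp over (log νₚ + log pₚ(x))
        let terms: Vec<f64> = yields.iter().zip(logps).map(|(nu, lp)| nu.ln() + lp(x)).collect();
        let m = terms.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        nll -= m + terms.iter().map(|t| (t - m).exp()).sum::<f64>().ln();
    }
    nll
}

fn main() {
    // Two-process toy: unit Gaussian "signal" + unit Laplace "background".
    let gauss: Box<dyn Fn(f64) -> f64> =
        Box::new(|x| -0.5 * x * x - 0.5 * (2.0 * std::f64::consts::PI).ln());
    let bkg: Box<dyn Fn(f64) -> f64> = Box::new(|x: f64| 0.5f64.ln() - x.abs());
    let data = [0.1, -0.3, 0.7, 1.2];
    println!("NLL = {:.4}", extended_nll(&data, &[80.0, 20.0], &[gauss, bkg]));
}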
2. The MoreFit Approach: Compiler Optimizations on the Computation Graph
MoreFit (Langenbruch, EPJC 86, 2026) proposes a compiler-based approach to unbinned fits. The core idea: users describe PDFs as a symbolic computation graph, the framework optimizes the graph, then JIT-compiles it into a GPU kernel (OpenCL) or CPU code (LLVM/Clang with SIMD).
2.1 Computation Graph and Optimizations
The PDF is expressed as a DAG (directed acyclic graph) where nodes are arithmetic operations and leaves are parameters θ and observables x. Three key optimizations are applied:
- Parameter-only term hoisting. All subgraphs depending only on θ (not on x) are evaluated once on the CPU and passed to the GPU kernel as constants. For normalized PDFs this includes the normalization integrals, which depend on (μ, σ) but not on x. Result: the GPU kernel computes only the unnormalized PDF; division by the norm uses a pre-computed constant.
- CSE (Common Subexpression Elimination). If (x − μ) appears in two places in the graph, it is computed once. For Crystal Ball PDFs, where (x − μ)/σ is used in both the Gaussian core and the power-law tail, this yields 20–30% fewer arithmetic operations.
- Symbolic differentiation. Partial derivatives ∂NLL/∂θⱼ are computed symbolically from the graph. This gives analytical gradients rather than numerical (finite-difference) or AD (reverse-mode tape) gradients. The gradient is fused into the same GPU kernel, so one pass over the data computes both NLL and ∇NLL. (A sketch of hoisting and CSE follows this list.)
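As referenced above, a minimal sketch of what hoisting and CSE buy, written as hand-transformed Rust for readability rather than MoreFit's generated OpenCL (all names illustrative):

// Hoisting + CSE by hand for a Gaussian log-density (illustrative sketch).
// Naive per-event form: log p = -0.5*((x-mu)/sigma)^2 - ln(sigma) - 0.5*ln(2π)
fn nll_hoisted(events: &[f64], mu: f64, sigma: f64) -> f64 {
    // Hoisted: depends only on θ, computed once instead of N times.
    let inv_sigma = 1.0 / sigma;
    let log_norm = sigma.ln() + 0.5 * (2.0 * std::f64::consts::PI).ln();
    let mut nll = 0.0;
    for &x in events {
        let z = (x - mu) * inv_sigma; // CSE: z shared by value and gradient
        nll += 0.5 * z * z + log_norm;
        // a fused-gradient kernel would also accumulate ∂nll/∂μ = -z*inv_sigma here
    }
    nll
}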
2.2 Result: Convergence in 2–3 Iterations
Numerical derivatives (central finite differences, h = 10⁻⁴) require 2P+1 function evaluations per optimizer iteration. For a model with P = 10 parameters — 21 NLL calls. Symbolic gradients deliver exact derivatives in a single call. With a quasi-Newton optimizer (L-BFGS), MoreFit demonstrates convergence in 2–3 iterations instead of ~85 function calls in RooFit CUDA with numerical derivatives.
Minimization cost:
  MoreFit: K_sym × C_kernel          where K_sym ≈ 2–3 (L-BFGS iterations with exact gradient)
  RooFit:  K_num × (2P+1) × C_kernel where K_num ≈ 15–30, P ≈ 10

  Speedup from gradients: K_num × (2P+1) / K_sym ≈ 30 × 21 / 3 ≈ 210×
  Speedup from kernel (CSE + hoisting): ≈ 2–5×
  Total: ~10× over RooFit CUDA (author benchmarks)
2.3 Architectural Limitations
The compiler approach is optimal for a closed set of analytical PDFs. However:
- Each new PDF family requires writing the formula, proving gradient correctness, and testing.
- Systematic uncertainties (nuisance parameters, rate modifiers) are deferred to "future versions."
- Acceptance corrections and per-event weights/selection filters are not part of the v0.1 contract.
- Binned fits (HistFactory) are out of scope.
- In multi-dimensional cases (>3D), analytical PDFs scale poorly: there are no closed forms for arbitrary correlation structures.
3. The NextStat Approach: Hybrid Parametric-Neural Pipeline
NextStat solves the same NLL(θ) minimization problem, but with fundamentally different architectural priorities: systematics, arbitrary PDFs, production-readiness. This is achieved through a three-tier gradient system and two GPU acceleration paths.
3.1 Three-Tier Gradient System
Tier 1: Reverse-mode AD (ns-ad crate, Tape)
├ Computation graph: Add, Sub, Mul, Div, Ln, Exp, Powf, Max
├ O(1) backward pass in number of parameters
├ Used for: general models, HistFactory, profile likelihood
└ CPU only (Rust, no GPU)
Tier 2: Analytical CUDA/Metal kernels
├ Hand-derived ∂logp/∂θ for each PDF (Gaussian, Exponential,
│ Crystal Ball, Double CB, Chebyshev, Histogram)
├ Fused NLL+grad kernel: 1 pass over data = NLL + ∇NLL
├ Grid-stride loop, shared memory, atomic grad accumulation
├ 1 CUDA block = 1 dataset (batch toy: 1 block = 1 toy)
└ Lockstep L-BFGS-B: all toys at same iteration → 1 kernel launch
Tier 3: Central finite differences (h = 10⁻⁴)
├ For flow PDFs (ONNX = black box, no graph access)
├ 2P+1 calls to eval_logp per iteration
├ NLL reduction on GPU (CudaFlowNllAccelerator)
└ Trade-off: flexibility > speed for flow components

Key insight: NextStat does not choose a single approach — it combines all three within a single model. Parametric components receive analytical gradients (Tier 2), flow components get finite differences (Tier 3), and the HistFactory wrapper uses reverse-mode AD (Tier 1).
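A condensed sketch of how per-component dispatch along these tiers can look; the enum and function below are illustrative, not actual NextStat types:

// Illustrative tier dispatch per model component (not the NextStat API).
enum GradientTier {
    AnalyticKernel,   // Tier 2: fused CUDA/Metal NLL+grad
    ReverseModeAd,    // Tier 1: ns-ad tape on CPU
    FiniteDifference, // Tier 3: 2P+1 evals for ONNX flows
}

fn pick_tier(is_flow: bool, is_binned: bool, gpu: bool) -> GradientTier {
    match (is_flow, is_binned, gpu) {
        (true, _, _) => GradientTier::FiniteDifference,        // ONNX black box: no graph access
        (false, true, _) => GradientTier::ReverseModeAd,       // HistFactory / profile likelihood
        (false, false, true) => GradientTier::AnalyticKernel,  // parametric unbinned on GPU
        (false, false, false) => GradientTier::ReverseModeAd,  // CPU fallback
    }
}

fn main() {
    // e.g. an ONNX flow component still uses finite differences, even on GPU:
    assert!(matches!(pick_tier(true, false, true), GradientTier::FiniteDifference));
}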
3.2 Fused CUDA Kernel for Parametric PDFs
For parametric PDFs, NextStat uses the same principle as MoreFit — analytical derivatives fused into a single kernel. The unbinned_nll_grad.cu kernel (~3,150 lines of CUDA C) includes:
- 6 PDF families with analytical gradients: Truncated Gaussian, Bounded Exponential, Crystal Ball, Double Crystal Ball, Chebyshev (arbitrary order), Histogram (bin lookup).
- Rate modifiers: NormSys (log-normal), WeightSys (Code0 + Code4p polynomial interpolation) — yield systematics integrated directly into the gradient kernel.
- Online logsumexp over processes for a numerically stable mixture likelihood.
- Batch toy architecture: 1 CUDA block = 1 toy dataset. All toys execute in parallel within a single kernel launch. Lockstep L-BFGS-B keeps all toys at the same optimizer iteration.
// Kernel architecture (simplified):
// 1 block = 1 dataset, threads = grid-stride loop over events
extern "C" __global__ void unbinned_batch_nll_grad(
const double* params_flat, // [n_toys × n_params]
const double* obs_flat, // [total_events]
const uint* toy_offsets, // [n_toys + 1]
/* ... model descriptors ... */
double* nll_out, // [n_toys]
double* grad_out, // [n_toys × n_params]
uint n_params, uint n_procs, uint n_toys
) {
uint toy_idx = blockIdx.x;
// Load params[toy_idx] → shared memory
// Grid-stride loop over events[toy_offsets[toy_idx]..toy_offsets[toy_idx+1]]
// For each event:
// logsumexp over processes: log Σₚ νₚ·pₚ(x|θ)
// Accumulate ∂NLL/∂θⱼ via atomicAdd (analytical)
// Add Gaussian constraints + yield term
}

3.3 GPU Flow Session: NLL Reduction for Neural PDFs
For flow PDFs (normalizing flows exported via ONNX), NextStat splits computation into two stages:
Path 1: CPU Flow + GPU Reduce (Phase 3 MVP)
┌──────────────────┐ ┌─────────────────────┐
│ FlowPdf::log_prob│────▶│ H2D upload logp_flat │
│ (CPU / ONNX) │ │ [n_procs × N] │
└──────────────────┘ └──────────┬────────────┘
▼
┌─────────────────────┐
│ CudaFlowNllAccelerator│
│ logsumexp + yields │
│ + constraints │
│ → NLL (scalar) │
└─────────────────────┘
Path 2: CUDA EP Flow + GPU Reduce (D4, zero-copy)
┌──────────────────┐ ┌─────────────────────┐
│ ONNX Runtime │────▶│ CudaSlice<f64> │
│ CUDA EP │ │ (device-resident) │
└──────────────────┘ └──────────┬────────────┘
▼
┌─────────────────────┐
│ nll_device() │
│ (zero-copy path) │
└─────────────────────┘

Gradient for flows: central finite differences with h = 10⁻⁴. Each optimizer iteration requires 2P+1 calls to eval_logp. This is slower than analytical derivatives, but it enables using any PDF that can be exported as ONNX.
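What the reduction stage computes can be written as a short CPU reference in Rust. This is a sketch of the Section 1 math that CudaFlowNllAccelerator evaluates on the GPU, not the accelerator code itself; names and signatures are illustrative:

// CPU reference for the flow NLL reduction (illustrative, mirrors Section 1).
// logp_flat: row-major [n_procs × n_events] buffer of log pₚ(xᵢ) from the flow(s).
fn flow_nll_reduce(logp_flat: &[f64], n_events: usize, yields: &[f64],
                   constraints: &[(f64, f64, f64)]) -> f64 {
    let n_procs = yields.len();
    let nu_tot: f64 = yields.iter().sum();
    let mut nll = nu_tot; // extended Poisson term (constants dropped)
    for i in 0..n_events {
        // logsumexp over processes: log Σₚ νₚ · pₚ(xᵢ)
        let mut m = f64::NEG_INFINITY;
        for p in 0..n_procs {
            m = m.max(yields[p].ln() + logp_flat[p * n_events + i]);
        }
        let mut s = 0.0;
        for p in 0..n_procs {
            s += (yields[p].ln() + logp_flat[p * n_events + i] - m).exp();
        }
        nll -= m + s.ln();
    }
    // Gaussian constraints: ½((θ − μ)/σ)² per constrained parameter (θ, μ, σ)
    for &(theta, mu, sigma) in constraints {
        nll += 0.5 * ((theta - mu) / sigma).powi(2);
    }
    nll
}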
Implementation detail (contract): NextStat GPU kernels are compiled to PTX at build time (nvcc --ptx, see crates/ns-compute/build.rs). The flow “device-resident” path means the reduction can read a GPU logp buffer directly (CudaFlowNllAccelerator::nll_device), while yields/params remain small host→device uploads. This is a fundamentally different JIT class than MoreFit’s DAG→kernel compilation.
4. Gradient Strategies: Mathematical Comparison
Let NLL(θ) = L(θ). We need ∇_θ L for P parameters.
4.1 Symbolic Differentiation (MoreFit)
For each parameter θⱼ:
∂L/∂θⱼ = ∂ν_tot/∂θⱼ − Σᵢ [ Σₚ (∂νₚ/∂θⱼ · pₚ(xᵢ) + νₚ · ∂pₚ(xᵢ)/∂θⱼ) ] / [ Σₚ νₚ · pₚ(xᵢ) ]
          + (θⱼ − μⱼ) / σⱼ²   (constraint term, present only if θⱼ carries a Gaussian constraint)
All derivatives ∂pₚ/∂θⱼ computed symbolically from the graph. Result:
— Precision: machine epsilon (~10⁻¹⁶)
— Cost: 1 pass over data (fused with NLL)
— L-BFGS convergence: 2–3 iterations
— Limitation: only for PDFs with a known analytical form

4.2 Analytical CUDA Kernels (NextStat Tier 2)
Same principle, but hand-implemented in CUDA C rather than compiler-generated. Example: Truncated Gaussian on [a, b]

  log p(x | μ, σ) = −½((x−μ)/σ)² − log σ − log Z
  Z = Φ((b−μ)/σ) − Φ((a−μ)/σ)

  ∂log p/∂μ = (x−μ)/σ² − (φ(zₐ) − φ(z_b)) / (σ · Z)
  ∂log p/∂σ = ((x−μ)² − σ²) / σ³ − (zₐ·φ(zₐ) − z_b·φ(z_b)) / (σ · Z)

  where zₐ = (a−μ)/σ, z_b = (b−μ)/σ, φ = standard normal PDF, Φ = CDF

Similarly for Crystal Ball (5 params: μ, σ, α, n + normalization), Double Crystal Ball (7 params), Chebyshev (arbitrary order).

  Cost: identical to MoreFit for parametric PDFs
  Advantage: rate modifiers (NormSys, WeightSys) integrated in the kernel
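For concreteness, here is a CPU sketch of exactly these formulas in Rust: a reference implementation of the kernel's math, not the CUDA code. Φ is built from the Abramowitz–Stegun erf approximation since Rust's standard library has no erf:

use std::f64::consts::PI;

// Abramowitz & Stegun 7.1.26 erf approximation (|error| < 1.5e-7), sketch-grade.
fn erf(x: f64) -> f64 {
    let s = x.signum();
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
        - 0.284496736) * t + 0.254829592) * t;
    s * (1.0 - poly * (-x * x).exp())
}

fn phi(z: f64) -> f64 { (-0.5 * z * z).exp() / (2.0 * PI).sqrt() } // standard normal PDF
fn cap_phi(z: f64) -> f64 { 0.5 * (1.0 + erf(z / 2f64.sqrt())) }   // standard normal CDF

/// log p and (∂/∂μ, ∂/∂σ) for a Gaussian truncated to [a, b].
fn trunc_gauss_logp_grad(x: f64, mu: f64, sigma: f64, a: f64, b: f64) -> (f64, f64, f64) {
    let (za, zb) = ((a - mu) / sigma, (b - mu) / sigma);
    let z_norm = cap_phi(zb) - cap_phi(za); // Z
    let u = (x - mu) / sigma;
    // includes the ½ln(2π) constant that the display above drops
    let logp = -0.5 * u * u - sigma.ln() - 0.5 * (2.0 * PI).ln() - z_norm.ln();
    let dmu = u / sigma - (phi(za) - phi(zb)) / (sigma * z_norm);
    let dsigma = (u * u - 1.0) / sigma - (za * phi(za) - zb * phi(zb)) / (sigma * z_norm);
    (logp, dmu, dsigma)
}

fn main() {
    let (lp, dmu, dsig) = trunc_gauss_logp_grad(0.2, 0.0, 1.0, -1.0, 1.0);
    println!("logp={lp:.6} dmu={dmu:.6} dsigma={dsig:.6}");
}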
4.3 Reverse-mode AD (NextStat Tier 1, ns-ad Tape)
Records the computation graph (forward pass), then a single backward sweep:

  tape.var(θ₁), tape.var(θ₂), ...
  z = tape.mul(x, y)   // records z = x·y and ∂z/∂x = y, ∂z/∂y = x
  w = tape.add(z, x)   // records w = z+x and ∂w/∂z = 1, ∂w/∂x = 1
  tape.backward(w)     // backward pass: O(|graph|)
  tape.adjoint(θⱼ)     // dw/dθⱼ

  Backward cost: O(|graph|) ≈ O(N·P) — a single backward pass
  Compared to finite differences: O(N·P) vs O(N·P²)
  Limitation: CPU only, no GPU version (yet)
  Used for: HistFactory binned models, profile likelihood
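To make the mechanics concrete, here is a self-contained toy tape in Rust. It is a from-scratch sketch of the same reverse-mode idea, not the ns-ad implementation: each operation records its parent indices and local partials during the forward pass, and a single reverse sweep accumulates adjoints.

// Toy reverse-mode tape (illustrative; ns-ad's real Tape differs in detail).
struct Node { parents: [(usize, f64); 2], n: usize } // (parent index, local partial)

struct Tape { nodes: Vec<Node>, vals: Vec<f64> }

impl Tape {
    fn new() -> Self { Tape { nodes: vec![], vals: vec![] } }
    fn push(&mut self, v: f64, parents: [(usize, f64); 2], n: usize) -> usize {
        self.nodes.push(Node { parents, n });
        self.vals.push(v);
        self.vals.len() - 1
    }
    fn var(&mut self, v: f64) -> usize { self.push(v, [(0, 0.0); 2], 0) }
    fn add(&mut self, a: usize, b: usize) -> usize {
        self.push(self.vals[a] + self.vals[b], [(a, 1.0), (b, 1.0)], 2)
    }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        self.push(self.vals[a] * self.vals[b], [(a, self.vals[b]), (b, self.vals[a])], 2)
    }
    fn ln(&mut self, a: usize) -> usize {
        self.push(self.vals[a].ln(), [(a, 1.0 / self.vals[a]), (0, 0.0)], 1)
    }
    // Single reverse sweep: adjoint[output] = 1, then chain rule node by node.
    fn backward(&self, out: usize) -> Vec<f64> {
        let mut adj = vec![0.0; self.vals.len()];
        adj[out] = 1.0;
        for i in (0..=out).rev() {
            for k in 0..self.nodes[i].n {
                let (p, d) = self.nodes[i].parents[k];
                adj[p] += adj[i] * d;
            }
        }
        adj
    }
}

fn main() {
    let mut t = Tape::new();
    let (x, y) = (t.var(2.0), t.var(3.0));
    let z = t.mul(x, y);   // z = x·y
    let w = t.add(z, x);   // w = x·y + x
    let adj = t.backward(w);
    assert_eq!((adj[x], adj[y]), (4.0, 2.0)); // dw/dx = y+1 = 4, dw/dy = x = 2
    println!("dw/dx = {}, dw/dy = {}", adj[x], adj[y]);
}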
4.4 Finite Differences (NextStat Tier 3)
Central scheme:

  ∂L/∂θⱼ ≈ (L(θ + h·eⱼ) − L(θ − h·eⱼ)) / (2h),   h = 10⁻⁴

  Cost: 2P+1 evaluations of L(θ) per iteration
  Precision: O(h²) ≈ 10⁻⁸ (sufficient for L-BFGS, not for a Hessian)
  Used for: flow PDFs (ONNX = black box)
  Compensation: NLL reduction on GPU, flow eval on GPU (CUDA EP)
  → bottleneck: flow inference, not reduction
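The driver for this scheme is only a few lines; the sketch below shows the central scheme for a black-box NLL (all names illustrative):

// Central finite-difference gradient for a black-box NLL (illustrative).
fn fd_grad(nll: &dyn Fn(&[f64]) -> f64, theta: &[f64], h: f64) -> Vec<f64> {
    let mut grad = vec![0.0; theta.len()];
    let mut t = theta.to_vec();
    for j in 0..theta.len() {
        t[j] = theta[j] + h;
        let up = nll(&t);   // one full flow evaluation per shifted point
        t[j] = theta[j] - h;
        let down = nll(&t);
        t[j] = theta[j];    // restore
        grad[j] = (up - down) / (2.0 * h);
    }
    grad // 2P evaluations here; the +1 in 2P+1 is the NLL value itself
}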
5. Feature Comparison
| Feature | MoreFit v0.1 | NextStat |
|---|---|---|
| Analytical gradients | Symbolic from graph | Hand-derived CUDA + reverse-mode AD |
| GPU backend | OpenCL (vendor-agnostic) | CUDA + Metal (Apple Silicon) |
| JIT kernel compilation | Yes (graph → kernel) | No (kernels compile to PTX at build time; driver may JIT PTX→SASS on newer SMs) |
| Parameter hoisting | Automatic (from graph) | Explicit: closed-form normalization for built-in PDFs + hand-managed constants; no general DAG optimizer |
| CSE | Automatic | Manual (in CUDA C code) |
| Systematics (NormSys) | No (deferred) | Yes, in GPU kernel |
| Systematics (WeightSys) | No (deferred) | Yes (Code0 + Code4p interp) |
| Conditional flows | No | p(x|α): α as context vector |
| Selection / weights | Limited | Selection (ns-root expr). Observed frequency weights supported (ROOT data.weight or Parquet _weight; finite, ≥0, sum(w)>0) |
| Binned fits | No | Full HistFactory + GPU |
| Arbitrary PDFs | Built-in library only | Train flow → ONNX → drop-in |
| Template morphing | No | DCR surrogate (neural) |
| Batch toy fits | Not described | Lockstep L-BFGS-B, 1 kernel |
| Device-resident toys | Not described | Toy sampling → batch fit without H2D obs_flat (H2D only toy_offsets) |
| Validation | Speed benchmarks | GPU↔CPU parity tests (cuda/metal), CLI contract tests, toy-fit guardrails |
| Multi-channel | Not described | Sum NLL/grad across channels |
6. Architectural Justification: Why ONNX over Symbolic JIT
6.1 The Extensibility Problem
MoreFit: each new PDF family → formula → symbolic derivatives → correctness test → integration into codegen. For Crystal Ball this is ~200 lines, for Double Crystal Ball — ~400. For arbitrary multi-dimensional correlation structures — impossible.
NextStat: train a normalizing flow on Monte Carlo data → torch.onnx.export() → file. No new GPU code needed. The flow automatically captures correlations of any dimensionality.
6.2 The Systematics Problem in Unbinned
In real analyses, PDFs depend on nuisance parameters α: p(x | θ, α). MoreFit has no mechanism for morphing PDFs along α.
NextStat: conditional normalizing flow accepts α as a context vector. A single flow is trained on samples at different α values. At fit time, the flow computes log p(x | θ, α) for arbitrary α values — smooth morphing across the entire systematics space without discrete templates.
6.3 The High-Dimensionality Problem
Analytical PDFs scale poorly beyond 3D: no closed forms exist for complex correlation structures. Normalizing flows operate in arbitrary dimension D and automatically model correlations through training. For analyses with multiple observables (m_inv, cos θ, φ, Δη, …) this is a critical advantage.
7. NextStat R&D: Work in Progress
7.1 Device-Resident Toy Pipeline
Current implementation (Phase 1, 1D): sampling on GPU → toy events remain in device memory → batch fit without an H2D copy of the main event buffer. Contract: toy offsets are computed on the host and uploaded as a small buffer of (n_toys + 1) × 4 bytes, while the large obs_flat stays device-resident. API: fit_unbinned_toys_batch_cuda_device(). The offsets buffer is the only host→device transfer.
Note: the current GPU toy sampler is deliberately gated to analytic 1D PDFs (Gaussian/Exponential) to avoid silent “midpoint sampling” fallbacks and preserve correctness guarantees.
7.2 Multi-Channel Batch Fits
Models with multiple channels (categories): NLL = Σ_ch NLL_ch. Each channel has its own set of processes but shares parameters. Implemented via a CudaUnbinnedBatchAccelerator per channel, with NLL/grad summation on the host (see the sketch below). Gaussian constraints are attached to a single channel to avoid double-counting.
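A minimal host-side sketch of that summation; the trait and names are illustrative, not the ns-inference API:

// Host-side multi-channel accumulation (illustrative; real accelerators run on GPU).
trait ChannelAccel {
    fn nll_grad(&self, params: &[f64]) -> (f64, Vec<f64>); // per-channel NLL and ∇NLL
}

fn multi_channel_nll_grad(channels: &[Box<dyn ChannelAccel>], params: &[f64])
    -> (f64, Vec<f64>) {
    let mut nll = 0.0;
    let mut grad = vec![0.0; params.len()];
    for ch in channels {
        // Shared parameter vector: every channel sees the same θ,
        // so NLL values and gradients simply add across channels.
        let (n, g) = ch.nll_grad(params);
        nll += n;
        for (gi, gc) in grad.iter_mut().zip(g) { *gi += gc; }
    }
    // Gaussian constraints must enter exactly once (one channel, or here),
    // never once per channel.
    (nll, grad)
}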
7.3 Metal Backend (Apple Silicon)
Fully parallel implementation in Metal Shading Language for M1/M2/M3. Same lockstep L-BFGS-B, same analytical gradients, f32 (Metal lacks native f64). The accuracy contract is fixed by parity tests: Metal (f32, atomic accumulation) is compared against CPU with relaxed tolerances on the order of 1e−2 abs and ~1e−6 rel (see crates/ns-compute/tests/unbinned_gpu_parity.rs).
7.4 Reverse-mode AD on GPU (Research)
The current ns-ad::Tape runs on CPU. Exploring a GPU version for HistFactory gradient kernels — potentially replacing finite differences for profile likelihood and achieving O(1) gradient cost in the number of systematics instead of O(P).
7.5 Hybrid Analytical + Flow in a Single Kernel
For models where the signal is a flow PDF and the background is parametric (Chebyshev, Exponential), a planned hybrid kernel provides analytical gradients for the background components and finite differences only for the signal flow. This reduces the number of flow evaluations from 2P+1 to 2P_flow+1, where P_flow ≪ P.
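Sketched on the host side below, and illustrative only: the feature is planned, and a real kernel would fuse this on the GPU. The parameter vector is partitioned, an analytic pass fills the parametric gradient slots, and finite differences run only over the flow slots.

// Hybrid gradient assembly (illustrative sketch of the planned scheme).
// Assumes the analytic pass caches flow log-probs at θ, so the mixture
// denominators in the parametric gradients are exact.
fn hybrid_grad(
    nll_full: &dyn Fn(&[f64]) -> f64,             // full NLL incl. flow component
    parametric_grad: &dyn Fn(&[f64]) -> Vec<f64>, // analytic ∇, flow slots left at 0
    theta: &[f64],
    flow_idx: &[usize], // indices of flow parameters (P_flow ≪ P)
    h: f64,
) -> (f64, Vec<f64>) {
    let nll = nll_full(theta);             // the "+1" flow evaluation
    let mut grad = parametric_grad(theta); // 1 fused GPU pass for parametric params
    let mut t = theta.to_vec();
    for &j in flow_idx {
        // 2 flow evaluations per flow parameter → 2·P_flow + 1 total flow evals
        t[j] = theta[j] + h;
        let up = nll_full(&t);
        t[j] = theta[j] - h;
        let down = nll_full(&t);
        t[j] = theta[j];
        grad[j] = (up - down) / (2.0 * h);
    }
    (nll, grad)
}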
8. Performance Analysis: Theoretical Complexity
Notation:
  N = number of events, P = number of parameters, K = optimizer iterations
  C_pdf = cost of one log p(x|θ)
  C_flow = cost of one flow inference (ONNX)

MoreFit (parametric PDF, symbolic gradient):
  T_total = K_sym × N × P × C_pdf
  K_sym ≈ 2–3 (exact gradient → fast L-BFGS convergence)

NextStat Tier 2 (parametric PDF, analytical CUDA gradient):
  T_total = K_ana × N × P × C_pdf
  K_ana ≈ 2–5 (analytical gradient, but no automated CSE/hoisting)

NextStat Tier 3 (flow PDF, finite differences):
  T_total = K_fd × (2P+1) × N × C_flow
  K_fd ≈ 10–30 (less precise gradient → more iterations)

NextStat hybrid (P_param parametric + P_flow flow components):
  T_total = K_hyb × [N × P_param × C_pdf + (2P_flow+1) × N × C_flow]
  K_hyb ≈ 5–15
Conclusion: for purely parametric unbinned fits without systematics, MoreFit and NextStat Tier 2 deliver comparable performance. MoreFit gains an edge through automated graph optimization (CSE, hoisting). But as soon as the model includes rate modifiers (NormSys/WeightSys), neural PDFs (flow/ONNX), or requires a reproducible toy pipeline with batch/lockstep fitting, a compiler-symbolic approach stops being universal, while NextStat remains applicable through its hybrid architecture.
9. Conclusion
MoreFit proves a fundamental thesis: analytical derivatives on GPU yield enormous speedups for simple unbinned fits. Symbolic differentiation + JIT compilation is an elegant and powerful approach. For pure parametric models without systematics, MoreFit achieves ~10× speedup over RooFit CUDA.
But production physics analyses at LHCb, Belle II, ATLAS, CMS require:
- Systematics. In unbinned fits: at least yield rate modifiers (NormSys/WeightSys). In binned/HistFactory fits: shape systematics (ShapeSys/HistoSys, etc.).
- Deterministic dataset definition: selection and observable expressions. Observed per-event weights are supported as non-negative frequency weights via data.weight (ROOT TTrees) or an optional _weight Parquet column; they must be finite, non-negative, and not all zero (sum(w) > 0).
- Arbitrary signal PDFs — analytical forms do not always exist.
- Binned fits — for channel combinations and legacy analyses.
- Validation — GPU↔CPU parity for NLL/∇NLL, CLI contract tests, and toy-based bias/pull/coverage checks.
NextStat's hybrid architecture (analytical CUDA kernels for parametric PDFs + ONNX flows for arbitrary PDFs + reverse-mode AD for binned models) covers all these requirements. The cost is a partial speed reduction for flow components (finite differences instead of analytical). But for production analyses, flexibility and completeness outweigh peak benchmark speed.
Trade-off:
  MoreFit  = max speed × min functionality
  NextStat = (0.3–1.0)× MoreFit speed × full functionality

For production analyses: NextStat covers a task class beyond "closed-form analytical PDFs without systematics" — at the cost of slower flow components (finite differences).
References
- C. Langenbruch, "MoreFit: A More Optimised, Rapid and Efficient Fit," arXiv:2505.12414 (2025)
- NextStat GPU kernels: crates/ns-compute/kernels/unbinned_nll_grad.cu (~3,150 lines, 6 PDFs + rate modifiers + analytical gradients)
- NextStat GPU Flow Session: crates/ns-inference/src/gpu_flow_session.rs (CPU/CUDA EP flow + GPU NLL reduction)
- NextStat Reverse-mode AD: crates/ns-ad/src/tape.rs (computation graph + backward pass)
- NextStat Batch Toy Fits: crates/ns-inference/src/unbinned_gpu_batch.rs (lockstep L-BFGS-B)
- RooFit CUDA backend: L. Moneta et al., ROOT v6.32+ GPU acceleration
