
Compiler-Symbolic vs Hybrid-Neural Approaches: GPU-Accelerated Unbinned Fits in HEP

MoreFit (Langenbruch, EPJC 86, 2026) showed that symbolic differentiation plus JIT-compiled computation graphs yields a ~10× speedup over RooFit CUDA for "clean" unbinned fits. NextStat takes a different path: ONNX-backed normalizing flows (optional, behind the neural feature flag), fused CUDA kernels with analytical gradients, and yield systematics (rate modifiers) from day one. This article dissects both approaches mathematically, compares architectures, and demonstrates why production physics analyses demand a hybrid pipeline.

GPU · UNBINNED · ONNX · CUDA · ANALYTICAL GRADIENTS · SYSTEMATICS

February 2026 · ~20 min read


1. Extended Unbinned Likelihood: Mathematical Formulation

Unbinned (event-level) analysis constructs the likelihood directly from the observed events {xᵢ}, i = 1…N, preserving all information that binning would discard. The extended likelihood for a model with P processes:

ℒ(θ) = Pois(N | ν_tot(θ)) · ∏ᵢ₌₁ᴺ f(xᵢ | θ)

where:
  ν_tot(θ) = Σₚ νₚ(θ)                      — total expected yield
  f(xᵢ | θ) = Σₚ νₚ(θ) · pₚ(xᵢ | θ) / ν_tot(θ)  — weighted mixture PDF

Negative log-likelihood:
  NLL(θ) = ν_tot(θ) − Σᵢ log[ Σₚ νₚ(θ) · pₚ(xᵢ | θ) ] + C_constr(θ)

  C_constr(θ) = ½ Σⱼ ((θⱼ − μⱼ) / σⱼ)²    — Gaussian constraints

The computational bottleneck is the inner sum over processes for each of N events. For N = 50,000–500,000 (typical unbinned analyses at LHCb/Belle II) this dominates all other costs. GPUs are ideal: each event is an independent unit of work.
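To make the hot loop concrete, here is a minimal CPU sketch of this NLL (an illustration only: the Process struct and closures are hypothetical, not NextStat's API, and the Gaussian constraint term is omitted):

// Extended unbinned NLL: ν_tot − Σᵢ log Σₚ νₚ·pₚ(xᵢ|θ)   (Rust sketch)
struct Process {
    yield_nu: f64,                    // ν_p(θ), expected yield
    log_pdf: Box<dyn Fn(f64) -> f64>, // log p_p(x | θ), normalized over the fit range
}

fn extended_nll(events: &[f64], procs: &[Process]) -> f64 {
    let nu_tot: f64 = procs.iter().map(|p| p.yield_nu).sum();
    let mut nll = nu_tot; // Poisson yield term (θ-independent constants dropped)
    for &x in events {
        // Inner sum over processes: the O(N × P) hot loop,
        // done in log space (logsumexp) for numerical stability.
        let terms: Vec<f64> = procs.iter()
            .map(|p| p.yield_nu.ln() + (p.log_pdf)(x))
            .collect();
        let m = terms.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        nll -= m + terms.iter().map(|t| (t - m).exp()).sum::<f64>().ln();
    }
    nll
}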

Minimizing NLL(θ) requires the gradient ∇θNLL. This is where the two approaches fundamentally diverge.


2. The MoreFit Approach: Compiler Optimizations on the Computation Graph

MoreFit (Langenbruch, EPJC 86, 2026) proposes a compiler-based approach to unbinned fits. The core idea: users describe PDFs as a symbolic computation graph, the framework optimizes the graph, then JIT-compiles it into a GPU kernel (OpenCL) or CPU code (LLVM/Clang with SIMD).

2.1 Computation Graph and Optimizations

The PDF is expressed as a DAG (directed acyclic graph) where nodes are arithmetic operations and leaves are parameters θ and observables x. Three key optimizations are applied:

  • Parameter-only term hoisting. All subgraphs depending only on θ (no x) are evaluated once on CPU and passed to the GPU kernel as constants. For normalized PDFs, this includes normalization integrals — they depend on (μ, σ) but not on x. Result: the GPU kernel computes only the unnormalized PDF; division by the norm uses a pre-computed constant.
  • CSE (Common Subexpression Elimination). If (x − μ) appears in two places in the graph, it is computed once. For Crystal Ball PDFs, where (x − μ)/σ is used in both the Gaussian core and the power-law tail, this yields 20–30% fewer arithmetic operations. (Both hoisting and CSE are sketched after this list.)
  • Symbolic differentiation. Partial derivatives ∂NLL/∂θⱼ are computed symbolically from the graph. This gives analytical gradients rather than numerical (finite differences) or AD (reverse-mode tape). The gradient is integrated into the same GPU kernel — one pass over the data computes both NLL and ∇NLL.
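A minimal sketch of what hoisting and CSE buy for a truncated-Gaussian NLL (illustrative Rust, not MoreFit's generated OpenCL; Φ is built from erf via the libm crate, an assumed dependency):

use libm::erf; // assumed dependency: libm = "0.2"

fn phi_cdf(z: f64) -> f64 {
    0.5 * (1.0 + erf(z / std::f64::consts::SQRT_2))
}

// Truncated-Gaussian NLL with parameter-only work hoisted out of the
// event loop (MoreFit derives this automatically from the DAG).
fn nll_trunc_gauss(events: &[f64], mu: f64, sigma: f64, a: f64, b: f64) -> f64 {
    // Hoisting: depends only on θ = (μ, σ) and the bounds, so it is
    // evaluated once per NLL call, not once per event.
    let inv_sigma = 1.0 / sigma;
    let z_norm = phi_cdf((b - mu) * inv_sigma) - phi_cdf((a - mu) * inv_sigma);
    let log_norm = sigma.ln() + z_norm.ln(); // normalization integral, hoisted
    let mut nll = 0.0;
    for &x in events {
        // CSE: (x − μ)/σ computed once, reused by every term that needs it.
        let t = (x - mu) * inv_sigma;
        nll += 0.5 * t * t + log_norm; // −log p(x | μ, σ)
    }
    nll
}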

2.2 Result: Convergence in 2–3 Iterations

Numerical derivatives (central finite differences, h = 10⁻⁴) require 2P+1 function evaluations per optimizer iteration. For a model with P = 10 parameters — 21 NLL calls. Symbolic gradients deliver exact derivatives in a single call. With a quasi-Newton optimizer (L-BFGS), MoreFit demonstrates convergence in 2–3 iterations instead of ~85 function calls in RooFit CUDA with numerical derivatives.

Minimization cost:
  MoreFit:  K_sym × C_kernel    where K_sym ≈ 2–3 (L-BFGS iterations with exact gradient)
  RooFit:   K_num × (2P+1) × C_kernel   where K_num ≈ 15–30, P ≈ 10

  Speedup from gradients:  K_num × (2P+1) / K_sym ≈ 30 × 21 / 3 ≈ 210×
  Speedup from kernel (CSE + hoisting): ≈ 2–5×
  Total: ~10× end-to-end over RooFit CUDA (author benchmarks); the measured total is smaller than the theoretical product because overall runtime includes costs the gradient strategy does not touch

2.3 Architectural Limitations

The compiler approach is optimal for a closed set of analytical PDFs. However:

  • Each new PDF family requires writing the formula, proving gradient correctness, testing.
  • Systematic uncertainties (nuisance parameters, rate modifiers) are deferred to "future versions."
  • Acceptance corrections and per-event weights/selection filters are not specified as part of the v0.1 contract.
  • Binned fits (HistFactory) are out of scope.
  • In multi-dimensional cases (>3D), analytical PDFs scale poorly — no closed forms for arbitrary correlations.

3. The NextStat Approach: Hybrid Parametric-Neural Pipeline

NextStat solves the same NLL(θ) minimization problem, but with fundamentally different architectural priorities: systematics, arbitrary PDFs, production-readiness. This is achieved through a three-tier gradient system and two GPU acceleration paths.

3.1 Three-Tier Gradient System

Tier 1: Reverse-mode AD (ns-ad crate, Tape)
  ├ Computation graph: Add, Sub, Mul, Div, Ln, Exp, Powf, Max
  ├ O(1) backward pass in number of parameters
  ├ Used for: general models, HistFactory, profile likelihood
  └ CPU only (Rust, no GPU)

Tier 2: Analytical CUDA/Metal kernels
  ├ Hand-derived ∂logp/∂θ for each PDF (Gaussian, Exponential,
  │   Crystal Ball, Double CB, Chebyshev, Histogram)
  ├ Fused NLL+grad kernel: 1 pass over data = NLL + ∇NLL
  ├ Grid-stride loop, shared memory, atomic grad accumulation
  ├ 1 CUDA block = 1 dataset (batch toy: 1 block = 1 toy)
  └ Lockstep L-BFGS-B: all toys at same iteration → 1 kernel launch

Tier 3: Central finite differences (h = 10⁻⁴)
  ├ For flow PDFs (ONNX = black box, no graph access)
  ├ 2P+1 calls to eval_logp per iteration
  ├ NLL reduction on GPU (CudaFlowNllAccelerator)
  └ Trade-off: flexibility > speed for flow components

Key insight: NextStat does not choose a single approach — it combines all three within a single model. Parametric components receive analytical gradients (Tier 2), flow components get finite differences (Tier 3), and the HistFactory wrapper uses reverse-mode AD (Tier 1).
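Schematically, the routing looks like this (hypothetical Rust types for illustration; the real crates organize this differently):

enum Component {
    Parametric,        // Gaussian, Crystal Ball, Chebyshev, ...
    Flow,              // normalizing flow behind ONNX
    BinnedHistFactory, // binned/template model
}

enum GradTier {
    AnalyticKernel,         // Tier 2: fused CUDA/Metal NLL+grad
    ReverseModeTape,        // Tier 1: ns-ad tape on CPU
    CentralDiff { h: f64 }, // Tier 3: black-box ONNX evaluation
}

fn tier_for(c: &Component) -> GradTier {
    match c {
        Component::Parametric => GradTier::AnalyticKernel,
        Component::Flow => GradTier::CentralDiff { h: 1e-4 },
        Component::BinnedHistFactory => GradTier::ReverseModeTape,
    }
}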

3.2 Fused CUDA Kernel for Parametric PDFs

For parametric PDFs, NextStat uses the same principle as MoreFit — analytical derivatives fused into a single kernel. The unbinned_nll_grad.cu kernel (~3,150 lines of CUDA C) includes:

  • 6 PDF families with analytical gradients: Truncated Gaussian, Bounded Exponential, Crystal Ball, Double Crystal Ball, Chebyshev (arbitrary order), Histogram (bin lookup).
  • Rate modifiers: NormSys (log-normal), WeightSys (Code0 + Code4p polynomial interpolation) — yield systematics integrated directly into the gradient kernel.
  • Online logsumexp over processes for numerically stable mixture likelihood.
  • Batch toy architecture: 1 CUDA block = 1 toy dataset. All toys execute in parallel within a single kernel launch. Lockstep L-BFGS-B: all toys at the same optimizer iteration.
// Kernel architecture (simplified):
// 1 block = 1 dataset, threads = grid-stride loop over events

extern "C" __global__ void unbinned_batch_nll_grad(
    const double* params_flat,    // [n_toys × n_params]
    const double* obs_flat,       // [total_events]
    const uint*   toy_offsets,    // [n_toys + 1]
    /* ... model descriptors ... */
    double* nll_out,              // [n_toys]
    double* grad_out,             // [n_toys × n_params]
    uint n_params, uint n_procs, uint n_toys
) {
    uint toy_idx = blockIdx.x;
    // Load params[toy_idx] → shared memory
    // Grid-stride loop over events[toy_offsets[toy_idx]..toy_offsets[toy_idx+1]]
    //   For each event:
    //     logsumexp over processes: log Σₚ νₚ·pₚ(x|θ)
    //     Accumulate ∂NLL/∂θⱼ via atomicAdd (analytical)
    // Add Gaussian constraints + yield term
}

3.3 GPU Flow Session: NLL Reduction for Neural PDFs

For flow PDFs (normalizing flows exported via ONNX), NextStat splits computation into two stages:

Path 1: CPU Flow + GPU Reduce (Phase 3 MVP)
  ┌──────────────────┐     ┌─────────────────────┐
  │ FlowPdf::log_prob│────▶│ H2D upload logp_flat │
  │    (CPU / ONNX)  │     │   [n_procs × N]      │
  └──────────────────┘     └──────────┬────────────┘
                                      ▼
                           ┌─────────────────────┐
                           │ CudaFlowNllAccelerator│
                           │   logsumexp + yields  │
                           │   + constraints       │
                           │   → NLL (scalar)      │
                           └─────────────────────┘

Path 2: CUDA EP Flow + GPU Reduce (D4, zero-copy)
  ┌──────────────────┐     ┌─────────────────────┐
  │ ONNX Runtime     │────▶│ CudaSlice<f64>       │
  │  CUDA EP         │     │ (device-resident)    │
  └──────────────────┘     └──────────┬────────────┘
                                      ▼
                           ┌─────────────────────┐
                           │ nll_device()         │
                           │   (zero-copy path)   │
                           └─────────────────────┘

Gradient for flows: central finite differences, h = 10⁻⁴. Each optimizer iteration requires 2P+1 calls to eval_logp. Slower than analytical derivatives, but enables using any PDF exported as ONNX.

Implementation detail (contract): NextStat GPU kernels are compiled to PTX at build time (nvcc --ptx, see crates/ns-compute/build.rs). The flow "device-resident" path means the reduction can read a GPU logp buffer directly (CudaFlowNllAccelerator::nll_device), while yields/params remain small host→device uploads. This is a fundamentally different compilation model from MoreFit's runtime DAG→kernel JIT.


4. Gradient Strategies: Mathematical Comparison

Let NLL(θ) = L(θ). We need ∇θL for P parameters.

4.1 Symbolic Differentiation (MoreFit)

For each parameter θⱼ:

  ∂L/∂θⱼ = ∂ν_tot/∂θⱼ − Σᵢ [ Σₚ (∂νₚ/∂θⱼ · pₚ(xᵢ) + νₚ · ∂pₚ/∂θⱼ) ] / [ Σₚ νₚ · pₚ(xᵢ) ]
            + (θⱼ − μⱼ) / σⱼ²    (the constraint sum contributes only its j-th term)

All derivatives ∂pₚ/∂θⱼ computed symbolically from the graph. Result:
  — Precision: machine epsilon (~10⁻¹⁶)
  — Cost: 1 pass over data (fused with NLL)
  — L-BFGS convergence: 2–3 iterations
  — Limitation: only for PDFs with known analytical form

4.2 Analytical CUDA Kernels (NextStat Tier 2)

Same principle, but hand-implemented in CUDA C rather than compiler-generated.

Example: Truncated Gaussian on [a, b]

  log p(x | μ, σ) = −½((x−μ)/σ)² − log σ − ½ log 2π − log Z
  Z = Φ((b−μ)/σ) − Φ((a−μ)/σ)

  ∂log p/∂μ = (x−μ)/σ² − (φ(zₐ) − φ(z_b)) / (σ · Z)
  ∂log p/∂σ = ((x−μ)² − σ²) / σ³ − (zₐ·φ(zₐ) − z_b·φ(z_b)) / (σ · Z)

  where zₐ = (a−μ)/σ, z_b = (b−μ)/σ, φ = standard normal PDF, Φ = CDF

Similarly for Crystal Ball (5 params: μ, σ, α, n + normalization),
Double Crystal Ball (7 params), Chebyshev (arbitrary order).

Cost: identical to MoreFit for parametric PDFs
Advantage: rate modifiers (NormSys, WeightSys) integrated in kernel
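Transcribing the formulas above into code (a CPU sketch using erf from the libm crate; the production version lives in the CUDA kernel):

use libm::erf; // assumed dependency: libm = "0.2"
use std::f64::consts::{PI, SQRT_2};

fn phi_pdf(z: f64) -> f64 { (-0.5 * z * z).exp() / (2.0 * PI).sqrt() }
fn phi_cdf(z: f64) -> f64 { 0.5 * (1.0 + erf(z / SQRT_2)) }

// Returns (log p, ∂log p/∂μ, ∂log p/∂σ) for a Gaussian truncated to [a, b].
fn trunc_gauss_logp_grad(x: f64, mu: f64, sigma: f64, a: f64, b: f64)
    -> (f64, f64, f64)
{
    let za = (a - mu) / sigma;
    let zb = (b - mu) / sigma;
    let z_norm = phi_cdf(zb) - phi_cdf(za); // Z
    let t = (x - mu) / sigma;
    let logp = -0.5 * t * t - sigma.ln() - 0.5 * (2.0 * PI).ln() - z_norm.ln();
    // ∂log p/∂μ = (x−μ)/σ² − (φ(zₐ) − φ(z_b)) / (σ·Z)
    let dmu = t / sigma - (phi_pdf(za) - phi_pdf(zb)) / (sigma * z_norm);
    // ∂log p/∂σ = ((x−μ)² − σ²)/σ³ − (zₐ·φ(zₐ) − z_b·φ(z_b)) / (σ·Z)
    let dsigma = (t * t - 1.0) / sigma
        - (za * phi_pdf(za) - zb * phi_pdf(zb)) / (sigma * z_norm);
    (logp, dmu, dsigma)
}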

4.3 Reverse-mode AD (NextStat Tier 1, ns-ad Tape)

Records computation graph (forward pass), then single backward sweep:

  tape.var(θ₁), tape.var(θ₂), ...
  z = tape.mul(x, y)    // records z = x·y and ∂z/∂x = y, ∂z/∂y = x
  w = tape.add(z, x)    // records w = z+x and ∂w/∂z = 1, ∂w/∂x = 1
  tape.backward(w)       // backward pass: O(|graph|)
  tape.adjoint(θⱼ)      // dw/dθⱼ

Backward cost: O(|graph|) ≈ O(N·P) — single backward pass
Compared to finite differences: O(N·P) vs O(N·P²)
Limitation: CPU only, no GPU version (yet)
Used for: HistFactory binned models, profile likelihood
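For readers unfamiliar with tape-based AD, here is a self-contained miniature of the technique (an illustration only, not ns-ad's actual implementation):

// Minimal reverse-mode tape: record local derivatives on the forward
// pass, then one reverse sweep yields ALL partials at once.
#[derive(Clone, Copy)]
struct Node {
    parents: [usize; 2],
    local: [f64; 2], // ∂(this node)/∂parent
}

struct Tape { nodes: Vec<Node>, vals: Vec<f64> }

impl Tape {
    fn new() -> Self { Tape { nodes: vec![], vals: vec![] } }
    fn push(&mut self, v: f64, parents: [usize; 2], local: [f64; 2]) -> usize {
        self.nodes.push(Node { parents, local });
        self.vals.push(v);
        self.vals.len() - 1
    }
    fn var(&mut self, v: f64) -> usize { self.push(v, [0, 0], [0.0, 0.0]) }
    fn add(&mut self, a: usize, b: usize) -> usize {
        let v = self.vals[a] + self.vals[b];
        self.push(v, [a, b], [1.0, 1.0])
    }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        let (va, vb) = (self.vals[a], self.vals[b]);
        self.push(va * vb, [a, b], [vb, va]) // ∂(ab)/∂a = b, ∂(ab)/∂b = a
    }
    // One reverse sweep over the graph: O(|graph|) for the full gradient.
    fn backward(&self, out: usize) -> Vec<f64> {
        let mut adj = vec![0.0; self.nodes.len()];
        adj[out] = 1.0;
        for i in (0..self.nodes.len()).rev() {
            let Node { parents, local } = self.nodes[i];
            adj[parents[0]] += local[0] * adj[i];
            adj[parents[1]] += local[1] * adj[i];
        }
        adj
    }
}

fn main() {
    let mut t = Tape::new();
    let (x, y) = (t.var(2.0), t.var(3.0));
    let z = t.mul(x, y);     // z = x·y
    let w = t.add(z, x);     // w = x·y + x
    let adj = t.backward(w);
    assert_eq!(adj[x], 4.0); // ∂w/∂x = y + 1
    assert_eq!(adj[y], 2.0); // ∂w/∂y = x
}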

4.4 Finite Differences (NextStat Tier 3)

Central scheme:

  ∂L/∂θⱼ ≈ (L(θ + h·eⱼ) − L(θ − h·eⱼ)) / (2h)    h = 10⁻⁴

Cost: 2P+1 evaluations of L(θ) per iteration
Precision: O(h²) ≈ 10⁻⁸ (sufficient for L-BFGS, not for Hessian)

Used for: flow PDFs (ONNX = black box)
Compensation: NLL reduction on GPU, flow eval on GPU (CUDA EP)
  → bottleneck: flow inference, not reduction
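The scheme in code (a generic helper sketch with illustrative names, applicable to any black-box NLL such as an ONNX flow):

// Central finite-difference gradient around a black-box NLL.
fn grad_central_fd(nll: &dyn Fn(&[f64]) -> f64, theta: &[f64], h: f64) -> Vec<f64> {
    let mut g = vec![0.0; theta.len()];
    let mut tp = theta.to_vec();
    for j in 0..theta.len() {
        let orig = tp[j];
        tp[j] = orig + h; let up = nll(&tp); // L(θ + h·eⱼ)
        tp[j] = orig - h; let dn = nll(&tp); // L(θ − h·eⱼ)
        tp[j] = orig;
        g[j] = (up - dn) / (2.0 * h);        // O(h²) truncation error
    }
    g // 2P evaluations; +1 more when the optimizer also needs L(θ) itself
}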

5. Feature Comparison

Feature                 | MoreFit v0.1             | NextStat
------------------------|--------------------------|------------------------------------------
Analytical gradients    | Symbolic from graph      | Hand-derived CUDA + reverse-mode AD
GPU backend             | OpenCL (vendor-agnostic) | CUDA + Metal (Apple Silicon)
JIT kernel compilation  | Yes (graph → kernel)     | No (kernels compile to PTX at build time; driver may JIT PTX→SASS on newer SMs)
Parameter hoisting      | Automatic (from graph)   | Explicit: closed-form normalization for built-in PDFs + hand-managed constants; no general DAG optimizer
CSE                     | Automatic                | Manual (in CUDA C code)
Systematics (NormSys)   | No (deferred)            | Yes, in GPU kernel
Systematics (WeightSys) | No (deferred)            | Yes (Code0 + Code4p interp)
Conditional flows       | No                       | p(x∣α): α as context vector
Selection / weights     | Limited                  | Selection (ns-root expr); observed frequency weights via ROOT data.weight or Parquet _weight (finite, ≥0, sum(w) > 0)
Binned fits             | No                       | Full HistFactory + GPU
Arbitrary PDFs          | Built-in library only    | Train flow → ONNX → drop-in
Template morphing       | No                       | DCR surrogate (neural)
Batch toy fits          | Not described            | Lockstep L-BFGS-B, 1 kernel launch
Device-resident toys    | Not described            | Toy sampling → batch fit without H2D obs_flat (H2D only toy_offsets)
Validation              | Speed benchmarks         | GPU↔CPU parity tests (CUDA/Metal), CLI contract tests, toy-fit guardrails
Multi-channel           | Not described            | Sum NLL/grad across channels

6. Architectural Justification: Why ONNX over Symbolic JIT

6.1 The Extensibility Problem

MoreFit: each new PDF family → formula → symbolic derivatives → correctness test → integration into codegen. For Crystal Ball this is ~200 lines, for Double Crystal Ball — ~400. For arbitrary multi-dimensional correlation structures — impossible.

NextStat: train a normalizing flow on Monte Carlo data → torch.onnx.export() → file. No new GPU code needed. The flow automatically captures correlations of any dimensionality.

6.2 The Systematics Problem in Unbinned Fits

In real analyses, PDFs depend on nuisance parameters α: p(x | θ, α). MoreFit has no mechanism for morphing PDFs along α.

NextStat: conditional normalizing flow accepts α as a context vector. A single flow is trained on samples at different α values. At fit time, the flow computes log p(x | θ, α) for arbitrary α values — smooth morphing across the entire systematics space without discrete templates.

6.3 The High-Dimensionality Problem

Analytical PDFs scale poorly beyond 3D: no closed forms exist for complex correlation structures. Normalizing flows operate in arbitrary dimension D and automatically model correlations through training. For analyses with multiple observables (m_inv, cos θ, φ, Δη, …) this is a critical advantage.


7. NextStat R&D: Work in Progress

7.1 Device-Resident Toy Pipeline

Current implementation (Phase 1, 1D): sampling on GPU → toy events remain in device memory → batch fit without an H2D copy of the main event buffer. Contract: toy offsets are computed on host and uploaded as a small buffer ((n_toys+1) × 4 bytes); the large obs_flat stays device-resident, so the offsets are the only host→device transfer. API: fit_unbinned_toys_batch_cuda_device().

Note: the current GPU toy sampler is deliberately gated to analytic 1D PDFs (Gaussian/Exponential) to avoid silent “midpoint sampling” fallbacks and preserve correctness guarantees.

7.2 Multi-Channel Batch Fits

Models with multiple channels (categories): NLL = Σ_ch NLL_ch. Each channel has its own set of processes but shared parameters. Implemented via CudaUnbinnedBatchAccelerator per channel, with NLL/grad summation on host. Gaussian constraints are attached to a single channel to avoid double-counting.
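A host-side sketch of that summation (hypothetical helper; it mirrors the per-channel accumulation described above):

// Sum per-channel NLL and gradients over a shared parameter vector.
fn multi_channel_nll_grad(
    channels: &[&dyn Fn(&[f64]) -> (f64, Vec<f64>)], // per-channel NLL+grad
    theta: &[f64],
) -> (f64, Vec<f64>) {
    let mut nll = 0.0;
    let mut grad = vec![0.0; theta.len()];
    for eval in channels {
        let (n, g) = eval(theta);
        nll += n;
        for (gi, gj) in grad.iter_mut().zip(g) { *gi += gj; }
    }
    // Gaussian constraints must be added by exactly one channel (or here),
    // never once per channel, to avoid double-counting.
    (nll, grad)
}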

7.3 Metal Backend (Apple Silicon)

Fully parallel implementation in Metal Shading Language for M1/M2/M3. Same lockstep L-BFGS-B, same analytical gradients, f32 (Metal lacks native f64). The accuracy contract is fixed by parity tests: Metal (f32, atomic accumulation) is compared against CPU with relaxed tolerances on the order of 1e−2 abs and ~1e−6 rel (see crates/ns-compute/tests/unbinned_gpu_parity.rs).

7.4 Reverse-mode AD on GPU (Research)

The current ns-ad::Tape runs on CPU. Exploring a GPU version for HistFactory gradient kernels — potentially replacing finite differences for profile likelihood and achieving O(1) gradient cost in the number of systematics instead of O(P).

7.5 Hybrid Analytical + Flow in a Single Kernel

For models where the signal is a flow PDF and the background is parametric (Chebyshev, Exponential), a planned hybrid kernel would provide analytical gradients for the background components and finite differences only for the signal flow. This reduces the number of flow evaluations per iteration from 2P+1 to 2P_flow+1, where P_flow ≪ P (see the sketch below).
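A host-side sketch of the bookkeeping (hypothetical signatures; the planned kernel fuses this on the GPU):

// Hybrid gradient: exact derivatives for the parametric block, central
// differences only over the parameters the flow depends on.
fn hybrid_grad(
    analytic: &dyn Fn(&[f64]) -> (f64, Vec<f64>), // NLL + exact grad; assumed
                                                  // zero at flow-only slots
    flow_nll: &dyn Fn(&[f64]) -> f64,             // black-box flow part
    theta: &[f64],
    flow_idx: &[usize], // which parameters the flow depends on
    h: f64,
) -> (f64, Vec<f64>) {
    let (nll_a, mut grad) = analytic(theta);
    let nll_f = flow_nll(theta);
    let mut tp = theta.to_vec();
    for &j in flow_idx {
        // 2·P_flow extra flow evaluations instead of 2·P.
        let orig = tp[j];
        tp[j] = orig + h; let up = flow_nll(&tp);
        tp[j] = orig - h; let dn = flow_nll(&tp);
        tp[j] = orig;
        grad[j] += (up - dn) / (2.0 * h);
    }
    (nll_a + nll_f, grad)
}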


8. Performance Analysis: Theoretical Complexity

Notation:
  N = number of events, P = number of parameters, K = optimizer iterations
  C_pdf = cost of one log p(x|θ)
  C_flow = cost of one flow inference (ONNX)

MoreFit (parametric PDF, symbolic gradient):
  T_total = K_sym × N × P × C_pdf
  K_sym ≈ 2–3 (exact gradient → fast L-BFGS convergence)

NextStat Tier 2 (parametric PDF, analytical CUDA gradient):
  T_total = K_ana × N × P × C_pdf
  K_ana ≈ 2–5 (analytical gradient, but no automated CSE/hoisting)

NextStat Tier 3 (flow PDF, finite differences):
  T_total = K_fd × (2P+1) × N × C_flow
  K_fd ≈ 10–30 (less precise gradient → more iterations)

NextStat hybrid (P_param parametric + P_flow flow components):
  T_total = K_hyb × [N × P_param × C_pdf + (2P_flow+1) × N × C_flow]
  K_hyb ≈ 5–15
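Plugging in illustrative numbers (not a benchmark): N = 10⁵, P = 10 with P_flow = 2, and C_flow ≈ 100 × C_pdf:

  Pure FD (Tier 3):  20 × 21 × 10⁵ × 100·C_pdf ≈ 4.2×10⁹ C_pdf
  Hybrid:            10 × (10⁵ × 8 × C_pdf + 5 × 10⁵ × 100·C_pdf) ≈ 5.1×10⁸ C_pdf

i.e. on this model the hybrid kernel buys roughly 8×, almost entirely by cutting flow evaluations.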

Conclusion: for purely parametric unbinned fits without systematics, MoreFit and NextStat Tier 2 deliver comparable performance. MoreFit gains an edge through automated graph optimization (CSE, hoisting). But as soon as the model includes rate modifiers (NormSys/WeightSys), neural PDFs (flow/ONNX), or requires a reproducible toy pipeline with batch/lockstep fitting, a compiler-symbolic approach stops being universal, while NextStat remains applicable through its hybrid architecture.


9. Conclusion

MoreFit proves a fundamental thesis: analytical derivatives on GPU yield enormous speedups for simple unbinned fits. Symbolic differentiation + JIT compilation is an elegant and powerful approach. For pure parametric models without systematics, MoreFit achieves ~10× speedup over RooFit CUDA.

But production physics analyses at LHCb, Belle II, ATLAS, CMS require:

  • Systematics. In unbinned: at least yield rate modifiers (NormSys/WeightSys). In binned/HistFactory: shape/templating systematics (ShapeSys/HistoSys, etc.).
  • Deterministic dataset definition: selection and observable expressions. Observed per-event weights are supported as non-negative frequency weights via data.weight (ROOT TTrees) or an optional _weight Parquet column; they must be finite, non-negative, and not all zero (sum(w) > 0).
  • Arbitrary signal PDFs — analytical forms do not always exist.
  • Binned fits — for channel combinations and legacy analyses.
  • Validation — GPU↔CPU parity for NLL/∇NLL, CLI contract tests, and toy-based bias/pull/coverage checks.

NextStat's hybrid architecture (analytical CUDA kernels for parametric PDFs + ONNX flows for arbitrary PDFs + reverse-mode AD for binned models) covers all these requirements. The cost is a partial speed reduction for flow components (finite differences instead of analytical). But for production analyses, flexibility and completeness outweigh peak benchmark speed.

Trade-off:
  MoreFit   = max speed × min functionality
  NextStat  = (0.3–1.0)× MoreFit speed × full functionality

  For production analyses: NextStat covers a task class beyond
  “closed-form analytical PDFs without systematics” — at the cost
  of slower flow components (finite differences).

References

  • C. Langenbruch, "MoreFit: A More Optimised, Rapid and Efficient Fit," arXiv:2505.12414 (2025)
  • NextStat GPU kernels: crates/ns-compute/kernels/unbinned_nll_grad.cu (3,150 lines, 6 PDFs + rate modifiers + analytical gradients)
  • NextStat GPU Flow Session: crates/ns-inference/src/gpu_flow_session.rs (CPU/CUDA EP flow + GPU NLL reduction)
  • NextStat Reverse-mode AD: crates/ns-ad/src/tape.rs (computation graph + backward pass)
  • NextStat Batch Toy Fits: crates/ns-inference/src/unbinned_gpu_batch.rs (lockstep L-BFGS-B)
  • RooFit CUDA backend: L. Moneta et al., ROOT v6.32+ GPU acceleration