GPU Acceleration

NextStat supports GPU-accelerated toy generation and batch NLL computation via CUDA (NVIDIA) and Metal (Apple Silicon). The GPU backend is selected at runtime through the CLI or Rust API.

Backends

| Backend | Precision    | Platform      | Requirement         |
|---------|--------------|---------------|---------------------|
| CUDA    | f64 (double) | NVIDIA GPUs   | nvcc + CUDA toolkit |
| Metal   | f32 (float)  | Apple Silicon | macOS (built-in)    |
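
The same choice is exposed to Rust callers as a runtime argument rather than a compile-time switch. A purely illustrative sketch, in which the GpuBackend enum and pick_backend helper are hypothetical stand-ins rather than the actual ns-inference API:

// Hypothetical sketch only: `GpuBackend` and `pick_backend` are illustrative
// names, not the actual ns-inference API; they show that the backend choice
// is a runtime value, so one binary built with both features can serve either.
#[derive(Debug)]
enum GpuBackend {
    Cpu,
    Cuda,  // f64; needs the `cuda` feature and an NVIDIA driver at runtime
    Metal, // f32; needs the `metal` feature on Apple Silicon
}

fn pick_backend(requested: Option<&str>) -> GpuBackend {
    match requested {
        Some("cuda") => GpuBackend::Cuda,
        Some("metal") => GpuBackend::Metal,
        _ => GpuBackend::Cpu, // default: CPU path (Rayon + SIMD)
    }
}

fn main() {
    println!("{:?}", pick_backend(std::env::args().nth(1).as_deref()));
}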

Building with GPU Support

# CUDA (requires nvcc in PATH)
cargo build --workspace --features cuda

# Metal (Apple Silicon, macOS)
cargo build --workspace --features metal

CLI Usage

# NVIDIA GPU (f64)
nextstat hypotest-toys --input workspace.json \
  --mu 1.0 --n-toys 10000 --gpu cuda

# Apple Silicon GPU (f32)
nextstat hypotest-toys --input workspace.json \
  --mu 1.0 --n-toys 10000 --gpu metal

GPU-Resident Toy Pipeline

The --gpu-sample-toys flag keeps sampled events resident on the GPU, eliminating the device-to-host and host-to-device round-trip of the large obs_flat buffer between the sampler and the batch fitter.

nextstat hypotest-toys --input workspace.json \
  --mu 1.0 --n-toys 10000 --gpu cuda --gpu-sample-toys

Unbinned GPU WeightSys

The weightsys rate modifier is now lowered to CUDA/Metal kernels with code0 and code4p interpolation support. In the unbinned model YAML spec:

modifiers:
  - type: weightsys
    param: alpha_jet_jes
    lo: 0.95
    hi: 1.05
    interp_code: code4p  # optional, defaults to code0
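
Both interpolation codes follow the usual HistFactory/pyhf conventions: code0 is piecewise-linear in the nuisance parameter α (α = ±1 reproduces hi/lo exactly), while code4p keeps the same asymptotes but smooths the |α| < 1 region with a polynomial. A minimal, illustrative sketch of the code0 weight factor (not the GPU kernel itself):

// Illustrative sketch of the code0 piecewise-linear weight factor for a
// weightsys modifier (standard HistFactory/pyhf convention, not the actual
// CUDA/Metal kernel). code4p keeps the same behaviour for |alpha| >= 1 but
// replaces the inner region with a smooth polynomial; it is omitted here.
fn weightsys_code0(alpha: f64, lo: f64, hi: f64) -> f64 {
    if alpha >= 0.0 {
        1.0 + alpha * (hi - 1.0) // alpha = +1  ->  hi
    } else {
        1.0 + alpha * (1.0 - lo) // alpha = -1  ->  lo
    }
}

fn main() {
    // With lo = 0.95 and hi = 1.05, as in the YAML snippet above:
    assert!((weightsys_code0(1.0, 0.95, 1.05) - 1.05).abs() < 1e-12);
    assert!((weightsys_code0(-1.0, 0.95, 1.05) - 0.95).abs() < 1e-12);
    println!("w(alpha = 0.5) = {}", weightsys_code0(0.5, 0.95, 1.05));
}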

GPU Flow Evaluation

Flow PDFs can leverage GPU acceleration for NLL reduction. Two paths are supported:

  • Path 1: CPU flow + GPU NLL — The flow evaluates log p(x|θ) on CPU (ONNX Runtime), the result is uploaded to GPU where a dedicated CUDA kernel performs extended unbinned likelihood reduction. GpuFlowSession manages the pipeline automatically.
  • Path 2: CUDA EP + I/O binding — With --features neural-cuda, ONNX Runtime uses the CUDA Execution Provider. The flow forward pass runs on GPU, log-prob stays device-resident. NLL reduction reads directly from GPU memory — zero host↔device copies.

Gradients are computed via central finite differences (ε = 1e-4), costing 2·n_params + 1 NLL evaluations per optimizer iteration.
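
A minimal sketch of that scheme; the nll closure below is a stand-in for the GPU-reduced NLL, and the call count is exactly 1 (value) + 2·n_params (gradient):

/// Central finite-difference gradient: one NLL call for the value plus two
/// per parameter, i.e. 2 * n_params + 1 calls in total. The closure stands
/// in for the (GPU-reduced) NLL; eps matches the documented 1e-4.
fn nll_and_grad<F: Fn(&[f64]) -> f64>(nll: F, theta: &[f64], eps: f64) -> (f64, Vec<f64>) {
    let value = nll(theta); // 1 call at theta
    let mut grad = vec![0.0; theta.len()];
    let mut probe = theta.to_vec();
    for i in 0..theta.len() {
        probe[i] = theta[i] + eps;
        let up = nll(probe.as_slice()); // +eps call
        probe[i] = theta[i] - eps;
        let down = nll(probe.as_slice()); // -eps call
        probe[i] = theta[i];
        grad[i] = (up - down) / (2.0 * eps);
    }
    (value, grad)
}

fn main() {
    // Quadratic toy NLL with known gradient 2 * theta.
    let (v, g) = nll_and_grad(|t: &[f64]| t.iter().map(|x| x * x).sum::<f64>(), &[1.0, -2.0], 1e-4);
    println!("nll = {v}, grad = {g:?}"); // grad ~ [2.0, -4.0]
}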

| Scenario                            | Recommendation              |
|-------------------------------------|-----------------------------|
| N < 10k events, few params          | CPU only (flow + NLL)       |
| N > 50k events, multi-process model | CPU flow + GPU NLL (Path 1) |
| N > 100k events, NVIDIA GPU         | CUDA EP + GPU NLL (Path 2)  |
| Batch toys (1000+)                  | GPU NLL with batch kernel   |

CPU Acceleration

Even without a GPU, NextStat leverages multiple CPU acceleration strategies:

  • SIMD auto-vectorization — Compiler-generated vector instructions for batch operations
  • Rayon parallelism — Work-stealing thread pool for toy generation and parameter scans (the pattern is sketched after this list)
  • Apple Accelerate — vDSP and vForce for vectorized math on macOS (log, exp, etc.)
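
The Rayon piece is the standard work-stealing parallel map over toy indices; a self-contained sketch of that pattern (not NextStat's internal code), assuming the rayon crate is available:

use rayon::prelude::*;

// Sketch of the work-stealing pattern used for toy ensembles (not NextStat's
// internals): each toy index maps to an independent fit, and Rayon keeps all
// cores busy even when individual fits converge at different speeds.
fn fit_one_toy(toy_index: usize) -> f64 {
    // Stand-in for "sample a toy dataset and minimise its NLL".
    (toy_index as f64).sin().abs()
}

fn main() {
    let best_fit_nlls: Vec<f64> = (0..10_000usize)
        .into_par_iter()  // parallel iterator over toy indices
        .map(fit_one_toy) // one independent fit per toy
        .collect();
    println!("fitted {} toys", best_fit_nlls.len());
}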

Differentiable Layer

Both CUDA and Metal support the differentiable NLL layer for PyTorch integration:

| Feature              | CUDA                            | Metal                         |
|----------------------|---------------------------------|-------------------------------|
| Signal upload        | Zero-copy via raw pointer       | CPU → GPU (f64→f32)           |
| Gradient return      | Zero-copy or Vec<f64>           | Vec<f64> (f32→f64)            |
| Profiled q₀/qμ       | GPU L-BFGS-B + envelope theorem | Same algorithm, f32 precision |
| Multi-channel signal | Supported                       | Supported                     |
| PyTorch integration  | Direct (same CUDA context)      | Via CPU tensor bridge         |

Batch Toy Fitting

Both CUDA and Metal support GPU-accelerated batch toy fitting for CLs hypothesis testing:

| Entry Point                                        | Description                                                        |
|----------------------------------------------------|--------------------------------------------------------------------|
| fit_toys_batch_gpu / fit_toys_batch_metal          | High-level: generate toys from model params                        |
| fit_toys_from_data_gpu / fit_toys_from_data_metal  | Low-level: custom expected data, init, bounds                      |
| hypotest_qtilde_toys_gpu                           | Full CLs workflow: Phase A (CPU baseline) + Phase B (GPU ensemble) |

Architecture: Phase A performs three baseline CPU fits (the unconditional free fit, the conditional fit at μ = μ_test, and the conditional fit at μ = 0); Phase B then dispatches batch toy ensemble generation to the selected GPU backend. The CLs arithmetic on the resulting toy ensembles is sketched after the CLI examples below.

# CUDA toy-based CLs
nextstat hypotest-toys --input workspace.json \
  --mu 1.0 --n-toys 10000 --gpu cuda

# Metal toy-based CLs
nextstat hypotest-toys --input workspace.json \
  --mu 1.0 --n-toys 10000 --gpu metal
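
For the toy-based CLs value itself, assuming Phase B yields one fitted test-statistic ensemble under μ = μ_test and one under μ = 0, the final step is just two tail fractions; a minimal sketch (not the NextStat API):

// Sketch of the final CLs computation from toy ensembles (not the NextStat
// API): CLs = CLs+b / CLb, where each term is the fraction of toys whose
// test statistic is at least as extreme as the observed one.
fn tail_fraction(toys: &[f64], q_obs: f64) -> f64 {
    toys.iter().filter(|&&q| q >= q_obs).count() as f64 / toys.len() as f64
}

fn cls(q_obs: f64, toys_mu: &[f64], toys_bkg: &[f64]) -> f64 {
    let cl_sb = tail_fraction(toys_mu, q_obs);  // P(q >= q_obs | mu = mu_test)
    let cl_b = tail_fraction(toys_bkg, q_obs);  // P(q >= q_obs | mu = 0)
    if cl_b == 0.0 { f64::NAN } else { cl_sb / cl_b }
}

fn main() {
    // Tiny fake ensembles, just to show the arithmetic: (1/5) / (4/5) = 0.25.
    let toys_mu = [0.0, 0.1, 0.3, 0.9, 1.5];
    let toys_bkg = [0.5, 1.2, 2.0, 3.1, 4.0];
    println!("CLs = {}", cls(1.0, &toys_mu, &toys_bkg));
}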

Performance Benchmarks

CUDA: GEX44 · RTX 4000 SFF Ada (20 GB) · CUDA 12.0 · AMD EPYC 8 cores

Metal: Apple M5 · macOS · unified memory

Build: --release --features cuda|metal

Single-Operation Latency

| Operation                    | CPU      | CUDA        | Winner    |
|------------------------------|----------|-------------|-----------|
| MLE fit (8 params)           | 2.3 ms   | 136.3 ms    | CPU 59×   |
| MLE fit (184 params)         | 520.8 ms | 1,272.0 ms  | CPU 2.4×  |
| Profile scan (184p, 21pt)    | 8.4 s    | 7.9 s       | GPU 1.07× |
| Diff NLL + grad (8 params)   | n/a      | 0.12 ms     | GPU-only  |
| Diff NLL + grad (184 params) | n/a      | 3.66 ms     | GPU-only  |
| Profiled q₀ (8 params)       | n/a      | 3.0 ms      | GPU-only  |
| NN training loop             | n/a      | 2.4 ms/step | GPU-only  |

Batch Toys — Large Model (tHu, 184 params)

GPU lockstep amortizes kernel overhead → sub-linear scaling

| Toys  | CUDA f64 (RTX 4000) GPU | CPU     | Speedup | Metal f32 (M5) GPU | CPU     | Speedup |
|-------|-------------------------|---------|---------|--------------------|---------|---------|
| 100   | 20.2 s                  | 37.9 s  | 1.8×    | 10.7 s             | 29.8 s  | 2.8×    |
| 500   | 63.4 s                  | 383.7 s | 6.0×    | 29.1 s             | 175.5 s | 6.0×    |
| 1,000 | 119.9 s                 | 771.4 s | 6.4×    | 56.8 s             | 359.1 s | 6.3×    |

Cross-Platform Summary (1,000 toys, 184 params)

| Backend                     | GPU     | CPU     | Speedup |
|-----------------------------|---------|---------|---------|
| CUDA f64 (RTX 4000 SFF Ada) | 119.9 s | 771.4 s | 6.4×    |
| Metal f32 (Apple M5)        | 56.8 s  | 359.1 s | 6.3×    |

Batch Toys — Small Model (complex, 8 params)

Kernel launch overhead dominates → CPU wins on both platforms

| Toys  | GPU (CUDA) | CPU (8 cores) | Speedup |
|-------|------------|---------------|---------|
| 100   | 726 ms     | 18 ms         | CPU 40× |
| 500   | 1,169 ms   | 23 ms         | CPU 51× |
| 1,000 | 1,838 ms   | 40 ms         | CPU 46× |
| 5,000 | 7,412 ms   | 146 ms        | CPU 51× |

Key Findings

  • Convergent speedup factor ~6.3× — Both CUDA (f64) and Metal (f32) converge to the same GPU/CPU ratio despite different precision, architecture, and absolute performance. This is a fundamental property of the lockstep L-BFGS-B batch architecture.
  • GPU batch scaling is sub-linear — 10× toys → 5.9× time. Lockstep execution amortizes kernel launch overhead across the batch.
  • CPU scaling is super-linear on large models — 10× toys → 20.3× time for 184 params due to L3 cache pressure.
  • Crossover at ~100 parameters — GPU wins for models with ~100+ parameters. Below that, CPU Rayon parallelism dominates due to negligible per-toy overhead.

Recommendation: Use --gpu cuda or --gpu metal for batch toy workloads on models with 100+ parameters. Use CPU (default) for small models and single-model fits.

Metal Limitations

Current Metal Status

  • Batch toy fitting — fully supported (f32 precision, ~1.27e-6 NLL parity vs CPU f64).
  • Differentiable NLL + gradient — fully supported via CPU tensor bridge.
  • Ranking — not yet supported. The server and CLI return a clear error. Use CPU for ranking on Apple Silicon.

Known Issues (Fixed)

  • Batch toys memcpy_dtoh panic — cudarc 0.19 requires dst.len() >= src.len() for device-to-host copies. When toys converge and n_active < max_batch, the host buffer was too small. Fix: allocate host buffers at max_batch size and truncate (see the sketch after this list).
  • ProfiledDifferentiableSession convergence — L-BFGS-B tolerance 1e-6 too tight for projected gradient near parameter bounds. Fix: tolerance 1e-5 + NLL stability criterion.
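
The shape of the memcpy_dtoh fix, sketched generically (no cudarc calls shown): keep the host buffer sized for the worst case so it can never be smaller than the device source, then shrink it to the active count afterwards.

// Generic sketch of the buffer-sizing fix (the dtoh copy is replaced by a
// closure): allocate the host destination at max_batch so it is never smaller
// than the device source, then truncate to the number of still-active toys.
fn collect_active_results(
    copy_device_to_host: impl Fn(&mut [f64]), // stand-in for the dtoh copy of the full device buffer
    max_batch: usize,
    n_active: usize,
) -> Vec<f64> {
    let mut host = vec![0.0f64; max_batch]; // worst-case size: dst.len() >= src.len() always holds
    copy_device_to_host(&mut host);
    host.truncate(n_active); // keep only the toys that are still active
    host
}

fn main() {
    // Fake "device copy" that fills the whole max_batch-sized buffer.
    let results = collect_active_results(|dst| dst.fill(1.0), 1024, 37);
    assert_eq!(results.len(), 37);
}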

Validation

# Single-model fit + gradient parity (CUDA)
cargo test -p ns-inference --features cuda -- --nocapture

# Metal batch tests
cargo test -p ns-inference --features metal -- --nocapture

# Python GPU parity
pytest tests/python/test_gpu_parity.py -v

Rust integration tests in crates/ns-inference/src/gpu_single.rs:

  • test_gpu_nll_matches_cpu
  • test_gpu_grad_matches_cpu
  • test_gpu_fit_matches_cpu
  • test_gpu_session_reuse
  • test_gpu_complex_workspace
  • test_gpu_nll_and_grad_at_multiple_points

Tolerance source of truth: tests/python/_tolerances.py (Python) · crates/ns-inference/src/gpu_single.rs (Rust)