GPU Acceleration
NextStat supports GPU-accelerated toy generation and batch NLL computation via CUDA (NVIDIA) and Metal (Apple Silicon). The GPU backend is selected at runtime through the CLI or Rust API.
Backends
| Backend | Precision | Platform | Requirement |
|---|---|---|---|
| CUDA | f64 (double) | NVIDIA GPUs | nvcc + CUDA toolkit |
| Metal | f32 (float) | Apple Silicon | macOS (built-in) |
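The same choice is exposed programmatically as well as via the CLI. The sketch below only illustrates how a `--gpu` flag value might map onto a backend choice; the `GpuBackend` enum and `parse_backend` helper are hypothetical stand-ins, not NextStat's actual Rust API.

```rust
/// Hypothetical illustration of mapping the CLI's --gpu flag onto a backend
/// choice; these names are NOT NextStat's real API.
#[derive(Clone, Copy, Debug, PartialEq)]
enum GpuBackend {
    Cuda,  // f64, NVIDIA (built with --features cuda)
    Metal, // f32, Apple Silicon (built with --features metal)
    Cpu,   // default when no --gpu flag is given
}

fn parse_backend(flag: Option<&str>) -> GpuBackend {
    match flag {
        Some("cuda") => GpuBackend::Cuda,
        Some("metal") => GpuBackend::Metal,
        _ => GpuBackend::Cpu,
    }
}
```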
Building with GPU Support
```bash
# CUDA (requires nvcc in PATH)
cargo build --workspace --features cuda

# Metal (Apple Silicon, macOS)
cargo build --workspace --features metal
```
CLI Usage
```bash
# NVIDIA GPU (f64)
nextstat hypotest-toys --input workspace.json \
    --mu 1.0 --n-toys 10000 --gpu cuda

# Apple Silicon GPU (f32)
nextstat hypotest-toys --input workspace.json \
    --mu 1.0 --n-toys 10000 --gpu metal
```
GPU-Resident Toy Pipeline
The `--gpu-sample-toys` flag keeps sampled events resident on the GPU, eliminating the D2H + H2D round-trip of the large `obs_flat` buffer between the sampler and the batch fitter.
```bash
nextstat hypotest-toys --input workspace.json \
    --mu 1.0 --n-toys 10000 --gpu cuda --gpu-sample-toys
```
Unbinned GPU WeightSys
The `weightsys` rate modifier is now lowered to CUDA/Metal kernels, with `code0` and `code4p` interpolation support; a sketch of the `code0` convention follows.
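For orientation, `code0` conventionally denotes piecewise-linear interpolation of the weight factor between `lo` at α = −1, 1 at α = 0, and `hi` at α = +1. The sketch below shows only that common convention as an assumption; it is not taken from NextStat's kernels, and the smoother `code4p` variant is not shown.

```rust
/// Piecewise-linear (code0-style) interpolation of a multiplicative weight
/// factor: lo at alpha = -1, 1 at alpha = 0, hi at alpha = +1.
/// Sketch of the standard convention; NextStat's kernels may differ in detail.
fn weightsys_code0(alpha: f64, lo: f64, hi: f64) -> f64 {
    if alpha >= 0.0 {
        1.0 + alpha * (hi - 1.0)
    } else {
        1.0 + alpha * (1.0 - lo)
    }
}
```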
In the unbinned model YAML spec:

```yaml
modifiers:
  - type: weightsys
    param: alpha_jet_jes
    lo: 0.95
    hi: 1.05
    interp_code: code4p   # optional, defaults to code0
```

GPU Flow Evaluation
Flow PDFs can leverage GPU acceleration for NLL reduction. Two paths are supported:
- Path 1: CPU flow + GPU NLL — The flow evaluates `log p(x|θ)` on CPU (ONNX Runtime); the result is uploaded to the GPU, where a dedicated CUDA kernel performs the extended unbinned likelihood reduction (a minimal sketch of this reduction follows the list). `GpuFlowSession` manages the pipeline automatically.
- Path 2: CUDA EP + I/O binding — With `--features neural-cuda`, ONNX Runtime uses the CUDA Execution Provider. The flow forward pass runs on the GPU and the log-prob stays device-resident; the NLL reduction reads directly from GPU memory with zero host↔device copies.
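Path 1 hands the GPU a pure reduction job. For reference, here is a CPU-side sketch of the extended unbinned NLL such a reduction computes, in its single-process form with the constant ln N! term dropped; the actual kernel also handles the multi-process sum and batched parameter points.

```rust
/// Extended unbinned NLL for a single process with expected yield `nu`,
/// given per-event log-probabilities:
///   NLL = nu - sum_i log(nu * p(x_i)) = nu - N*log(nu) - sum_i log p(x_i)
/// (constant terms dropped). CPU illustration of the GPU reduction in Path 1.
fn extended_unbinned_nll(log_p: &[f64], nu: f64) -> f64 {
    let n = log_p.len() as f64;
    let sum_logp: f64 = log_p.iter().sum();
    nu - n * nu.ln() - sum_logp
}
```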
Gradients are computed via central finite differences (ε = 1e-4): 2·n_params + 1 NLL calls per iteration.
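A minimal sketch of that scheme, with a generic closure standing in for the (GPU-evaluated) NLL; NextStat's internal implementation may differ in details such as step scaling near bounds.

```rust
/// Central finite-difference gradient: one call at theta (reused as the
/// objective value) plus two per parameter, i.e. 2*n_params + 1 NLL calls.
fn fd_gradient<F: Fn(&[f64]) -> f64>(nll: F, theta: &[f64], eps: f64) -> (f64, Vec<f64>) {
    let center = nll(theta); // the "+1" call
    let mut grad = vec![0.0; theta.len()];
    let mut probe = theta.to_vec();
    for i in 0..theta.len() {
        probe[i] = theta[i] + eps;
        let up = nll(&probe);
        probe[i] = theta[i] - eps;
        let down = nll(&probe);
        probe[i] = theta[i]; // restore before the next parameter
        grad[i] = (up - down) / (2.0 * eps);
    }
    (center, grad)
}
```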
| Scenario | Recommendation |
|---|---|
| N < 10k events, few params | CPU only (flow + NLL) |
| N > 50k events, multi-process model | CPU flow + GPU NLL (Path 1) |
| N > 100k events, NVIDIA GPU | CUDA EP + GPU NLL (Path 2) |
| Batch toys (1000+) | GPU NLL with batch kernel |
CPU Acceleration
Even without a GPU, NextStat leverages multiple CPU acceleration strategies:
- SIMD auto-vectorization — Compiler-generated vector instructions for batch operations
- Rayon parallelism — Work-stealing thread pool for toy generation and parameter scans (see the sketch after this list)
- Apple Accelerate — vDSP and vForce for vectorized math on macOS (log, exp, etc.)
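The Rayon bullet above describes the pattern sketched below: each toy fit is independent, so the ensemble maps directly onto a parallel iterator. The `fit_one_toy` closure is a hypothetical stand-in for a single-toy fit, not a NextStat function.

```rust
use rayon::prelude::*;

/// Sketch of the CPU toy-loop pattern on Rayon's work-stealing pool.
fn fit_toys_cpu(n_toys: u64, fit_one_toy: impl Fn(u64) -> f64 + Sync) -> Vec<f64> {
    (0..n_toys)
        .into_par_iter()               // parallel iterator over toy seeds
        .map(|seed| fit_one_toy(seed)) // each fit runs on whichever worker is free
        .collect()
}
```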
Differentiable Layer
Both CUDA and Metal support the differentiable NLL layer for PyTorch integration:
| Feature | CUDA | Metal |
|---|---|---|
| Signal upload | Zero-copy via raw pointer | CPU → GPU (f64→f32) |
| Gradient return | Zero-copy or Vec<f64> | Vec<f64> (f32→f64) |
| Profiled q₀/qμ | GPU L-BFGS-B + envelope theorem | Same algorithm, f32 precision |
| Multi-channel signal | Supported | Supported |
| PyTorch integration | Direct (same CUDA context) | Via CPU tensor bridge |
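For context on the envelope-theorem row above: once the nuisance parameters are profiled, the gradient of the profiled test statistic with respect to an external input s (such as the signal template) needs only the partial derivatives of the NLL at the two optima, because the indirect terms through the fitted parameters vanish at the minima. A sketch of the standard result, not necessarily the exact form coded in NextStat:

$$
q_0(s) = 2\Big[\operatorname{NLL}\big(0,\hat{\hat\theta}(s);\,s\big) - \operatorname{NLL}\big(\hat\mu(s),\hat\theta(s);\,s\big)\Big],
\qquad
\frac{\partial q_0}{\partial s} = 2\left[\frac{\partial \operatorname{NLL}}{\partial s}\bigg|_{0,\hat{\hat\theta}} - \frac{\partial \operatorname{NLL}}{\partial s}\bigg|_{\hat\mu,\hat\theta}\right],
$$

since $\partial\operatorname{NLL}/\partial\theta = 0$ (and $\partial\operatorname{NLL}/\partial\mu = 0$ for the free fit) at the respective minima.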
Batch Toy Fitting
Both CUDA and Metal support GPU-accelerated batch toy fitting for CLs hypothesis testing:
| Entry Point | Description |
|---|---|
| `fit_toys_batch_gpu` / `fit_toys_batch_metal` | High-level: generate toys from model params |
| `fit_toys_from_data_gpu` / `fit_toys_from_data_metal` | Low-level: custom expected data, init, bounds |
| `hypotest_qtilde_toys_gpu` | Full CLs workflow: Phase A (CPU baseline) + Phase B (GPU ensemble) |
Architecture: Phase A performs 3 baseline CPU fits (free, conditional at μ_test, conditional at μ=0), then Phase B dispatches to the appropriate GPU backend for batch toy ensemble generation.
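Phase A supplies the observed test statistic from the baseline fits; Phase B supplies the toy ensembles generated under μ = μ_test and μ = 0. The sketch below shows how a toy-based CLs value is assembled from those pieces; it is illustrative only, and NextStat's handling of ties and empty tails may differ.

```rust
/// CLs from toy ensembles: tail fractions of the s+b and b-only toy
/// distributions at or above the observed test statistic.
fn cls_from_toys(q_obs: f64, q_sb_toys: &[f64], q_b_toys: &[f64]) -> f64 {
    let tail = |toys: &[f64]| {
        toys.iter().filter(|&&q| q >= q_obs).count() as f64 / toys.len() as f64
    };
    let cl_sb = tail(q_sb_toys); // P(q >= q_obs | mu = mu_test)
    let cl_b = tail(q_b_toys);   // P(q >= q_obs | mu = 0)
    if cl_b > 0.0 { cl_sb / cl_b } else { f64::NAN }
}
```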
```bash
# CUDA toy-based CLs
nextstat hypotest-toys --input workspace.json \
    --mu 1.0 --n-toys 10000 --gpu cuda

# Metal toy-based CLs
nextstat hypotest-toys --input workspace.json \
    --mu 1.0 --n-toys 10000 --gpu metal
```
Performance Benchmarks
- CUDA: GEX44 · RTX 4000 SFF Ada (20 GB) · CUDA 12.0 · AMD EPYC 8 cores
- Metal: Apple M5 · macOS · unified memory
- Build: `--release --features cuda|metal`
Single-Operation Latency
| Operation | CPU | CUDA | Winner |
|---|---|---|---|
| MLE fit (8 params) | 2.3 ms | 136.3 ms | CPU 59× |
| MLE fit (184 params) | 520.8 ms | 1,272.0 ms | CPU 2.4× |
| Profile scan (184p, 21pt) | 8.4 s | 7.9 s | GPU 1.07× |
| Diff NLL + grad (8 params) | — | 0.12 ms | GPU-only |
| Diff NLL + grad (184 params) | — | 3.66 ms | GPU-only |
| Profiled q₀ (8 params) | — | 3.0 ms | GPU-only |
| NN training loop | — | 2.4 ms/step | GPU-only |
Batch Toys — Large Model (tHu, 184 params)
GPU lockstep amortizes kernel overhead → sub-linear scaling
| Toys | CUDA f64 GPU (RTX 4000) | CPU | Speedup | Metal f32 GPU (M5) | CPU | Speedup |
|---|---|---|---|---|---|---|
| 100 | 20.2 s | 37.9 s | 1.8× | 10.7 s | 29.8 s | 2.8× |
| 500 | 63.4 s | 383.7 s | 6.0× | 29.1 s | 175.5 s | 6.0× |
| 1,000 | 119.9 s | 771.4 s | 6.4× | 56.8 s | 359.1 s | 6.3× |
Cross-Platform Summary (1,000 toys, 184 params)
| Backend | GPU | CPU | Speedup |
|---|---|---|---|
| CUDA f64 (RTX 4000 SFF Ada) | 119.9 s | 771.4 s | 6.4× |
| Metal f32 (Apple M5) | 56.8 s | 359.1 s | 6.3× |
Batch Toys — Small Model (complex, 8 params)
Kernel launch overhead dominates → CPU wins on both platforms
| Toys | GPU (CUDA) | CPU (8 cores) | Speedup |
|---|---|---|---|
| 100 | 726 ms | 18 ms | CPU 40× |
| 500 | 1,169 ms | 23 ms | CPU 51× |
| 1,000 | 1,838 ms | 40 ms | CPU 46× |
| 5,000 | 7,412 ms | 146 ms | CPU 51× |
Key Findings
- Convergent speedup factor ~6.3× — Both CUDA (f64) and Metal (f32) converge to the same GPU/CPU ratio despite different precision, architecture, and absolute performance. This is a fundamental property of the lockstep L-BFGS-B batch architecture.
- GPU batch scaling is sub-linear — 10× toys → 5.9× time. Lockstep execution amortizes kernel launch overhead across the batch.
- CPU scaling is super-linear on large models — 10× toys → 20.3× time for 184 params due to L3 cache pressure.
- Crossover at ~100 parameters — GPU wins for models with ~100+ parameters. Below that, CPU Rayon parallelism dominates due to negligible per-toy overhead.
Recommendation: Use `--gpu cuda` or `--gpu metal` for batch toy workloads on models with 100+ parameters. Use the CPU backend (default) for small models and single-model fits.
Metal Limitations
Current Metal Status
- Batch toy fitting — fully supported (f32 precision, ~1.27e-6 NLL parity vs CPU f64).
- Differentiable NLL + gradient — fully supported via CPU tensor bridge.
- Ranking — not yet supported. The server and CLI return a clear error; use CPU for ranking on Apple Silicon.
Known Issues (Fixed)
- Batch toys `memcpy_dtoh` panic — cudarc 0.19 requires `dst.len() >= src.len()` for device-to-host copies. When toys converge and `n_active < max_batch`, the host buffer was too small. Fix: allocate host buffers at `max_batch` size and truncate (see the sketch after this list).
- `ProfiledDifferentiableSession` convergence — the L-BFGS-B tolerance of 1e-6 was too tight for the projected gradient near parameter bounds. Fix: tolerance 1e-5 plus an NLL stability criterion.
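The shape of the first fix, as a sketch: the host destination is always allocated for the full batch and truncated after the copy. The `copy_dtoh` closure is a placeholder for the actual cudarc device-to-host call, which is not reproduced here.

```rust
/// Allocate the host buffer at max_batch so dst.len() >= src.len() always
/// holds for the device-to-host copy, then keep only the active toys.
fn download_active(
    n_active: usize,
    max_batch: usize,
    n_params: usize,
    copy_dtoh: impl FnOnce(&mut [f64]), // stands in for the real cudarc D2H copy
) -> Vec<f64> {
    let mut host = vec![0.0f64; max_batch * n_params]; // sized like the device buffer
    copy_dtoh(&mut host);                              // copy into the full-size buffer
    host.truncate(n_active * n_params);                // trim to the active toys
    host
}
```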
Validation
```bash
# Single-model fit + gradient parity (CUDA)
cargo test -p ns-inference --features cuda -- --nocapture

# Metal batch tests
cargo test -p ns-inference --features metal -- --nocapture

# Python GPU parity
pytest tests/python/test_gpu_parity.py -v
```
Rust integration tests in `crates/ns-inference/src/gpu_single.rs`:
- `test_gpu_nll_matches_cpu`
- `test_gpu_grad_matches_cpu`
- `test_gpu_fit_matches_cpu`
- `test_gpu_session_reuse`
- `test_gpu_complex_workspace`
- `test_gpu_nll_and_grad_at_multiple_points`

Tolerance source of truth: `tests/python/_tolerances.py` (Python) · `crates/ns-inference/src/gpu_single.rs` (Rust)
