GPU Acceleration
NextStat supports GPU-accelerated toy generation and batch NLL computation via CUDA (NVIDIA) and Metal (Apple Silicon). The GPU backend is selected at runtime through the CLI or Rust API.
Backends
| Backend | Precision | Platform | Requirement |
|---|---|---|---|
| CUDA | f64 (double) | NVIDIA GPUs | nvcc + CUDA toolkit |
| Metal | f32 (float) | Apple Silicon | macOS (built-in) |
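The same choice is exposed programmatically as well as via the CLI. The sketch below only illustrates how a `--gpu` flag value might map onto a backend choice; the `GpuBackend` enum and `parse_backend` helper are hypothetical stand-ins, not NextStat's actual Rust API.

```rust
/// Hypothetical illustration of mapping the CLI's --gpu flag onto a backend
/// choice; these names are NOT NextStat's real API.
#[derive(Clone, Copy, Debug, PartialEq)]
enum GpuBackend {
    Cuda,  // f64, NVIDIA (built with --features cuda)
    Metal, // f32, Apple Silicon (built with --features metal)
    Cpu,   // default when no --gpu flag is given
}

fn parse_backend(flag: Option<&str>) -> GpuBackend {
    match flag {
        Some("cuda") => GpuBackend::Cuda,
        Some("metal") => GpuBackend::Metal,
        _ => GpuBackend::Cpu,
    }
}
```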
Building with GPU Support
```bash
# CUDA (requires nvcc in PATH)
cargo build --workspace --features cuda

# Metal (Apple Silicon, macOS)
cargo build --workspace --features metal
```
CLI Usage
```bash
# NVIDIA GPU (f64)
nextstat hypotest-toys --input workspace.json \
    --mu 1.0 --n-toys 10000 --gpu cuda

# Apple Silicon GPU (f32)
nextstat hypotest-toys --input workspace.json \
    --mu 1.0 --n-toys 10000 --gpu metal
```
GPU-Resident Toy Pipeline
The `--gpu-sample-toys` flag keeps sampled events resident on the GPU, eliminating the D2H + H2D round-trip of the large `obs_flat` buffer between the sampler and the batch fitter.
```bash
nextstat hypotest-toys --input workspace.json \
    --mu 1.0 --n-toys 10000 --gpu cuda --gpu-sample-toys
```
Unbinned GPU WeightSys
The `weightsys` rate modifier is now lowered to CUDA/Metal kernels, with `code0` and `code4p` interpolation support; a sketch of the `code0` convention follows.
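For orientation, `code0` conventionally denotes piecewise-linear interpolation of the weight factor between `lo` at α = −1, 1 at α = 0, and `hi` at α = +1. The sketch below shows only that common convention as an assumption; it is not taken from NextStat's kernels, and the smoother `code4p` variant is not shown.

```rust
/// Piecewise-linear (code0-style) interpolation of a multiplicative weight
/// factor: lo at alpha = -1, 1 at alpha = 0, hi at alpha = +1.
/// Sketch of the standard convention; NextStat's kernels may differ in detail.
fn weightsys_code0(alpha: f64, lo: f64, hi: f64) -> f64 {
    if alpha >= 0.0 {
        1.0 + alpha * (hi - 1.0)
    } else {
        1.0 + alpha * (1.0 - lo)
    }
}
```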
In the unbinned model YAML spec:

```yaml
modifiers:
  - type: weightsys
    param: alpha_jet_jes
    lo: 0.95
    hi: 1.05
    interp_code: code4p   # optional, defaults to code0
```

GPU Flow Evaluation
Flow PDFs can leverage GPU acceleration for NLL reduction. Two paths are supported:
- Path 1: CPU flow + GPU NLL — The flow evaluates `log p(x|θ)` on CPU (ONNX Runtime); the result is uploaded to the GPU, where a dedicated CUDA kernel performs the extended unbinned likelihood reduction (a minimal sketch of this reduction follows the list). `GpuFlowSession` manages the pipeline automatically.
- Path 2: CUDA EP + I/O binding — With `--features neural-cuda`, ONNX Runtime uses the CUDA Execution Provider. The flow forward pass runs on the GPU and the log-prob stays device-resident; the NLL reduction reads directly from GPU memory with zero host↔device copies.
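Path 1 hands the GPU a pure reduction job. For reference, here is a CPU-side sketch of the extended unbinned NLL such a reduction computes, in its single-process form with the constant ln N! term dropped; the actual kernel also handles the multi-process sum and batched parameter points.

```rust
/// Extended unbinned NLL for a single process with expected yield `nu`,
/// given per-event log-probabilities:
///   NLL = nu - sum_i log(nu * p(x_i)) = nu - N*log(nu) - sum_i log p(x_i)
/// (constant terms dropped). CPU illustration of the GPU reduction in Path 1.
fn extended_unbinned_nll(log_p: &[f64], nu: f64) -> f64 {
    let n = log_p.len() as f64;
    let sum_logp: f64 = log_p.iter().sum();
    nu - n * nu.ln() - sum_logp
}
```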
Gradients are computed via central finite differences (ε = 1e-4): 2·n_params + 1 NLL calls per iteration.
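A minimal sketch of that scheme, with a generic closure standing in for the (GPU-evaluated) NLL; NextStat's internal implementation may differ in details such as step scaling near bounds.

```rust
/// Central finite-difference gradient: one call at theta (reused as the
/// objective value) plus two per parameter, i.e. 2*n_params + 1 NLL calls.
fn fd_gradient<F: Fn(&[f64]) -> f64>(nll: F, theta: &[f64], eps: f64) -> (f64, Vec<f64>) {
    let center = nll(theta); // the "+1" call
    let mut grad = vec![0.0; theta.len()];
    let mut probe = theta.to_vec();
    for i in 0..theta.len() {
        probe[i] = theta[i] + eps;
        let up = nll(&probe);
        probe[i] = theta[i] - eps;
        let down = nll(&probe);
        probe[i] = theta[i]; // restore before the next parameter
        grad[i] = (up - down) / (2.0 * eps);
    }
    (center, grad)
}
```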
| Scenario | Recommendation |
|---|---|
| N < 10k events, few params | CPU only (flow + NLL) |
| N > 50k events, multi-process model | CPU flow + GPU NLL (Path 1) |
| N > 100k events, NVIDIA GPU | CUDA EP + GPU NLL (Path 2) |
| Batch toys (1000+) | GPU NLL with batch kernel |
CPU Acceleration
Even without a GPU, NextStat leverages multiple CPU acceleration strategies:
- SIMD auto-vectorization — Compiler-generated vector instructions for batch operations
- Rayon parallelism — Work-stealing thread pool for toy generation and parameter scans (see the sketch after this list)
- Apple Accelerate — vDSP and vForce for vectorized math on macOS (log, exp, etc.)
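The Rayon bullet above describes the pattern sketched below: each toy fit is independent, so the ensemble maps directly onto a parallel iterator. The `fit_one_toy` closure is a hypothetical stand-in for a single-toy fit, not a NextStat function.

```rust
use rayon::prelude::*;

/// Sketch of the CPU toy-loop pattern on Rayon's work-stealing pool.
fn fit_toys_cpu(n_toys: u64, fit_one_toy: impl Fn(u64) -> f64 + Sync) -> Vec<f64> {
    (0..n_toys)
        .into_par_iter()               // parallel iterator over toy seeds
        .map(|seed| fit_one_toy(seed)) // each fit runs on whichever worker is free
        .collect()
}
```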
Differentiable Layer
Both CUDA and Metal support the differentiable NLL layer for PyTorch integration:
| Feature | CUDA | Metal |
|---|---|---|
| Signal upload | Zero-copy via raw pointer | CPU → GPU (f64→f32) |
| Gradient return | Zero-copy or Vec<f64> | Vec<f64> (f32→f64) |
| Profiled q₀/qμ | GPU L-BFGS-B + envelope theorem | Same algorithm, f32 precision |
| Multi-channel signal | Supported | Supported |
| PyTorch integration | Direct (same CUDA context) | Via CPU tensor bridge |
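For context on the envelope-theorem row above: once the nuisance parameters are profiled, the gradient of the profiled test statistic with respect to an external input s (such as the signal template) needs only the partial derivatives of the NLL at the two optima, because the indirect terms through the fitted parameters vanish at the minima. A sketch of the standard result, not necessarily the exact form coded in NextStat:

$$
q_0(s) = 2\Big[\operatorname{NLL}\big(0,\hat{\hat\theta}(s);\,s\big) - \operatorname{NLL}\big(\hat\mu(s),\hat\theta(s);\,s\big)\Big],
\qquad
\frac{\partial q_0}{\partial s} = 2\left[\frac{\partial \operatorname{NLL}}{\partial s}\bigg|_{0,\hat{\hat\theta}} - \frac{\partial \operatorname{NLL}}{\partial s}\bigg|_{\hat\mu,\hat\theta}\right],
$$

since $\partial\operatorname{NLL}/\partial\theta = 0$ (and $\partial\operatorname{NLL}/\partial\mu = 0$ for the free fit) at the respective minima.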
Batch Toy Fitting
Both CUDA and Metal support GPU-accelerated batch toy fitting for CLs hypothesis testing:
| Entry Point | Description |
|---|---|
| `fit_toys_batch_gpu` / `fit_toys_batch_metal` | High-level: generate toys from model params |
| `fit_toys_from_data_gpu` / `fit_toys_from_data_metal` | Low-level: custom expected data, init, bounds |
| `hypotest_qtilde_toys_gpu` | Full CLs workflow: Phase A (CPU baseline) + Phase B (GPU ensemble) |
Architecture: Phase A performs 3 baseline CPU fits (free, conditional at μ_test, conditional at μ=0), then Phase B dispatches to the appropriate GPU backend for batch toy ensemble generation.
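Phase A supplies the observed test statistic from the baseline fits; Phase B supplies the toy ensembles generated under μ = μ_test and μ = 0. The sketch below shows how a toy-based CLs value is assembled from those pieces; it is illustrative only, and NextStat's handling of ties and empty tails may differ.

```rust
/// CLs from toy ensembles: tail fractions of the s+b and b-only toy
/// distributions at or above the observed test statistic.
fn cls_from_toys(q_obs: f64, q_sb_toys: &[f64], q_b_toys: &[f64]) -> f64 {
    let tail = |toys: &[f64]| {
        toys.iter().filter(|&&q| q >= q_obs).count() as f64 / toys.len() as f64
    };
    let cl_sb = tail(q_sb_toys); // P(q >= q_obs | mu = mu_test)
    let cl_b = tail(q_b_toys);   // P(q >= q_obs | mu = 0)
    if cl_b > 0.0 { cl_sb / cl_b } else { f64::NAN }
}
```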
```bash
# CUDA toy-based CLs
nextstat hypotest-toys --input workspace.json \
    --mu 1.0 --n-toys 10000 --gpu cuda

# Metal toy-based CLs
nextstat hypotest-toys --input workspace.json \
    --mu 1.0 --n-toys 10000 --gpu metal
```
Performance Benchmarks
- CUDA: GEX44 · RTX 4000 SFF Ada (20 GB) · CUDA 12.0 · AMD EPYC 8 cores
- Metal: Apple M5 · macOS · unified memory
- Build: `--release --features cuda|metal`
Single-Operation Latency
| Operation | CPU | CUDA | Winner |
|---|---|---|---|
| MLE fit (8 params) | 2.3 ms | 136.3 ms | CPU 59× |
| MLE fit (184 params) | 520.8 ms | 1,272.0 ms | CPU 2.4× |
| Profile scan (184p, 21pt) | 8.4 s | 7.9 s | GPU 1.07× |
| Diff NLL + grad (8 params) | — | 0.12 ms | GPU-only |
| Diff NLL + grad (184 params) | — | 3.66 ms | GPU-only |
| Profiled q₀ (8 params) | — | 3.0 ms | GPU-only |
| NN training loop | — | 2.4 ms/step | GPU-only |
Batch Toys — Large Model (tHu, 184 params)
GPU lockstep amortizes kernel overhead → sub-linear scaling
| Toys | CUDA f64 GPU (RTX 4000) | CPU | Speedup | Metal f32 GPU (M5) | CPU | Speedup |
|---|---|---|---|---|---|---|
| 100 | 20.2 s | 37.9 s | 1.8× | 10.7 s | 29.8 s | 2.8× |
| 500 | 63.4 s | 383.7 s | 6.0× | 29.1 s | 175.5 s | 6.0× |
| 1,000 | 119.9 s | 771.4 s | 6.4× | 56.8 s | 359.1 s | 6.3× |
Cross-Platform Summary (1,000 toys, 184 params)
| Backend | GPU | CPU | Speedup |
|---|---|---|---|
| CUDA f64 (RTX 4000 SFF Ada) | 119.9 s | 771.4 s | 6.4× |
| Metal f32 (Apple M5) | 56.8 s | 359.1 s | 6.3× |
Batch Toys — Small Model (complex, 8 params)
Kernel launch overhead dominates → CPU wins on both platforms
| Toys | GPU (CUDA) | CPU (8 cores) | Speedup |
|---|---|---|---|
| 100 | 726 ms | 18 ms | CPU 40× |
| 500 | 1,169 ms | 23 ms | CPU 51× |
| 1,000 | 1,838 ms | 40 ms | CPU 46× |
| 5,000 | 7,412 ms | 146 ms | CPU 51× |
Key Findings
- Convergent speedup factor ~6.3× — Both CUDA (f64) and Metal (f32) converge to the same GPU/CPU ratio despite different precision, architecture, and absolute performance. This is a fundamental property of the lockstep L-BFGS-B batch architecture.
- GPU batch scaling is sub-linear — 10× toys → 5.9× time. Lockstep execution amortizes kernel launch overhead across the batch.
- CPU scaling is super-linear on large models — 10× toys → 20.3× time for 184 params due to L3 cache pressure.
- Crossover at ~100 parameters — GPU wins for models with ~100+ parameters. Below that, CPU Rayon parallelism dominates due to negligible per-toy overhead.
Recommendation: Use `--gpu cuda` or `--gpu metal` for batch toy workloads on models with 100+ parameters. Use the CPU backend (default) for small models and single-model fits.
Metal Limitations
Current Metal Status
- Batch toy fitting — fully supported (f32 precision, ~1.27e-6 NLL parity vs CPU f64).
- Differentiable NLL + gradient — fully supported via CPU tensor bridge.
- Ranking — not yet supported. The server and CLI return a clear error; use CPU for ranking on Apple Silicon.
Known Issues (Fixed)
- Batch toys `memcpy_dtoh` panic — cudarc 0.19 requires `dst.len() >= src.len()` for device-to-host copies. When toys converge and `n_active < max_batch`, the host buffer was too small. Fix: allocate host buffers at `max_batch` size and truncate (see the sketch after this list).
- `ProfiledDifferentiableSession` convergence — the L-BFGS-B tolerance of 1e-6 was too tight for the projected gradient near parameter bounds. Fix: tolerance 1e-5 plus an NLL stability criterion.
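The shape of the first fix, as a sketch: the host destination is always allocated for the full batch and truncated after the copy. The `copy_dtoh` closure is a placeholder for the actual cudarc device-to-host call, which is not reproduced here.

```rust
/// Allocate the host buffer at max_batch so dst.len() >= src.len() always
/// holds for the device-to-host copy, then keep only the active toys.
fn download_active(
    n_active: usize,
    max_batch: usize,
    n_params: usize,
    copy_dtoh: impl FnOnce(&mut [f64]), // stands in for the real cudarc D2H copy
) -> Vec<f64> {
    let mut host = vec![0.0f64; max_batch * n_params]; // sized like the device buffer
    copy_dtoh(&mut host);                              // copy into the full-size buffer
    host.truncate(n_active * n_params);                // trim to the active toys
    host
}
```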
Validation
```bash
# Single-model fit + gradient parity (CUDA)
cargo test -p ns-inference --features cuda -- --nocapture

# Metal batch tests
cargo test -p ns-inference --features metal -- --nocapture

# Python GPU parity
pytest tests/python/test_gpu_parity.py -v
```
Rust integration tests in `crates/ns-inference/src/gpu_single.rs`:
- `test_gpu_nll_matches_cpu`
- `test_gpu_grad_matches_cpu`
- `test_gpu_fit_matches_cpu`
- `test_gpu_session_reuse`
- `test_gpu_complex_workspace`
- `test_gpu_nll_and_grad_at_multiple_points`

Tolerance source of truth: `tests/python/_tolerances.py` (Python) · `crates/ns-inference/src/gpu_single.rs` (Rust)
