NextStat Server
Self-Hosted GPU Inference API
A standalone HTTP server that exposes NextStat's statistical inference engine over a JSON REST API. Deploy on a GPU node and share it across your entire lab — no per-user CUDA setup, no Python environment headaches.
Architecture
┌─────────────┐    HTTP/JSON     ┌──────────────────────┐
│   Python    │ ◄──────────────► │   nextstat-server    │
│ thin client │    /v1/fit       │   (axum + tokio)     │
│   (httpx)   │    /v1/ranking   │                      │
└─────────────┘    /v1/batch/*   │  ┌─────────────────┐ │
                   /v1/models    │  │   Model Cache   │ │
Jupyter / CI /     /v1/health    │  │ (LRU, SHA-256)  │ │
Airflow / curl                   │  └─────────────────┘ │
                                 │  ┌─────────────────┐ │
                                 │  │ GPU Mutex Queue │ │
                                 │  │  CUDA / Metal   │ │
                                 │  └─────────────────┘ │
                                 └──────────────────────┘

Quick Start
# Build the server (CPU-only)
cargo build --release -p ns-server
# Start on port 3742 (default)
./target/release/nextstat-server
# With CUDA GPU
cargo build --release -p ns-server --features cuda
./target/release/nextstat-server --gpu cuda --port 8080

Python Client
The nextstat.remote module is a pure-Python thin client. It requires only httpx — no Rust, no CUDA, no compiled extensions.
pip install httpx
import nextstat.remote as remote
client = remote.connect("http://gpu-server:3742")
# Single fit
result = client.fit(workspace_json)
print(f"μ̂ = {result.bestfit[0]:.4f} ± {result.uncertainties[0]:.4f}")
# Model cache — upload once, fit many times without re-parsing
model_id = client.upload_model(workspace_json, name="my-analysis")
result = client.fit(model_id=model_id) # ~4x faster
# Batch fit — multiple workspaces in one request
batch = client.batch_fit([ws1, ws2, ws3])
for r in batch.results:
    print(r.nll if r else "failed")
# Batch toys — GPU-accelerated pseudo-experiments
toys = client.batch_toys(workspace_json, n_toys=10_000, seed=42)
print(f"{toys.n_converged}/{toys.n_toys} converged in {toys.wall_time_s:.1f}s")
# Ranking
ranking = client.ranking(workspace_json)
for e in ranking.entries:
    print(f" {e.name}: Δμ = {e.delta_mu_up:+.4f} / {e.delta_mu_down:+.4f}")

API Reference
POST /v1/fit
Maximum-likelihood fit. Auto-detects pyhf and HS3 workspace formats. Pass model_id instead of workspace to use a cached model.
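For example, the endpoint can be called directly with httpx (a minimal sketch; the gpu-server host and the ws dict are placeholders, not part of the API):
import httpx
ws = {...}  # a pyhf or HS3 workspace, already loaded as a dict
resp = httpx.post("http://gpu-server:3742/v1/fit", json={"workspace": ws, "gpu": True})
resp.raise_for_status()
fit = resp.json()
print(fit["bestfit"][fit["poi_index"]], fit["device"], fit["wall_time_s"])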
# Request
{
"workspace": { ... }, // pyhf or HS3 (or omit if model_id given)
"model_id": "abc...", // optional, from POST /v1/models
"gpu": true // optional, default true
}
# Response
{
"parameter_names": ["mu", "bkg_norm"],
"poi_index": 0,
"bestfit": [1.17, -0.03],
"uncertainties": [1.00, 0.97],
"nll": 6.908,
"twice_nll": 13.816,
"converged": true,
"n_iter": 4,
"n_fev": 6,
"n_gev": 10,
"covariance": [1.00, -0.66, -0.66, 0.95],
"device": "cuda",
"wall_time_s": 0.002
}

POST /v1/ranking
Nuisance-parameter impact ranking, sorted by |Δμ| descending. Supports model_id. Metal GPU does not yet support ranking — the server returns HTTP 400 with a descriptive error. Use CUDA or CPU for ranking.
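On a Metal-only server you can catch the 400 and retry on the CPU path; a sketch with httpx, reusing the ws placeholder from the /v1/fit example above (the retry policy is illustrative, and it assumes "gpu": false selects the CPU backend):
import httpx
base = "http://gpu-server:3742"
resp = httpx.post(f"{base}/v1/ranking", json={"workspace": ws, "gpu": True})
if resp.status_code == 400:  # e.g. Metal backend: ranking not supported yet
    resp = httpx.post(f"{base}/v1/ranking", json={"workspace": ws, "gpu": False})
resp.raise_for_status()
for e in resp.json()["entries"]:
    print(e["name"], e["delta_mu_up"], e["delta_mu_down"])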
# Request
{
"workspace": { ... }, // or "model_id": "abc..."
"gpu": true
}
# Response
{
"entries": [
{
"name": "bkg_norm",
"delta_mu_up": -0.71,
"delta_mu_down": 0.68,
"pull": -0.026,
"constraint": 0.975
}
],
"device": "cuda",
"wall_time_s": 0.001
}

POST /v1/batch/fit
Fit up to 100 workspaces in a single request.
# Request
{ "workspaces": [{ ... }, { ... }], "gpu": true }
# Response
{
"results": [
{ "index": 0, "bestfit": [...], "nll": 6.9, "converged": true, ... },
{ "index": 1, "error": "parse error: ..." }
],
"device": "cpu",
"wall_time_s": 0.005
}

POST /v1/batch/toys
GPU-accelerated batch toy fitting (CUDA, Metal, or multi-threaded CPU via Rayon).
# Request
{
"workspace": { ... },
"params": [1.0, 0.0], // optional, defaults to model init
"n_toys": 1000, // default 1000, max 100000
"seed": 42,
"gpu": true
}
# Response
{
"n_toys": 1000,
"n_converged": 998,
"results": [{ "bestfit": [...], "nll": 7.1, "converged": true, "n_iter": 12 }, ...],
"device": "cuda",
"wall_time_s": 0.8
}

POST /v1/models
Upload a workspace to the model cache. Returns a SHA-256 model_id for use in fit/ranking.
# Request
{ "workspace": { ... }, "name": "my-analysis" }
# Response
{ "model_id": "1fb0d639...", "n_params": 250, "n_channels": 5, "cached": true }

GET /v1/models
List all cached models with metadata.
DELETE /v1/models/:id
Evict a model from the cache.
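A small cache-maintenance sketch with httpx (the host and the ws workspace dict are placeholders; the listing is printed as returned, since its payload is not spelled out above):
import httpx
base = "http://gpu-server:3742"
up = httpx.post(f"{base}/v1/models", json={"workspace": ws, "name": "my-analysis"}).json()
print(httpx.get(f"{base}/v1/models").json())        # list cached models
httpx.delete(f"{base}/v1/models/{up['model_id']}")  # evict it again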
GET /v1/health
{
"status": "ok",
"version": "0.9.0",
"uptime_s": 3600.5,
"device": "cuda",
"inflight": 2,
"total_requests": 1547,
"cached_models": 3
}

Server Options
| Flag | Default | Description |
|---|---|---|
| --port | 3742 | Listen port |
| --host | 0.0.0.0 | Bind address |
| --gpu | none | "cuda" or "metal" (CPU if omitted) |
| --threads | 0 (auto) | CPU thread count for non-GPU workloads |
GPU Serialisation
The server accepts concurrent HTTP connections but serialises GPU access through a tokio::sync::Mutex. Only one fit or ranking runs on the GPU at a time; others queue. Request handling itself is non-blocking: the server keeps accepting new requests while a GPU job is running.
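In practice, clients can fire requests concurrently and let the server do the queueing. A sketch with the thin client and a thread pool (the pool size is arbitrary, workspace_json is the same variable as in the Python client example, and sharing one client instance across threads is an assumption here):
from concurrent.futures import ThreadPoolExecutor
import nextstat.remote as remote

client = remote.connect("http://gpu-server:3742")
model_id = client.upload_model(workspace_json, name="queue-demo")

# All five requests are accepted immediately; the GPU runs them one at a time.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(lambda _: client.fit(model_id=model_id), range(5)))
print([r.bestfit[0] for r in results])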
Tool Runtime (Agent Surface)
The server mirrors nextstat.tools over HTTP, so agents can bootstrap tool definitions and execute tools without importing the Python package.
GET /v1/tools/schema
Returns OpenAI-compatible tool definitions. Envelope: schema_version = "nextstat.tool_schema.v1".
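A bootstrap sketch with httpx that checks the envelope before handing the definitions to an agent framework (only schema_version is documented above; the rest of the payload is passed through untouched):
import httpx
env = httpx.get("http://gpu-server:3742/v1/tools/schema").json()
assert env["schema_version"] == "nextstat.tool_schema.v1"
# the remaining payload carries the OpenAI-compatible tool definitions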
POST /v1/tools/execute
// Request
{
"name": "nextstat_fit",
"arguments": {
"workspace_json": "{...}",
"execution": { "deterministic": true }
}
}
// Response (always tool envelope)
{
"schema_version": "nextstat.tool_result.v1",
"ok": true,
"result": { ... },
"error": null,
"meta": { "tool_name": "nextstat_fit", "nextstat_version": "..." }
}

Determinism Notes
ns_compute::EvalMode is process-wide. To avoid races, the server serialises inference requests behind a global compute lock. Per-request execution.eval_mode is safe (no cross-request bleed), but total throughput is lower (one inference request at a time).
GPU policy: if execution.deterministic=true (default), tools run on CPU. If execution.deterministic=false and the server was started with --gpu cuda|metal, some tools may use the GPU.
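For example, on a server started with --gpu cuda, a request that opts out of determinism looks like this (sketch with httpx; whether a given tool actually moves to the GPU is decided server-side):
import httpx
out = httpx.post(
    "http://gpu-server:3742/v1/tools/execute",
    json={
        "name": "nextstat_fit",
        "arguments": {
            "workspace_json": "{...}",
            "execution": {"deterministic": False},  # allow GPU use if available
        },
    },
).json()
assert out["schema_version"] == "nextstat.tool_result.v1"
print(out["ok"], out["meta"]["tool_name"])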
Python Client (Server Transport)
from nextstat.tools import get_toolkit, execute_tool
server_url = "http://127.0.0.1:3742"
tools = get_toolkit(transport="server", server_url=server_url)
out = execute_tool(
"nextstat_fit",
{"workspace_json": "...", "execution": {"deterministic": True}},
transport="server",
server_url=server_url,
)
# server_url also via NEXTSTAT_SERVER_URL env var
# fallback_to_local=False to disable local fallback

Security / Input Policy
Server mode does not expose file-ingest tools (like reading ROOT files from arbitrary paths) via /v1/tools/*. If you need ROOT ingest for a demo agent, do it client-side and send derived data to the server.
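A minimal client-side sketch of that pattern: derive the counts locally (however you obtain them; the numbers below are made up), wrap them in a small pyhf-style workspace, and send only that to the server:
import httpx

# Derived locally, e.g. from a ROOT file read on the client -- values are illustrative.
signal = [5.0, 8.0]
background = [50.0, 52.0]
bkg_uncert = [4.0, 4.5]
observed = [53, 61]

workspace = {
    "channels": [{
        "name": "SR",
        "samples": [
            {"name": "signal", "data": signal,
             "modifiers": [{"name": "mu", "type": "normfactor", "data": None}]},
            {"name": "background", "data": background,
             "modifiers": [{"name": "bkg_uncert", "type": "shapesys", "data": bkg_uncert}]},
        ],
    }],
    "observations": [{"name": "SR", "data": observed}],
    "measurements": [{"name": "meas", "config": {"poi": "mu", "parameters": []}}],
    "version": "1.0.0",
}

fit = httpx.post("http://gpu-server:3742/v1/fit", json={"workspace": workspace}).json()
print(fit["bestfit"][fit["poi_index"]], "+/-", fit["uncertainties"][fit["poi_index"]])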
Deployment
Docker
# CPU build
docker build -t nextstat-server -f crates/ns-server/Dockerfile .
docker run -p 3742:3742 nextstat-server
# CUDA build
docker build --build-arg FEATURES=cuda -t nextstat-server:cuda -f crates/ns-server/Dockerfile .
docker run -p 3742:3742 --gpus all nextstat-server:cuda --gpu cuda

Helm (Kubernetes)
helm install nextstat-server crates/ns-server/helm/nextstat-server \
--set server.gpu=cuda \
--set gpu.enabled=true \
--set image.tag=0.9.0

Systemd
[Unit]
Description=NextStat GPU inference server
After=network.target
[Service]
ExecStart=/usr/local/bin/nextstat-server --gpu cuda --port 3742
Restart=always
Environment=RUST_LOG=info
[Install]
WantedBy=multi-user.target