
NextStat Server

Self-Hosted GPU Inference API

A standalone HTTP server that exposes NextStat's statistical inference engine over a JSON REST API. Deploy on a GPU node and share it across your entire lab — no per-user CUDA setup, no Python environment headaches.

Architecture

┌─────────────┐    HTTP/JSON     ┌───────────────────────┐
│ Python      │ ◄──────────────► │  nextstat-server      │
│ thin client │    /v1/fit       │  (axum + tokio)       │
│ (httpx)     │    /v1/ranking   │                       │
└─────────────┘    /v1/batch/*   │  ┌─────────────────┐  │
                   /v1/models    │  │ Model Cache     │  │
  Jupyter / CI /   /v1/health    │  │ (LRU, SHA-256)  │  │
  Airflow / curl                 │  └─────────────────┘  │
                                 │  ┌─────────────────┐  │
                                 │  │ GPU Mutex Queue │  │
                                 │  │ CUDA / Metal    │  │
                                 │  └─────────────────┘  │
                                 └───────────────────────┘

Quick Start

# Build the server (CPU-only)
cargo build --release -p ns-server

# Start on port 3742 (default)
./target/release/nextstat-server

# With CUDA GPU
cargo build --release -p ns-server --features cuda
./target/release/nextstat-server --gpu cuda --port 8080

Python Client

The nextstat.remote module is a pure-Python thin client. It requires only httpx — no Rust, no CUDA, no compiled extensions.

pip install httpx

import nextstat.remote as remote

client = remote.connect("http://gpu-server:3742")

# Single fit
result = client.fit(workspace_json)
print(f"μ̂ = {result.bestfit[0]:.4f} ± {result.uncertainties[0]:.4f}")

# Model cache — upload once, fit many times without re-parsing
model_id = client.upload_model(workspace_json, name="my-analysis")
result = client.fit(model_id=model_id)  # ~4x faster

# Batch fit — multiple workspaces in one request
batch = client.batch_fit([ws1, ws2, ws3])
for r in batch.results:
    print(r.nll if r else "failed")

# Batch toys — GPU-accelerated pseudo-experiments
toys = client.batch_toys(workspace_json, n_toys=10_000, seed=42)
print(f"{toys.n_converged}/{toys.n_toys} converged in {toys.wall_time_s:.1f}s")

# Ranking
ranking = client.ranking(workspace_json)
for e in ranking.entries:
    print(f"  {e.name}: Δμ = {e.delta_mu_up:+.4f} / {e.delta_mu_down:+.4f}")

API Reference

POST /v1/fit

Maximum-likelihood fit. Auto-detects pyhf and HS3 workspace formats. Pass model_id instead of workspace to use a cached model.

# Request
{
  "workspace": { ... },  // pyhf or HS3 (or omit if model_id given)
  "model_id": "abc...",  // optional, from POST /v1/models
  "gpu": true             // optional, default true
}

# Response
{
  "parameter_names": ["mu", "bkg_norm"],
  "poi_index": 0,
  "bestfit": [1.17, -0.03],
  "uncertainties": [1.00, 0.97],
  "nll": 6.908,
  "twice_nll": 13.816,
  "converged": true,
  "n_iter": 4,
  "n_fev": 6,
  "n_gev": 10,
  "covariance": [1.00, -0.66, -0.66, 0.95],
  "device": "cuda",
  "wall_time_s": 0.002
}
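
The endpoint can also be called without the thin client, from any HTTP library. A minimal sketch using httpx; the server URL and the workspace.json path are placeholders, and the fields read from the response follow the schema above:

import json
import httpx

# Placeholder: any pyhf or HS3 workspace on disk.
with open("workspace.json") as f:
    workspace = json.load(f)

resp = httpx.post(
    "http://gpu-server:3742/v1/fit",
    json={"workspace": workspace, "gpu": True},
    timeout=60.0,
)
resp.raise_for_status()
fit = resp.json()

poi = fit["poi_index"]
print(f"{fit['parameter_names'][poi]} = {fit['bestfit'][poi]:.4f} "
      f"± {fit['uncertainties'][poi]:.4f}  ({fit['device']}, {fit['wall_time_s']:.3f}s)")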

POST /v1/ranking

Nuisance-parameter impact ranking, sorted by |Δμ| descending. Supports model_id. Metal GPU does not yet support ranking — the server returns HTTP 400 with a descriptive error. Use CUDA or CPU for ranking.

# Request
{
  "workspace": { ... },  // or "model_id": "abc..."
  "gpu": true
}

# Response
{
  "entries": [
    {
      "name": "bkg_norm",
      "delta_mu_up": -0.71,
      "delta_mu_down": 0.68,
      "pull": -0.026,
      "constraint": 0.975
    }
  ],
  "device": "cuda",
  "wall_time_s": 0.001
}
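
A raw-HTTP sketch for ranking, with an explicit check for the HTTP 400 that a Metal-only server returns. The URL and workspace path are placeholders; the entry fields follow the response above:

import json
import httpx

with open("workspace.json") as f:          # placeholder path
    workspace = json.load(f)

resp = httpx.post(
    "http://gpu-server:3742/v1/ranking",
    json={"workspace": workspace, "gpu": True},
    timeout=300.0,
)
if resp.status_code == 400:
    raise RuntimeError(f"ranking not available on this server: {resp.text}")
resp.raise_for_status()

for e in resp.json()["entries"][:10]:      # already sorted by |Δμ| descending
    print(f"{e['name']:>24s}  Δμ = {e['delta_mu_up']:+.4f} / {e['delta_mu_down']:+.4f}"
          f"  pull = {e['pull']:+.3f}  constraint = {e['constraint']:.3f}")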

POST /v1/batch/fit

Fit up to 100 workspaces in a single request.

# Request
{ "workspaces": [{ ... }, { ... }], "gpu": true }

# Response
{
  "results": [
    { "index": 0, "bestfit": [...], "nll": 6.9, "converged": true, ... },
    { "index": 1, "error": "parse error: ..." }
  ],
  "device": "cpu",
  "wall_time_s": 0.005
}
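
Each element of results carries either a fit payload or an error string, so callers should branch per entry. A sketch under that assumption, with placeholder URL and workspace paths:

import json
import httpx

workspaces = []
for path in ["ws1.json", "ws2.json", "ws3.json"]:   # placeholder paths
    with open(path) as f:
        workspaces.append(json.load(f))

resp = httpx.post(
    "http://gpu-server:3742/v1/batch/fit",
    json={"workspaces": workspaces, "gpu": True},
    timeout=600.0,
)
resp.raise_for_status()

for entry in resp.json()["results"]:
    if "error" in entry:
        print(f"[{entry['index']}] failed: {entry['error']}")
    else:
        print(f"[{entry['index']}] nll = {entry['nll']:.3f}  converged = {entry['converged']}")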

POST /v1/batch/toys

GPU-accelerated batch toy fitting (CUDA, Metal, or CPU Rayon).

# Request
{
  "workspace": { ... },
  "params": [1.0, 0.0],   // optional, defaults to model init
  "n_toys": 1000,          // default 1000, max 100000
  "seed": 42,
  "gpu": true
}

# Response
{
  "n_toys": 1000,
  "n_converged": 998,
  "results": [{ "bestfit": [...], "nll": 7.1, "converged": true, "n_iter": 12 }, ...],
  "device": "cuda",
  "wall_time_s": 0.8
}
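
A sketch that runs a toy batch and summarises the NLL distribution of the converged toys. The URL and workspace path are placeholders; the per-toy fields follow the response above:

import json
import statistics
import httpx

with open("workspace.json") as f:           # placeholder path
    workspace = json.load(f)

resp = httpx.post(
    "http://gpu-server:3742/v1/batch/toys",
    json={"workspace": workspace, "n_toys": 1000, "seed": 42, "gpu": True},
    timeout=600.0,
)
resp.raise_for_status()
toys = resp.json()

nlls = [t["nll"] for t in toys["results"] if t["converged"]]
print(f"{toys['n_converged']}/{toys['n_toys']} converged in {toys['wall_time_s']:.1f}s")
print(f"NLL of converged toys: mean = {statistics.mean(nlls):.3f}, "
      f"stdev = {statistics.stdev(nlls):.3f}")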

POST /v1/models

Upload a workspace to the model cache. Returns a SHA-256 model_id for use in fit/ranking.

# Request
{ "workspace": { ... }, "name": "my-analysis" }

# Response
{ "model_id": "1fb0d639...", "n_params": 250, "n_channels": 5, "cached": true }

GET /v1/models

List all cached models with metadata.

DELETE /v1/models/:id

Evict a model from the cache.
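
A short housekeeping sketch: list the cached models, then evict one by id. The URL is a placeholder, model_id stands for an id returned by POST /v1/models, and the listing is printed as-is since its exact field layout is not specified here:

import httpx

base = "http://gpu-server:3742"
model_id = "…"                               # placeholder: id from POST /v1/models

listing = httpx.get(f"{base}/v1/models")
listing.raise_for_status()
print(listing.json())                        # cached models and their metadata

# Evict a model that is no longer needed.
httpx.delete(f"{base}/v1/models/{model_id}").raise_for_status()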

GET /v1/health

{
  "status": "ok",
  "version": "0.9.0",
  "uptime_s": 3600.5,
  "device": "cuda",
  "inflight": 2,
  "total_requests": 1547,
  "cached_models": 3
}
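
A readiness check built on this endpoint: poll until the server reports status ok, then read a few fields. The URL, timeout, and poll interval are arbitrary placeholders:

import time
import httpx

def wait_for_server(base="http://gpu-server:3742", timeout_s=60.0, interval_s=2.0):
    """Poll GET /v1/health until status == "ok" or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            health = httpx.get(f"{base}/v1/health", timeout=5.0).json()
            if health.get("status") == "ok":
                return health
        except httpx.HTTPError:
            pass                              # server not reachable yet
        time.sleep(interval_s)
    raise TimeoutError("nextstat-server did not become healthy in time")

info = wait_for_server()
print(info["version"], info["device"], info["cached_models"])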

Server Options

Flag        Default    Description
--port      3742       Listen port
--host      0.0.0.0    Bind address
--gpu       none       "cuda" or "metal" (CPU if omitted)
--threads   0 (auto)   CPU thread count for non-GPU workloads

GPU Serialisation

The server accepts concurrent HTTP connections but serialises GPU access through a tokio::sync::Mutex. Only one fit or ranking runs on the GPU at a time; others queue. Request handling itself stays non-blocking: the server keeps accepting new requests while a GPU job is running.
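
The effect is visible from the client side: fits submitted concurrently are all accepted at once, and the GPU works through them one at a time. A sketch with a placeholder URL and workspace path:

import json
from concurrent.futures import ThreadPoolExecutor
import httpx

with open("workspace.json") as f:             # placeholder path
    workspace = json.load(f)

def submit_fit(i):
    resp = httpx.post(
        "http://gpu-server:3742/v1/fit",
        json={"workspace": workspace, "gpu": True},
        timeout=300.0,
    )
    resp.raise_for_status()
    return i, resp.json()

# All four connections are accepted immediately; the fits themselves run
# on the GPU one after another behind the server's mutex.
with ThreadPoolExecutor(max_workers=4) as pool:
    for i, fit in pool.map(submit_fit, range(4)):
        print(f"request {i}: nll = {fit['nll']:.3f} on {fit['device']}")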

Tool Runtime (Agent Surface)

The server mirrors nextstat.tools over HTTP, so agents can bootstrap tool definitions and execute tools without importing the Python package.

GET /v1/tools/schema

Returns OpenAI-compatible tool definitions. Envelope: schema_version = "nextstat.tool_schema.v1".

POST /v1/tools/execute

// Request
{
  "name": "nextstat_fit",
  "arguments": {
    "workspace_json": "{...}",
    "execution": { "deterministic": true }
  }
}

// Response (always tool envelope)
{
  "schema_version": "nextstat.tool_result.v1",
  "ok": true,
  "result": { ... },
  "error": null,
  "meta": { "tool_name": "nextstat_fit", "nextstat_version": "..." }
}
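
A raw-HTTP sketch of the agent bootstrap: fetch the tool definitions, execute one tool, and unwrap the envelope. The URL and workspace path are placeholders; the request and envelope fields follow the examples above:

from pathlib import Path
import httpx

base = "http://gpu-server:3742"
workspace_json = Path("workspace.json").read_text()   # placeholder path

# 1. Bootstrap: OpenAI-compatible tool definitions.
schema = httpx.get(f"{base}/v1/tools/schema").json()
print(schema["schema_version"])                       # "nextstat.tool_schema.v1"

# 2. Execute a tool; the response is always a tool_result envelope.
resp = httpx.post(
    f"{base}/v1/tools/execute",
    json={
        "name": "nextstat_fit",
        "arguments": {
            "workspace_json": workspace_json,
            "execution": {"deterministic": True},
        },
    },
    timeout=300.0,
)
envelope = resp.json()
if envelope["ok"]:
    print(envelope["result"])
else:
    print("tool error:", envelope["error"])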

Determinism Notes

ns_compute::EvalMode is process-wide. To avoid races, the server serialises inference requests behind a global compute lock. Per-request execution.eval_mode is safe (no cross-request bleed), but total throughput is lower (one inference request at a time).

GPU policy: if execution.deterministic=true (default), tools run on CPU. If execution.deterministic=false and the server started with --gpu cuda|metal, some tools may use GPU.
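
In terms of the execution block, the policy looks like this. A sketch using the execute_tool helper from the next section, assuming the server was started with --gpu cuda and that workspace_json holds a workspace as a JSON string (placeholder path and URL):

from pathlib import Path
from nextstat.tools import execute_tool

server_url = "http://gpu-server:3742"                  # placeholder
workspace_json = Path("workspace.json").read_text()    # placeholder path

# Default policy: deterministic=True keeps the tool on CPU.
cpu_out = execute_tool(
    "nextstat_fit",
    {"workspace_json": workspace_json, "execution": {"deterministic": True}},
    transport="server",
    server_url=server_url,
)

# Opting out of determinism on a --gpu cuda|metal server lets some tools use the GPU.
gpu_out = execute_tool(
    "nextstat_fit",
    {"workspace_json": workspace_json, "execution": {"deterministic": False}},
    transport="server",
    server_url=server_url,
)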

Python Client (Server Transport)

from nextstat.tools import get_toolkit, execute_tool

server_url = "http://127.0.0.1:3742"
tools = get_toolkit(transport="server", server_url=server_url)

out = execute_tool(
    "nextstat_fit",
    {"workspace_json": "...", "execution": {"deterministic": True}},
    transport="server",
    server_url=server_url,
)
# server_url also via NEXTSTAT_SERVER_URL env var
# fallback_to_local=False to disable local fallback

Security / Input Policy

Server mode does not expose file-ingest tools (like reading ROOT files from arbitrary paths) via /v1/tools/*. If you need ROOT ingest for a demo agent, do it client-side and send derived data to the server.

Deployment

Docker

# CPU build
docker build -t nextstat-server -f crates/ns-server/Dockerfile .
docker run -p 3742:3742 nextstat-server

# CUDA build
docker build --build-arg FEATURES=cuda -t nextstat-server:cuda \
  -f crates/ns-server/Dockerfile .
docker run -p 3742:3742 --gpus all nextstat-server:cuda --gpu cuda

Helm (Kubernetes)

helm install nextstat-server crates/ns-server/helm/nextstat-server \
  --set server.gpu=cuda \
  --set gpu.enabled=true \
  --set image.tag=0.9.0

Systemd

[Unit]
Description=NextStat GPU inference server
After=network.target

[Service]
ExecStart=/usr/local/bin/nextstat-server --gpu cuda --port 3742
Restart=always
Environment=RUST_LOG=info

[Install]
WantedBy=multi-user.target