NextStat

Public Benchmarks

NextStat's public benchmark program treats performance as a scientific claim — with protocols, pinned environments, correctness gates, and artifacts that anyone can rerun and audit.

The harness is open-source: github.com/NextStat/nextstat-public-benchmarks

What goes in docs vs blog

Docs (this site) are canonical: protocols, contracts, runbooks, and "how to rerun" instructions.

Blog posts are narrative: motivation, design rationale, interpretation of results, and "what this changes" framing.

Rule of thumb: if a reader needs to execute the benchmark, it belongs in docs. If a reader needs to understand why the benchmark program exists, it belongs in the blog.

Scope

We benchmark end-to-end user workflows, not isolated micro-kernels:

  • HEP / HistFactory — NLL evaluation, gradients, MLE fits, profile scans, toy ensembles
  • Pharma — PK/NLME likelihood evaluation + fitting loops
  • Bayesian — gradient-based samplers (NUTS) with ESS/sec and wall-time
  • ML infra — compilation vs execution time, differentiable pipeline throughput
  • Time Series — Kalman filter/smoother throughput, EM convergence cost
  • Econometrics — Panel FE, DiD, IV/2SLS scaling with cluster count

Trust Model

For every published snapshot you should be able to answer, from artifacts alone:

  1. What was measured? (definition of tasks + metrics)
  2. On what data? (dataset ID + hash + license)
  3. Under what environment? (OS, CPU/GPU, compiler, Python, dependency versions)
  4. From what code? (NextStat commit hash, dependency lockfiles, build flags)
  5. Does it still match reference? (sanity/parity checks before timing)
  6. How stable is the number? (repeat strategy, distributions, reporting)
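
Questions 2-4 are answered by the snapshot's baseline manifest. As a hedged illustration only, such a manifest might carry fields like the following; the names are assumptions chosen for readability, not the schemas published under manifests/schema/:

```python
# Hypothetical manifest shape (illustrative field names, not the published schema).
example_manifest = {
    "dataset": {                                  # Q2: on what data?
        "id": "hep/example-workspace",            # hypothetical dataset ID
        "sha256": "<content hash>",
        "license": "<dataset license>",
    },
    "environment": {                              # Q3: under what environment?
        "os": "<OS + version>",
        "cpu": "<CPU model>",
        "python": "<interpreter version>",
    },
    "code": {                                     # Q4: from what code?
        "nextstat_commit": "<git SHA>",
        "lockfiles": ["env/python/uv.lock", "env/rust/Cargo.lock"],
    },
}
```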

Reproducibility Contract

  • rust-toolchain.toml + Cargo.lock (Rust toolchain + dependencies)
  • Python version + dependency lock (uv / pip-tools / poetry)
  • GPU runtime details when used (CUDA version / Metal / driver)
  • Correctness gating before timing — the harness validates parity within an explicit tolerance (see the sketch after this list)
  • Raw per-test measurements published, not only summaries
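
The correctness gate can be pictured as a numeric parity check that must pass before any timing loop runs. This is a minimal sketch under assumed names; the task callable, reference value, and tolerance shown here are illustrative, not the harness's actual API:

```python
import math
import time

# Minimal sketch of a correctness-gated timing loop (assumed names, not the
# actual harness API): validate parity against a reference value within an
# explicit tolerance, and only then collect timings.
def assert_parity(candidate: float, reference: float, rel_tol: float) -> None:
    if not math.isclose(candidate, reference, rel_tol=rel_tol):
        raise RuntimeError(
            f"parity failed: {candidate} vs reference {reference} (rel_tol={rel_tol})"
        )

def run_timed(task, reference_value: float, rel_tol: float = 1e-8, repeats: int = 20):
    assert_parity(task(), reference_value, rel_tol)   # gate first...
    timings = []
    for _ in range(repeats):                          # ...then time
        t0 = time.perf_counter()
        task()
        timings.append(time.perf_counter() - t0)
    return timings
```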

Published Artifacts

  • Raw results (per test, per repeat)
  • Summary tables (median/best-of-N policy explicit)
  • Baseline manifest (code SHA, env versions, dataset hashes, run config)
  • Correctness/parity report used as gating
  • validation_report.json + optional validation_report.pdf (via nextstat validation-report)
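
To make the median/best-of-N policy and the raw-vs-summary distinction concrete, here is a hedged sketch of deriving one summary row from raw per-repeat timings; the statistics and key names are illustrative, not the published summary schema:

```python
import statistics

# Illustrative only: aggregate raw per-repeat timings into one summary row so
# the aggregation policy (median, best-of-N) stays explicit and auditable.
def summarize(raw_timings_s: list[float]) -> dict:
    return {
        "n_repeats": len(raw_timings_s),
        "median_s": statistics.median(raw_timings_s),
        "best_s": min(raw_timings_s),   # best-of-N
    }

print(summarize([0.103, 0.101, 0.099, 0.104, 0.100]))
```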

Publishing Automation

Snapshot publishing is automated via scripts in the benchmarks repo:

  • publish_snapshot.py — generates a local snapshot directory containing baseline_manifest.json and README_snippet.md, and schema-validates all artifacts
  • write_baseline_manifest.py — captures the environment (Rust/Python toolchains, OS, CPU/GPU, dependency locks) into a schema-validated manifest, including a best-effort GPU inventory via nvidia-smi when available (see the sketch after this list)
  • report.py --format markdown — renders human-readable summaries from raw results; writes README_snippet.md for inclusion in snapshot READMEs
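
For intuition, the environment capture performed by write_baseline_manifest.py is roughly of this kind; the snippet below is a simplified sketch, not the script's actual implementation or schema:

```python
import json
import platform
import shutil
import subprocess
import sys

# Simplified sketch of environment capture (not the real script's schema).
def capture_environment() -> dict:
    env = {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
    }
    # Best-effort GPU inventory via nvidia-smi, mirroring the approach described above.
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=False,
        )
        env["gpus"] = out.stdout.strip().splitlines()
    return env

print(json.dumps(capture_environment(), indent=2))
```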

Repo Skeleton

The public benchmarks repo follows a pinned-environment structure:

nextstat-public-benchmarks/
  manifests/
    schema/          # JSON schemas for all artifact types
    snapshots/       # published snapshot manifests
  suites/
    hep/             # run.py, suite.py, baselines/
    pharma/          # run.py, suite.py, baselines/
    bayesian/        # run.py, suite.py
    ml/              # suite.py, report.py
  env/
    docker/          # cpu.Dockerfile, cuda.Dockerfile
    python/          # pyproject.toml + uv.lock
    rust/            # rust-toolchain.toml + Cargo.lock
  ci/
    publish.yml      # CI snapshot publishing (CPU)
    publish_gpu.yml  # CI snapshot publishing (self-hosted GPU runner)
    verify.yml       # CI correctness gate

Suite Readiness

| Suite | Status | Notes |
|---|---|---|
| HEP | Runnable | pyhf harness + ROOT baseline template (schema-validated) |
| Pharma | Seed | run.py + suite.py + nlmixr2/Torsten templates (status=skipped, awaiting env) |
| Bayesian | Runnable | Multi-backend: NextStat + CmdStanPy + PyMC (smoke snapshot verified) |
| ML | Published | First GPU snapshot: RTX 4000 SFF Ada, JAX 0.4.38 CUDA, NextStat 0.1.0 |
| Time Series | Planned | Protocol defined |
| Econometrics | Planned | Protocol defined |

Suites

| Suite | Key Metrics |
|---|---|
| HEP | NLL time/call, gradient time, MLE fit wall-time, profile scan, toy throughput (CPU + GPU) |
| Pharma | PK/NLME likelihood + gradient time, fit wall-time, subject-count scaling |
| Bayesian | ESS/sec (bulk + tail), wall-time per effective draw, SBC calibration |
| ML | Cold-start latency, warm throughput, differentiable pipeline cost |
| Time Series | Kalman filter/smoother states/sec, EM convergence cost, forecast latency |
| Econometrics | Panel FE scaling, DiD wall-time, IV/2SLS cost, AIPW vs naive OLS |
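
The Bayesian row above reports ESS/sec (bulk and tail). As a hedged sketch, this reduces to effective sample size divided by post-warmup sampling wall-time; using ArviZ here is an assumption made for illustration, not a statement of what the suite itself calls:

```python
import numpy as np
import arviz as az

# Illustrative ESS/sec for a single parameter: effective sample size (bulk and
# tail) from (chains, draws) posterior samples, divided by sampling wall-time.
def ess_per_second(samples: np.ndarray, wall_time_s: float) -> dict:
    idata = az.convert_to_inference_data(samples)     # expects (chain, draw) layout
    bulk = float(az.ess(idata, method="bulk").x)
    tail = float(az.ess(idata, method="tail").x)
    return {"ess_bulk_per_s": bulk / wall_time_s, "ess_tail_per_s": tail / wall_time_s}

rng = np.random.default_rng(0)
toy_samples = rng.normal(size=(4, 1000))              # 4 chains x 1000 draws (toy data)
print(ess_per_second(toy_samples, wall_time_s=2.5))
```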

Baselines (External Reference Implementations)

Baselines are schema-validated runner templates for external tools. They produce JSON results in the same format as NextStat runs, enabling apples-to-apples comparisons:

| Baseline | Suite | Status |
|---|---|---|
| ROOT/RooFit | HEP | Template + schema (status=skipped, awaiting env) |
| nlmixr2 (R) | Pharma | Template + schema (status=skipped, awaiting env) |
| Torsten (Stan) | Pharma | Template + schema (status=skipped, awaiting env) |
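
To illustrate the shared result format described above, a single per-test record might look like the following; the key names are assumptions made for illustration, not the schemas under manifests/schema/:

```python
# Hypothetical per-test result record (illustrative keys, not the real schema).
# Baseline runners and NextStat runs would emit records of the same shape, so
# comparisons operate on identical fields.
result_record = {
    "suite": "hep",
    "test": "nll_eval",
    "implementation": "root_roofit",          # or "nextstat"
    "repeats_s": [0.412, 0.405, 0.409],       # raw per-repeat wall-times (toy values)
    "parity": {"passed": True, "rel_tol": 1e-6},
}
```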

DOI + Citation

Stable benchmark snapshots are published with a DOI (Zenodo) and a machine-readable CITATION.cff. The DOI points to raw outputs, manifests, and the exact harness version.

First production record: DOI 10.5281/zenodo.18542624

  • zenodo.json template — pre-filled metadata for Zenodo API upload
  • Each snapshot gets a unique DOI that resolves to the full artifact set
  • Machine-readable CITATION.cff for referencing benchmark datasets in papers

Third-Party Replication

The strongest trust signal is an independent rerun. A replication produces:

  • A rerun log with the same harness
  • The baseline manifest of the rerun environment
  • A signed report comparing rerun vs published snapshot (GPG / Sigstore)

Tooling:

  • compare_snapshots.py — diff tool that compares two snapshot directories (datasets, hashes, timing distributions, correctness deltas)
  • signed_report_template.md — pre-filled template for replication reports with environment manifest, summary deltas, and signature block
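
As a rough sketch of what such a diff involves (this is not the actual compare_snapshots.py implementation, and the file names inside each snapshot directory are assumptions), a comparison reduces to checking dataset hashes and reporting relative timing deltas:

```python
import json
from pathlib import Path

# Rough sketch only, not the real compare_snapshots.py. Assumes each snapshot
# directory holds a results.json mapping test name -> median seconds and a
# baseline_manifest.json with dataset hashes (both file names are assumptions).
def compare_snapshots(published: Path, rerun: Path) -> dict:
    def load(directory: Path, name: str) -> dict:
        return json.loads((directory / name).read_text())

    pub_manifest = load(published, "baseline_manifest.json")
    rerun_manifest = load(rerun, "baseline_manifest.json")
    pub_results = load(published, "results.json")
    rerun_results = load(rerun, "results.json")

    report = {
        "same_datasets": pub_manifest.get("dataset_hashes") == rerun_manifest.get("dataset_hashes"),
        "timing_deltas": {},   # rerun relative to published, per test
    }
    for test, pub_median_s in pub_results.items():
        if test in rerun_results:
            report["timing_deltas"][test] = rerun_results[test] / pub_median_s - 1.0
    return report
```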

Blog Posts