# Public Benchmarks
NextStat's public benchmark program treats performance as a scientific claim — with protocols, pinned environments, correctness gates, and artifacts that anyone can rerun and audit.
The harness is open-source: github.com/NextStat/nextstat-public-benchmarks
## What goes in docs vs blog
Docs (this site) are canonical: protocols, contracts, runbooks, and "how to rerun" instructions.
Blog posts are narrative: motivation, design rationale, interpretation of results, and "what this changes" framing.
Rule of thumb: if a reader needs to execute the benchmark, it belongs in docs. If a reader needs to understand why the benchmark program exists, it belongs in the blog.
## Scope
We benchmark end-to-end user workflows, not isolated micro-kernels:
- HEP / HistFactory — NLL evaluation, gradients, MLE fits, profile scans, toy ensembles
- Pharma — PK/NLME likelihood evaluation + fitting loops
- Bayesian — gradient-based samplers (NUTS) with ESS/sec and wall-time
- ML infra — compilation vs execution time, differentiable pipeline throughput
- Time Series — Kalman filter/smoother throughput, EM convergence cost
- Econometrics — Panel FE, DiD, IV/2SLS scaling with cluster count
## Trust Model
For every published snapshot you should be able to answer, from artifacts alone:
| # | Question |
|---|---|
| 1 | What was measured? (definition of tasks + metrics) |
| 2 | On what data? (dataset ID + hash + license) |
| 3 | Under what environment? (OS, CPU/GPU, compiler, Python, dependency versions) |
| 4 | From what code? (NextStat commit hash, dependency lockfiles, build flags) |
| 5 | Does it still match reference? (sanity/parity checks before timing) |
| 6 | How stable is the number? (repeat strategy, distributions, reporting) |
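Each answer should be recoverable mechanically from the snapshot directory. A minimal audit sketch, assuming the artifact layout described below — `raw_results.json` and the field names are illustrative, not the authoritative schema:

```python
import json
from pathlib import Path

# Illustrative audit: can a snapshot answer the six questions from
# artifacts alone? File and field names are assumptions for this sketch.
REQUIRED = {
    "baseline_manifest.json": ["tasks", "datasets", "environment", "code"],  # Q1-Q4
    "validation_report.json": ["parity_checks"],                             # Q5
    "raw_results.json": ["repeats"],                                         # Q6
}

def audit_snapshot(snapshot_dir: str) -> list[str]:
    problems = []
    for filename, fields in REQUIRED.items():
        path = Path(snapshot_dir) / filename
        if not path.exists():
            problems.append(f"missing artifact: {filename}")
            continue
        doc = json.loads(path.read_text())
        problems += [f"{filename}: missing '{f}'" for f in fields if f not in doc]
    return problems

print(audit_snapshot("snapshots/2026-01-hep") or "all six questions answerable")
```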
## Reproducibility Contract
- rust-toolchain.toml + Cargo.lock (Rust toolchain + dependencies)
- Python version + dependency lock (uv / pip-tools / poetry)
- GPU runtime details when used (CUDA version / Metal / driver)
- Correctness gating before timing — the harness validates parity against a reference within an explicit tolerance (see the sketch after this list)
- Raw per-test measurements published, not only summaries
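A minimal sketch of the gating pattern, with placeholder NLL callables and tolerance standing in for the harness's real tasks and config:

```python
import math
import time

# Sketch of correctness gating: timing only runs after the candidate
# matches the reference within an explicit tolerance. The NLL functions
# and RTOL here are placeholders, not the harness's actual API.
RTOL = 1e-8

def gated_benchmark(candidate_nll, reference_nll, params, n_repeats=20):
    # 1) Parity gate: refuse to time a wrong answer.
    got, want = candidate_nll(params), reference_nll(params)
    if not math.isclose(got, want, rel_tol=RTOL):
        raise RuntimeError(f"parity failed: {got} vs {want} (rtol={RTOL})")

    # 2) Only then measure, keeping every repeat (raw, not just a summary).
    timings = []
    for _ in range(n_repeats):
        t0 = time.perf_counter()
        candidate_nll(params)
        timings.append(time.perf_counter() - t0)
    return timings
```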
## Published Artifacts
- Raw results (per test, per repeat)
- Summary tables (median/best-of-N aggregation policy stated explicitly; see the sketch after this list)
- Baseline manifest (code SHA, env versions, dataset hashes, run config)
- Correctness/parity report used as gating
- validation_report.json + optional validation_report.pdf (via nextstat validation-report)
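For example, deriving a headline number from raw per-repeat timings with the aggregation policy named rather than implied (data values are illustrative):

```python
import statistics

# Sketch: summary numbers derived from published raw per-repeat timings,
# with the aggregation policy stated explicitly.
def summarize(timings_s: list[float], policy: str = "median") -> float:
    if policy == "median":
        return statistics.median(timings_s)
    if policy == "best-of-n":
        return min(timings_s)
    raise ValueError(f"unknown policy: {policy}")

raw = [0.0121, 0.0118, 0.0125, 0.0119, 0.0132]   # per-repeat seconds
print(summarize(raw, "median"), summarize(raw, "best-of-n"))
```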
## Publishing Automation
Snapshot publishing is automated via scripts in the benchmarks repo:
- `publish_snapshot.py` — generates a local snapshot directory with `baseline_manifest.json` and `README_snippet.md`, and schema-validates all artifacts
- `write_baseline_manifest.py` — captures the environment (Rust/Python toolchains, OS, CPU/GPU, dependency locks) into a schema-validated manifest; includes best-effort GPU inventory via `nvidia-smi` when available
- `report.py --format markdown` — renders human-readable summaries from raw results; writes `README_snippet.md` for inclusion in snapshot READMEs
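The schema-validation step can be illustrated with the `jsonschema` package; the artifact/schema file pairing here is illustrative, not the publish script's actual logic:

```python
import json
from pathlib import Path

import jsonschema  # third-party: pip install jsonschema

# Sketch of "schema-validates all artifacts": every JSON artifact in a
# snapshot is checked against a schema from manifests/schema/.
def validate_artifact(artifact_path: str, schema_path: str) -> None:
    artifact = json.loads(Path(artifact_path).read_text())
    schema = json.loads(Path(schema_path).read_text())
    jsonschema.validate(instance=artifact, schema=schema)  # raises on mismatch

validate_artifact(
    "snapshots/2026-01-hep/baseline_manifest.json",
    "manifests/schema/baseline_manifest.schema.json",
)
```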
## Repo Skeleton
The public benchmarks repo follows a pinned-environment structure:
```
nextstat-public-benchmarks/
  manifests/
    schema/             # JSON schemas for all artifact types
    snapshots/          # published snapshot manifests
  suites/
    hep/                # run.py, suite.py, baselines/
    pharma/             # run.py, suite.py, baselines/
    bayesian/           # run.py, suite.py
    ml/                 # suite.py, report.py
  env/
    docker/             # cpu.Dockerfile, cuda.Dockerfile
    python/             # pyproject.toml + uv.lock
    rust/               # rust-toolchain.toml + Cargo.lock
  ci/
    publish.yml         # CI snapshot publishing (CPU)
    publish_gpu.yml     # CI snapshot publishing (self-hosted GPU runner)
    verify.yml          # CI correctness gate
```

## Suite Readiness
| Suite | Status | Notes |
|---|---|---|
| HEP | Runnable | pyhf harness + ROOT baseline template (schema-validated) |
| Pharma | Seed | run.py + suite.py + nlmixr2/Torsten templates (status=skipped, awaiting env) |
| Bayesian | Runnable | Multi-backend: NextStat + CmdStanPy + PyMC (smoke snapshot verified) |
| ML | Published | First GPU snapshot: RTX 4000 SFF Ada, JAX 0.4.38 CUDA, NextStat 0.1.0 |
| Time Series | Planned | Protocol defined |
| Econometrics | Planned | Protocol defined |
## Suites
| Suite | Key Metrics |
|---|---|
| HEP | NLL time/call, gradient time, MLE fit wall-time, profile scan, toy throughput (CPU + GPU) |
| Pharma | PK/NLME likelihood + gradient time, fit wall-time, subject-count scaling |
| Bayesian | ESS/sec (bulk + tail), wall-time per effective draw, SBC calibration |
| ML | Cold-start latency, warm throughput, differentiable pipeline cost |
| Time Series | Kalman filter/smoother states/sec, EM convergence cost, forecast latency |
| Econometrics | Panel FE scaling, DiD wall-time, IV/2SLS cost, AIPW vs naive OLS |
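For the Bayesian suite, ESS/sec can be computed along these lines, using ArviZ as one possible ESS estimator (the `idata` and `wall_time_s` inputs would come from the sampler run being benchmarked; names are illustrative):

```python
import arviz as az

# Sketch of the Bayesian suite's headline metric: effective samples per
# second, for both bulk and tail ESS.
def ess_per_second(idata, wall_time_s: float) -> dict:
    bulk = az.ess(idata, method="bulk").to_array().min().item()
    tail = az.ess(idata, method="tail").to_array().min().item()
    # Report the *worst* parameter: a chain is only as usable as its
    # slowest-mixing dimension.
    return {"bulk_ess_per_s": bulk / wall_time_s,
            "tail_ess_per_s": tail / wall_time_s}
```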
## Baselines (External Reference Implementations)
Baselines are schema-validated runner templates for external tools. They produce JSON results in the same format as NextStat runs, enabling apples-to-apples comparisons (a sketch of the shared format follows the table):
| Baseline | Suite | Status |
|---|---|---|
| ROOT/RooFit | HEP | Template + schema (status=skipped, awaiting env) |
| nlmixr2 (R) | Pharma | Template + schema (status=skipped, awaiting env) |
| Torsten (Stan) | Pharma | Template + schema (status=skipped, awaiting env) |
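A sketch of what a baseline runner might emit in the shared result shape — field names are illustrative; the authoritative shape lives in `manifests/schema/`:

```python
import json
import platform
import time

# Sketch of a baseline runner emitting results in the shared JSON shape,
# so external tools and NextStat runs can be diffed directly.
def emit_result(tool: str, task: str, timings_s: list[float], status: str = "ok"):
    result = {
        "tool": tool,                 # e.g. "root", "nlmixr2", "nextstat"
        "task": task,                 # task ID shared across runners
        "status": status,             # "ok" or "skipped" (awaiting env)
        "timings_s": timings_s,       # raw per-repeat measurements
        "host": platform.platform(),
        "timestamp": time.time(),
    }
    print(json.dumps(result, indent=2))

emit_result("root", "hep/nll_eval/workspace_small", [0.042, 0.041, 0.043])
```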
## DOI + Citation
Stable benchmark snapshots are published with a DOI (Zenodo) and a machine-readable CITATION.cff. The DOI points to raw outputs, manifests, and the exact harness version.
First production record: DOI 10.5281/zenodo.18542624
- `zenodo.json` template — pre-filled metadata for Zenodo API upload
- Each snapshot gets a unique DOI that resolves to the full artifact set
- Machine-readable CITATION.cff for referencing benchmark datasets in papers
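A sketch of how the `zenodo.json` template could feed Zenodo's REST deposit API — the flow follows Zenodo's public API docs, but the template path, env-var name, and elided file-upload/publish steps are assumptions, not the repo's actual upload script:

```python
import json
import os

import requests  # third-party: pip install requests

API = "https://zenodo.org/api/deposit/depositions"
TOKEN = {"access_token": os.environ["ZENODO_TOKEN"]}

# 1) Create an empty deposition, then attach the template metadata
#    (assuming zenodo.json holds the bare metadata object).
dep = requests.post(API, params=TOKEN, json={})
dep.raise_for_status()
dep_id = dep.json()["id"]

metadata = {"metadata": json.load(open("zenodo.json"))}
requests.put(f"{API}/{dep_id}", params=TOKEN, json=metadata).raise_for_status()
# 2) Files (raw results, manifests, harness ref) are uploaded next; publishing
#    the deposition then mints the snapshot's DOI.
```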
## Third-Party Replication
The strongest trust signal is an independent rerun. A replication produces:
- A rerun log with the same harness
- The baseline manifest of the rerun environment
- A signed report comparing rerun vs published snapshot (GPG / Sigstore)
Tooling:
- `compare_snapshots.py` — diff tool that compares two snapshot directories (datasets, hashes, timing distributions, correctness deltas)
- `signed_report_template.md` — pre-filled template for replication reports with environment manifest, summary deltas, and signature block
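Illustrative of what such a snapshot diff checks — this is a sketch of the role `compare_snapshots.py` plays, not its actual implementation, and the `raw_results.json` layout is assumed:

```python
import json
from pathlib import Path
from statistics import median

# Sketch: compare per-task median timings between a published snapshot
# and a third-party rerun, flagging shifts beyond a relative threshold.
def compare(published: str, rerun: str, max_rel_delta: float = 0.10):
    a = json.loads((Path(published) / "raw_results.json").read_text())
    b = json.loads((Path(rerun) / "raw_results.json").read_text())
    for task, repeats in a["timings_s"].items():
        m_pub, m_new = median(repeats), median(b["timings_s"][task])
        delta = (m_new - m_pub) / m_pub
        flag = "OK" if abs(delta) <= max_rel_delta else "DIVERGES"
        print(f"{task}: {m_pub:.4f}s -> {m_new:.4f}s ({delta:+.1%}) {flag}")
```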
