# Public Benchmarks
NextStat's public benchmark program treats performance as a scientific claim — with protocols, pinned environments, correctness gates, and artifacts that anyone can rerun and audit.
The harness is open-source: github.com/NextStat/nextstat-public-benchmarks
## What goes in docs vs blog
Docs (this site) are canonical: protocols, contracts, runbooks, and "how to rerun" instructions.
Blog posts are narrative: motivation, design rationale, interpretation of results, and "what this changes" framing.
Rule of thumb: if a reader needs to execute the benchmark, it belongs in docs. If a reader needs to understand why the benchmark program exists, it belongs in the blog.
## Scope
We benchmark end-to-end user workflows, not isolated micro-kernels:
- HEP / HistFactory — NLL evaluation, gradients, MLE fits, profile scans, toy ensembles
- Pharma — PK/NLME likelihood evaluation + fitting loops
- Bayesian — gradient-based samplers (NUTS) with ESS/sec and wall-time
- ML infra — compilation vs execution time, differentiable pipeline throughput
- Time Series — Kalman filter/smoother throughput, EM convergence cost
- Econometrics — Panel FE, DiD, IV/2SLS scaling with cluster count
## Trust Model
For every published snapshot you should be able to answer, from artifacts alone:
| # | Question |
|---|---|
| 1 | What was measured? (definition of tasks + metrics) |
| 2 | On what data? (dataset ID + hash + license) |
| 3 | Under what environment? (OS, CPU/GPU, compiler, Python, dependency versions) |
| 4 | From what code? (NextStat commit hash, dependency lockfiles, build flags) |
| 5 | Does it still match reference? (sanity/parity checks before timing) |
| 6 | How stable is the number? (repeat strategy, distributions, reporting) |
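Each answer should be recoverable mechanically from the snapshot directory. A minimal audit sketch, assuming the artifact layout described below — `raw_results.json` and the field names are illustrative, not the authoritative schema:

```python
import json
from pathlib import Path

# Illustrative audit: can a snapshot answer the six questions from
# artifacts alone? File and field names are assumptions for this sketch.
REQUIRED = {
    "baseline_manifest.json": ["tasks", "datasets", "environment", "code"],  # Q1-Q4
    "validation_report.json": ["parity_checks"],                             # Q5
    "raw_results.json": ["repeats"],                                         # Q6
}

def audit_snapshot(snapshot_dir: str) -> list[str]:
    problems = []
    for filename, fields in REQUIRED.items():
        path = Path(snapshot_dir) / filename
        if not path.exists():
            problems.append(f"missing artifact: {filename}")
            continue
        doc = json.loads(path.read_text())
        problems += [f"{filename}: missing '{f}'" for f in fields if f not in doc]
    return problems

print(audit_snapshot("snapshots/2026-01-hep") or "all six questions answerable")
```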
## Reproducibility Contract
- rust-toolchain.toml + Cargo.lock (Rust toolchain + dependencies)
- Python version + dependency lock (uv / pip-tools / poetry)
- GPU runtime details when used (CUDA version / Metal / driver)
- Correctness gating before timing — the harness validates parity against a reference within an explicit tolerance (see the sketch after this list)
- Raw per-test measurements published, not only summaries
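A minimal sketch of the gating pattern, with placeholder NLL callables and tolerance standing in for the harness's real tasks and config:

```python
import math
import time

# Sketch of correctness gating: timing only runs after the candidate
# matches the reference within an explicit tolerance. The NLL functions
# and RTOL here are placeholders, not the harness's actual API.
RTOL = 1e-8

def gated_benchmark(candidate_nll, reference_nll, params, n_repeats=20):
    # 1) Parity gate: refuse to time a wrong answer.
    got, want = candidate_nll(params), reference_nll(params)
    if not math.isclose(got, want, rel_tol=RTOL):
        raise RuntimeError(f"parity failed: {got} vs {want} (rtol={RTOL})")

    # 2) Only then measure, keeping every repeat (raw, not just a summary).
    timings = []
    for _ in range(n_repeats):
        t0 = time.perf_counter()
        candidate_nll(params)
        timings.append(time.perf_counter() - t0)
    return timings
```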
## Published Artifacts
- Raw results (per test, per repeat)
- Summary tables (median/best-of-N aggregation policy stated explicitly; see the sketch after this list)
- Baseline manifest (code SHA, env versions, dataset hashes, run config)
- Correctness/parity report used as gating
- validation_report.json + optional validation_report.pdf (via nextstat validation-report)
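For example, deriving a headline number from raw per-repeat timings with the aggregation policy named rather than implied (data values are illustrative):

```python
import statistics

# Sketch: summary numbers derived from published raw per-repeat timings,
# with the aggregation policy stated explicitly.
def summarize(timings_s: list[float], policy: str = "median") -> float:
    if policy == "median":
        return statistics.median(timings_s)
    if policy == "best-of-n":
        return min(timings_s)
    raise ValueError(f"unknown policy: {policy}")

raw = [0.0121, 0.0118, 0.0125, 0.0119, 0.0132]   # per-repeat seconds
print(summarize(raw, "median"), summarize(raw, "best-of-n"))
```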
## Publishing Automation
Snapshot publishing is automated via scripts in the benchmarks repo:
- `publish_snapshot.py` — generates a local snapshot directory with `baseline_manifest.json` and `README_snippet.md`, and schema-validates all artifacts
- `write_baseline_manifest.py` — captures the environment (Rust/Python toolchains, OS, CPU/GPU, dependency locks) into a schema-validated manifest; includes best-effort GPU inventory via `nvidia-smi` when available
- `report.py --format markdown` — renders human-readable summaries from raw results; writes `README_snippet.md` for inclusion in snapshot READMEs
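The schema-validation step can be illustrated with the `jsonschema` package; the artifact/schema file pairing here is illustrative, not the publish script's actual logic:

```python
import json
from pathlib import Path

import jsonschema  # third-party: pip install jsonschema

# Sketch of "schema-validates all artifacts": every JSON artifact in a
# snapshot is checked against a schema from manifests/schema/.
def validate_artifact(artifact_path: str, schema_path: str) -> None:
    artifact = json.loads(Path(artifact_path).read_text())
    schema = json.loads(Path(schema_path).read_text())
    jsonschema.validate(instance=artifact, schema=schema)  # raises on mismatch

validate_artifact(
    "snapshots/2026-01-hep/baseline_manifest.json",
    "manifests/schema/baseline_manifest.schema.json",
)
```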
## Repo Skeleton
The public benchmarks repo follows a pinned-environment structure:
```
nextstat-public-benchmarks/
  manifests/
    schema/             # JSON schemas for all artifact types
    snapshots/          # published snapshot manifests
  suites/
    hep/                # run.py, suite.py, baselines/
    pharma/             # run.py, suite.py, baselines/
    bayesian/           # run.py, suite.py
    ml/                 # suite.py, report.py
  env/
    docker/             # cpu.Dockerfile, cuda.Dockerfile
    python/             # pyproject.toml + uv.lock
    rust/               # rust-toolchain.toml + Cargo.lock
  ci/
    publish.yml         # CI snapshot publishing (CPU)
    publish_gpu.yml     # CI snapshot publishing (self-hosted GPU runner)
    verify.yml          # CI correctness gate
```

## Suite Readiness
| Suite | Status | Notes |
|---|---|---|
| HEP | Runnable | pyhf harness + ROOT baseline template (schema-validated) |
| Pharma | Seed | run.py + suite.py + nlmixr2/Torsten templates (status=skipped, awaiting env) |
| Bayesian | Runnable | Multi-backend: NextStat + CmdStanPy + PyMC (smoke snapshot verified) |
| ML | Published | First GPU snapshot: RTX 4000 SFF Ada, JAX 0.4.38 CUDA, NextStat 0.1.0 |
| Time Series | Planned | Protocol defined |
| Econometrics | Planned | Protocol defined |
## Suites
| Suite | Key Metrics |
|---|---|
| HEP | NLL time/call, gradient time, MLE fit wall-time, profile scan, toy throughput (CPU + GPU) |
| Pharma | PK/NLME likelihood + gradient time, fit wall-time, subject-count scaling |
| Bayesian | ESS/sec (bulk + tail), wall-time per effective draw, SBC calibration |
| ML | Cold-start latency, warm throughput, differentiable pipeline cost |
| Time Series | Kalman filter/smoother states/sec, EM convergence cost, forecast latency |
| Econometrics | Panel FE scaling, DiD wall-time, IV/2SLS cost, AIPW vs naive OLS |
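For the Bayesian suite, ESS/sec can be computed along these lines, using ArviZ as one possible ESS estimator (the `idata` and `wall_time_s` inputs would come from the sampler run being benchmarked; names are illustrative):

```python
import arviz as az

# Sketch of the Bayesian suite's headline metric: effective samples per
# second, for both bulk and tail ESS.
def ess_per_second(idata, wall_time_s: float) -> dict:
    bulk = az.ess(idata, method="bulk").to_array().min().item()
    tail = az.ess(idata, method="tail").to_array().min().item()
    # Report the *worst* parameter: a chain is only as usable as its
    # slowest-mixing dimension.
    return {"bulk_ess_per_s": bulk / wall_time_s,
            "tail_ess_per_s": tail / wall_time_s}
```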
## Baselines (External Reference Implementations)
Baselines are schema-validated runner templates for external tools. They produce JSON results in the same format as NextStat runs, enabling apples-to-apples comparisons (a sketch of the shared format follows the table):
| Baseline | Suite | Status |
|---|---|---|
| ROOT/RooFit | HEP | Template + schema (status=skipped, awaiting env) |
| nlmixr2 (R) | Pharma | Template + schema (status=skipped, awaiting env) |
| Torsten (Stan) | Pharma | Template + schema (status=skipped, awaiting env) |
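A sketch of what a baseline runner might emit in the shared result shape — field names are illustrative; the authoritative shape lives in `manifests/schema/`:

```python
import json
import platform
import time

# Sketch of a baseline runner emitting results in the shared JSON shape,
# so external tools and NextStat runs can be diffed directly.
def emit_result(tool: str, task: str, timings_s: list[float], status: str = "ok"):
    result = {
        "tool": tool,                 # e.g. "root", "nlmixr2", "nextstat"
        "task": task,                 # task ID shared across runners
        "status": status,             # "ok" or "skipped" (awaiting env)
        "timings_s": timings_s,       # raw per-repeat measurements
        "host": platform.platform(),
        "timestamp": time.time(),
    }
    print(json.dumps(result, indent=2))

emit_result("root", "hep/nll_eval/workspace_small", [0.042, 0.041, 0.043])
```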
## DOI + Citation
Stable benchmark snapshots are published with a DOI (Zenodo) and a machine-readable CITATION.cff. The DOI points to raw outputs, manifests, and the exact harness version.
First production record: DOI 10.5281/zenodo.18542624
- `zenodo.json` template — pre-filled metadata for Zenodo API upload
- Each snapshot gets a unique DOI that resolves to the full artifact set
- Machine-readable CITATION.cff for referencing benchmark datasets in papers
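A sketch of how the `zenodo.json` template could feed Zenodo's REST deposit API — the flow follows Zenodo's public API docs, but the template path, env-var name, and elided file-upload/publish steps are assumptions, not the repo's actual upload script:

```python
import json
import os

import requests  # third-party: pip install requests

API = "https://zenodo.org/api/deposit/depositions"
TOKEN = {"access_token": os.environ["ZENODO_TOKEN"]}

# 1) Create an empty deposition, then attach the template metadata
#    (assuming zenodo.json holds the bare metadata object).
dep = requests.post(API, params=TOKEN, json={})
dep.raise_for_status()
dep_id = dep.json()["id"]

metadata = {"metadata": json.load(open("zenodo.json"))}
requests.put(f"{API}/{dep_id}", params=TOKEN, json=metadata).raise_for_status()
# 2) Files (raw results, manifests, harness ref) are uploaded next; publishing
#    the deposition then mints the snapshot's DOI.
```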
## Third-Party Replication
The strongest trust signal is an independent rerun. A replication produces:
- A rerun log with the same harness
- The baseline manifest of the rerun environment
- A signed report comparing rerun vs published snapshot (GPG / Sigstore)
Tooling:
- `compare_snapshots.py` — diff tool that compares two snapshot directories (datasets, hashes, timing distributions, correctness deltas)
- `signed_report_template.md` — pre-filled template for replication reports with environment manifest, summary deltas, and signature block
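Illustrative of what such a snapshot diff checks — this is a sketch of the role `compare_snapshots.py` plays, not its actual implementation, and the `raw_results.json` layout is assumed:

```python
import json
from pathlib import Path
from statistics import median

# Sketch: compare per-task median timings between a published snapshot
# and a third-party rerun, flagging shifts beyond a relative threshold.
def compare(published: str, rerun: str, max_rel_delta: float = 0.10):
    a = json.loads((Path(published) / "raw_results.json").read_text())
    b = json.loads((Path(rerun) / "raw_results.json").read_text())
    for task, repeats in a["timings_s"].items():
        m_pub, m_new = median(repeats), median(b["timings_s"][task])
        delta = (m_new - m_pub) / m_pub
        flag = "OK" if abs(delta) <= max_rel_delta else "DIVERGES"
        print(f"{task}: {m_pub:.4f}s -> {m_new:.4f}s ({delta:+.1%}) {flag}")
```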
