Benchmark Snapshots as Products: CI Artifacts, Manifests, and Baselines
How to turn a one-off run into an immutable artifact set that can be downloaded, hash-verified, rerun, and compared.
2026-02-01 · 5 min read
Benchmarks are not just measurements — they are claims. If a claim is not rerunnable, it is not evidence. It's a screenshot.
This post explains the publishing layer of our public benchmark program: how we turn "we ran it once" into a benchmark snapshot that others can rerun, audit, and (eventually) cite.
Canonical specification (protocol + artifacts): Public Benchmarks.
Abstract. We treat a benchmark snapshot as a product artifact set, not a blog table: raw per-test measurements (distributions, not only medians), correctness gates (proof that the computation matches what we claim), pinned environments (so "install" means "same deps"), manifests + hashes (so "downloaded" means "unchanged"), and an index format for discovery and replication.
1. Definitions: what a "snapshot" is (and isn't)
A snapshot is an immutable set of files produced by a benchmark harness run. In our language:
- Harness — the code that runs benchmark workflows and writes outputs
- Snapshot ID — an opaque identifier (e.g. snapshot-2026-02-08) that maps to exactly one artifact set
- Raw results — per-test, per-repeat timings and correctness deltas
- Manifest — machine-readable "what was run on what" metadata
- Correctness gate — an explicit check that fails fast if results are inconsistent with the reference
- Deterministic mode — best-effort stable JSON/PDF output to support hashing and diffing
What a snapshot is not: a single "best run" number, a chart without the raw samples, or a benchmark that doesn't prove it computed the same model as the reference.
2. Snapshot anatomy: the minimum publishable artifact set
At minimum, each snapshot includes:
- Raw results (per test, per repeat)
- Summaries (tables/plots derived from raw data)
- Baseline manifest — baseline_manifest.json (schema nextstat.baseline_manifest.v1): harness commit, NextStat version, environment, and the list of dataset.id + dataset.sha256 extracted from suite results.
- Snapshot index — snapshot_index.json (schema nextstat.snapshot_index.v1): artifact paths, sizes, and SHA-256 hashes. This is the "byte inventory" for download verification and cross-snapshot comparison.
- Correctness gates (parity/sanity checks that validate the run)
- Pinned NextStat build (recommended): either nextstat.wheel_sha256 in the manifest, or the wheel file itself (nextstat_wheel.whl) inside the snapshot for DOI publication
- Validation pack (optional, but useful for "evidence-grade" publication): a unified bundle for audit and signing.
  - validation_report.json (schema validation_report_v1)
  - Optional validation_report.pdf
  - validation_pack_manifest.json (SHA-256 + sizes for core files)
The single-command entrypoint for generating a complete validation pack:
make validation-pack

This generates apex2_master_report.json + validation_report.json (and an optional PDF) + validation_pack_manifest.json in tmp/validation_pack/.
See: Validation Report Artifacts.
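To make the "byte inventory" idea concrete, here is a minimal sketch of the pattern behind these manifests: hash every artifact, record its size, and emit a machine-readable inventory. It is illustrative only; the field names and the tmp/validation_pack path are placeholders, and the authoritative shapes are the schemas named above.

# Illustrative sketch of a hash-and-size inventory; the real files follow
# the validation_pack_manifest / snapshot_index schemas, not this layout.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large artifacts need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_inventory(snapshot_dir: Path) -> dict:
    """Walk a snapshot directory, recording path, size, and SHA-256 per artifact."""
    entries = []
    for path in sorted(snapshot_dir.rglob("*")):
        if path.is_file():
            entries.append({
                "path": str(path.relative_to(snapshot_dir)),
                "size_bytes": path.stat().st_size,
                "sha256": sha256_of(path),
            })
    return {"artifacts": entries}

if __name__ == "__main__":
    print(json.dumps(build_inventory(Path("tmp/validation_pack")), indent=2))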
3. Determinism: why hashing is a feature, not a nicety
In benchmark publishing, two properties matter:
- Immutability — a snapshot ID maps to a fixed artifact set
- Verifiability — outsiders can confirm they got the same bytes you published
That is why we invest in deterministic artifact generation:
- Strip timestamps where possible (e.g., in validation_report.json, when running nextstat validation-report --deterministic the generated_at field is set to null)
- Stable JSON key ordering
- Stable ordering for "set-like" arrays
- Fixture-driven deterministic PDF rendering for audit packs
Important: the snapshot index snapshot_index.json does contain generated_at as a timestamp, because it is a discovery/inventory format. Determinism is required primarily for artifacts intended for signing and re-verification as an "evidence pack".
In CI we treat determinism as an invariant: re-rendering the same inputs must produce bit-identical JSON/PDF and an identical validation_pack_manifest.json.
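As a minimal sketch of what "deterministic" means in practice (the function name is illustrative, not the harness API): sorted keys, a nulled generated_at field, and a fixed encoding, so re-rendering the same inputs yields bit-identical bytes and therefore an identical SHA-256.

# Sketch of deterministic JSON rendering; render_deterministic is an
# illustrative name, not part of the NextStat CLI or harness.
import hashlib
import json

def render_deterministic(report: dict) -> bytes:
    doc = dict(report)
    doc["generated_at"] = None  # strip the wall-clock timestamp
    text = json.dumps(doc, sort_keys=True, indent=2, ensure_ascii=False)
    return (text + "\n").encode("utf-8")  # fixed key order, fixed encoding

a = render_deterministic({"schema_version": "validation_report_v1",
                          "generated_at": "2026-02-08T12:00:00Z", "checks": []})
b = render_deterministic({"checks": [],
                          "generated_at": "2026-02-08T13:30:00Z",
                          "schema_version": "validation_report_v1"})
assert a == b  # bit-identical bytes...
assert hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest()  # ...and identical hashes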
4. Why CI is the right publisher (and what CI does not solve)
Local benchmarks are useful, but they are not publication-grade evidence because:
- Environment drift is invisible
- Cache state is inconsistent
- Operator steps are undocumented
CI runs are better because the harness is automated, snapshots are consistent and indexed, and artifacts can be attached immutably. But CI does not solve hardware representativeness or guarantee that a CI environment matches production. That's why we also publish environment manifests and invite external replications.
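As a rough sketch of the kind of environment metadata such a manifest records (the authoritative fields are defined by the nextstat.baseline_manifest.v1 schema; this helper is illustrative):

# Illustrative environment capture; the real manifest fields are defined
# by the nextstat.baseline_manifest.v1 schema, not this helper.
import json
import platform

def environment_block() -> dict:
    return {
        "python": platform.python_version(),  # e.g. "3.13.1"
        "platform": platform.platform(),      # e.g. "Linux-6.8..."
    }

print(json.dumps(environment_block(), indent=2, sort_keys=True))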
5. Baselines: avoiding "moving targets"
Baselines serve two purposes: regression detection (did we get slower?) and trend analysis (how does performance change over time?). But a baseline quickly becomes meaningless if it "drifts" and changes implicitly.
That is why we treat baselines as explicit, versioned references: "compare against snapshot X" — not "compare against whatever ran last week".
In the public benchmarks repository skeleton, baseline manifests are versioned JSON documents with a schema. For example (abbreviated):
{
"schema_version": "nextstat.baseline_manifest.v1",
"snapshot_id": "snapshot-2026-02-08",
"deterministic": true,
"harness": { "repo": "nextstat-public-benchmarks", "git_commit": "…" },
"nextstat": { "version": "0.9.0", "wheel_sha256": "…" },
"environment": { "python": "3.13.1", "platform": "Linux-6.8…" },
"datasets": [{ "id": "hep/simple_workspace.json", "sha256": "…" }],
"results": [{ "suite": "hep", "path": "out/hep_simple_nll.json", "sha256": "…" }]
}

What matters is not the specific fields you choose, but that a baseline is a named, hashable reference.
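To show what "compare against snapshot X" looks like as a mechanical check, here is an illustrative sketch that diffs dataset hashes between a pinned baseline manifest and a current run. The helper name and paths are ours for illustration, not a harness command; the file layout follows the example above.

# Sketch: detect dataset drift against a pinned baseline manifest.
import json
from pathlib import Path

def load_datasets(manifest_path: Path) -> dict:
    """Map dataset.id -> dataset.sha256 from a baseline_manifest.json file."""
    manifest = json.loads(manifest_path.read_text())
    return {d["id"]: d["sha256"] for d in manifest.get("datasets", [])}

def compare_datasets(baseline_path: Path, current_path: Path) -> list[str]:
    """Return a list of human-readable drift problems; empty means no drift."""
    baseline, current = load_datasets(baseline_path), load_datasets(current_path)
    problems = []
    for dataset_id, sha in sorted(baseline.items()):
        if dataset_id not in current:
            problems.append(f"missing dataset: {dataset_id}")
        elif current[dataset_id] != sha:
            problems.append(f"dataset hash changed: {dataset_id}")
    return problems

# Usage (paths illustrative):
# problems = compare_datasets(
#     Path("manifests/snapshots/snapshot-2026-02-08/baseline_manifest.json"),
#     Path("out/baseline_manifest.json"))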
6. Indexing: making snapshots discoverable (and comparable)
An artifact set that cannot be discovered and uniquely identified is not "public". We use a minimal JSON "snapshot index" format (schema nextstat.snapshot_index.v1) that links:
- Suite name
- Git SHA/ref
- Workflow metadata
- Artifact paths and SHA-256 hashes
This index is also the anchor for third-party replication: if you cannot point to what exactly was published, you cannot reproduce it.
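An abbreviated, illustrative sketch of what such an index can look like; everything beyond the schema identifier is a placeholder, and the authoritative shape lives in the schemas under manifests/schema/:

{
  "schema_version": "nextstat.snapshot_index.v1",
  "snapshot_id": "snapshot-2026-02-08",
  "generated_at": "2026-02-08T12:00:00Z",
  "source": { "repo": "nextstat-public-benchmarks", "git_commit": "…", "workflow": "…" },
  "artifacts": [
    { "suite": "hep", "path": "out/hep_simple_nll.json", "size_bytes": 12345, "sha256": "…" }
  ]
}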
6.1. Publishing a snapshot (seed harness)
In the public benchmarks reference repository, a snapshot is assembled by a single publisher command that runs suites, validates JSON schemas, and writes baseline_manifest.json and snapshot_index.json:
python3 benchmarks/nextstat-public-benchmarks/scripts/publish_snapshot.py \
--snapshot-id snapshot-2026-02-08 \
--out-root manifests/snapshots \
--deterministic \
  --fit

This creates a directory manifests/snapshots/<snapshot_id>/ with suite artifacts and validates suite JSON against schemas in manifests/schema/.
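The publisher performs this schema validation itself; an external reviewer can run roughly the same check with the jsonschema package. The schema filename below is illustrative; use whatever ships in manifests/schema/.

# Sketch of the schema check an external reviewer can run; paths are
# illustrative and assume the published repository layout.
import json
from pathlib import Path

import jsonschema  # pip install jsonschema

schema = json.loads(Path("manifests/schema/baseline_manifest.schema.json").read_text())
document = json.loads(
    Path("manifests/snapshots/snapshot-2026-02-08/baseline_manifest.json").read_text())

jsonschema.validate(instance=document, schema=schema)  # raises ValidationError on mismatch
print("schema check passed")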
7. How an external participant verifies a snapshot (recipe)
A snapshot should be verifiable without trusting our blog post. The minimum verification loop:
1) Download the published artifact set (raw results + manifests)
2) Verify hashes (from snapshot_index.json or validation_pack_manifest.json)
3) Validate JSON schemas (e.g., validation_report_v1)
4) Rerun the harness on your machine (same suite + same dataset IDs)
5) Compare artifact hashes and/or semantic deltas with the original
The benchmark program is designed so that "verification" is mostly file operations, not subjective interpretation.
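As a minimal sketch of step 2, assuming a downloaded snapshot directory whose snapshot_index.json lists artifacts with path and sha256 fields (entry names are illustrative; defer to the snapshot_index schema):

# Sketch: verify downloaded artifacts against the published snapshot index.
import hashlib
import json
import sys
from pathlib import Path

def verify_snapshot(snapshot_dir: Path) -> bool:
    """Return True iff every artifact listed in the index matches its published hash."""
    index = json.loads((snapshot_dir / "snapshot_index.json").read_text())
    ok = True
    for entry in index.get("artifacts", []):
        data = (snapshot_dir / entry["path"]).read_bytes()
        actual = hashlib.sha256(data).hexdigest()
        if actual != entry["sha256"]:
            print(f"MISMATCH {entry['path']}: expected {entry['sha256']}, got {actual}")
            ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if verify_snapshot(Path(sys.argv[1])) else 1)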
8. DOI and CITATION.cff: when benchmarks become citable evidence
When a snapshot is stable enough to reference in a paper or technical report, we publish it with a DOI (e.g., via Zenodo), add a CITATION.cff, and bind the DOI to the full artifact set — not a "screenshot of a table".
That's not paperwork — it's the difference between a marketing claim and a citable dataset.
Related reading
- Trust Offensive: Public Benchmarks — the why + the trust model
- Third-Party Replication: Signed Reports — external reruns as the strongest trust signal
- Public Benchmarks Specification — protocols, artifacts, and suite structure
- Validation Report Artifacts — the JSON + PDF system gating every published snapshot
