NextStat

Benchmark Snapshots as Products: CI Artifacts, Manifests, and Baselines

How to turn a one-off run into an immutable artifact set that can be downloaded, hash-verified, rerun, and compared.

Benchmarks · CI · Reproducibility · Trust

2026-02-01 · 5 min read


Trust Offensive series: Index · Prev: End of Scripting Era · Next: Third-Party Replication

Benchmarks are not just measurements — they are claims. If a claim is not rerunnable, it is not evidence. It's a screenshot.

This post explains the publishing layer of our public benchmark program: how we turn "we ran it once" into a benchmark snapshot that others can rerun, audit, and (eventually) cite.

Canonical specification (protocol + artifacts): Public Benchmarks.


Abstract. We treat a benchmark snapshot as a product artifact set, not a blog table: raw per-test measurements (distributions, not only medians), correctness gates (proof that the computation matches what we claim), pinned environments (so "install" means "same deps"), manifests + hashes (so "downloaded" means "unchanged"), and an index format for discovery and replication.

1. Definitions: what a "snapshot" is (and isn't)

A snapshot is an immutable set of files produced by a benchmark harness run. In our language:

  • Harness — the code that runs benchmark workflows and writes outputs
  • Snapshot ID — an opaque identifier (e.g. snapshot-2026-02-08) that maps to exactly one artifact set
  • Raw results — per-test, per-repeat timings and correctness deltas
  • Manifest — machine-readable "what was run on what" metadata
  • Correctness gate — an explicit check that fails fast if results are inconsistent with the reference
  • Deterministic mode — best-effort stable JSON/PDF output to support hashing and diffing

What a snapshot is not: a single "best run" number, a chart without the raw samples, or a benchmark that doesn't prove it computed the same model as the reference.


2. Snapshot anatomy: the minimum publishable artifact set

At minimum, each snapshot includes:

  • Raw results (per test, per repeat)
  • Summaries (tables/plots derived from raw data)
  • Baseline manifest (baseline_manifest.json, schema nextstat.baseline_manifest.v1): harness commit, NextStat version, environment, and the list of dataset.id + dataset.sha256 extracted from suite results.
  • Snapshot index (snapshot_index.json, schema nextstat.snapshot_index.v1): artifact paths, sizes, and SHA-256 hashes. This is the "byte inventory" for download verification and cross-snapshot comparison.
  • Correctness gates (parity/sanity checks that validate the run)
  • Pinned NextStat build (recommended): either nextstat.wheel_sha256 in the manifest, or the wheel file itself (nextstat_wheel.whl) inside the snapshot for DOI publication
  • Validation pack (optional, but useful for "evidence-grade" publication): a unified bundle for audit and signing, containing:
      • validation_report.json (schema validation_report_v1)
      • Optional validation_report.pdf
      • validation_pack_manifest.json (SHA-256 + sizes for core files)
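
To make the expectation concrete, here is a minimal sketch of a completeness check over a snapshot directory. It is illustrative only (not the harness code), and the directory path below is hypothetical; the file names are the ones listed above.

python
# Illustrative completeness check for a downloaded snapshot directory.
# File names follow the artifact list above; the directory path is hypothetical.
from pathlib import Path

REQUIRED = [
    "baseline_manifest.json",      # "what was run on what"
    "snapshot_index.json",         # byte inventory: paths, sizes, SHA-256 hashes
]
OPTIONAL = [
    "validation_report.json",
    "validation_report.pdf",
    "validation_pack_manifest.json",
]

def missing_artifacts(snapshot_dir: str) -> list[str]:
    """Return required artifacts that are absent (empty list = publishable minimum)."""
    root = Path(snapshot_dir)
    for name in OPTIONAL:
        if not (root / name).exists():
            print(f"note: optional artifact not present: {name}")
    return [name for name in REQUIRED if not (root / name).exists()]

print(missing_artifacts("manifests/snapshots/snapshot-2026-02-08"))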

The single-command entrypoint for generating a complete validation pack:

bash
make validation-pack

This generates apex2_master_report.json + validation_report.json (and an optional PDF) + validation_pack_manifest.json in tmp/validation_pack/.

See: Validation Report Artifacts.


3. Determinism: why hashing is a feature, not a nicety

In benchmark publishing, two properties matter:

  • Immutability — a snapshot ID maps to a fixed artifact set
  • Verifiability — outsiders can confirm they got the same bytes you published

That is why we invest in deterministic artifact generation:

  • Strip timestamps where possible (e.g., with nextstat validation-report --deterministic, the generated_at field in validation_report.json is set to null)
  • Stable JSON key ordering
  • Stable ordering for "set-like" arrays
  • Fixture-driven deterministic PDF rendering for audit packs
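
As a sketch of what deterministic serialization means in practice (illustrative, not the actual NextStat serializer): null out timestamps, sort keys, normalize set-like arrays, and fix separators, so that the same logical content always hashes to the same digest.

python
# Illustrative deterministic JSON writer (not the actual NextStat serializer).
import hashlib
import json

def to_deterministic_json(report: dict) -> bytes:
    doc = dict(report)
    doc["generated_at"] = None                                  # strip wall-clock timestamp
    if "datasets" in doc:                                       # stable order for set-like arrays
        doc["datasets"] = sorted(doc["datasets"], key=lambda d: d["id"])
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode() + b"\n"

payload = to_deterministic_json(
    {"generated_at": "2026-02-08T12:00:00Z", "datasets": [{"id": "b"}, {"id": "a"}]}
)
print(hashlib.sha256(payload).hexdigest())                      # same inputs -> same digest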

Important: the snapshot index snapshot_index.json does contain generated_at as a timestamp, because it is a discovery/inventory format. Determinism is required primarily for artifacts intended for signing and re-verification as an "evidence pack".

In CI we treat determinism as an invariant: re-rendering the same inputs must produce bit-identical JSON/PDF and an identical validation_pack_manifest.json.
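
A hedged sketch of such a CI check, assuming two independent renders of the pack have been written to the (hypothetical) directories below:

python
# Sketch of a CI determinism gate: two renders of the same inputs must be bit-identical.
# The two directories are hypothetical outputs of two `make validation-pack` runs.
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def assert_bit_identical(run_a: str, run_b: str) -> None:
    for name in ("validation_report.json", "validation_pack_manifest.json"):
        a, b = Path(run_a) / name, Path(run_b) / name
        assert digest(a) == digest(b), f"non-deterministic artifact: {name}"

assert_bit_identical("tmp/validation_pack_run1", "tmp/validation_pack_run2")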


4. Why CI is the right publisher (and what CI does not solve)

Local benchmarks are useful, but they are not publication-grade evidence because:

  • Environment drift is invisible
  • Cache state is inconsistent
  • Operator steps are undocumented

CI runs are better because the harness is automated, snapshots are consistent and indexed, and artifacts can be attached immutably. But CI does not solve hardware representativeness or guarantee that a CI environment matches production. That's why we also publish environment manifests and invite external replications.
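
For illustration, the environment block pinned in a baseline manifest (see the example in section 5) is cheap to capture at run time; a minimal sketch using only the standard library:

python
# Illustrative capture of the environment block pinned in a baseline manifest.
import json
import platform

environment = {
    "python": platform.python_version(),   # e.g. "3.13.1"
    "platform": platform.platform(),       # e.g. "Linux-6.8..."
}
print(json.dumps({"environment": environment}, indent=2))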


5. Baselines: avoiding "moving targets"

Baselines serve two purposes: regression detection (did we get slower?) and trend analysis (how does performance change over time?). But a baseline quickly becomes meaningless if it "drifts" and changes implicitly.

That is why we treat baselines as explicit, versioned references: "compare against snapshot X" — not "compare against whatever ran last week".

In the public benchmarks repository skeleton, baseline manifests are versioned JSON documents with a schema. For example (abbreviated):

json
{
  "schema_version": "nextstat.baseline_manifest.v1",
  "snapshot_id": "snapshot-2026-02-08",
  "deterministic": true,
  "harness": { "repo": "nextstat-public-benchmarks", "git_commit": "…" },
  "nextstat": { "version": "0.9.0", "wheel_sha256": "…" },
  "environment": { "python": "3.13.1", "platform": "Linux-6.8…" },
  "datasets": [{ "id": "hep/simple_workspace.json", "sha256": "…" }],
  "results": [{ "suite": "hep", "path": "out/hep_simple_nll.json", "sha256": "…" }]
}

What matters is not the specific fields you choose, but that a baseline is a named, hashable reference.
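
A minimal sketch of what "compare against snapshot X" can look like in code, assuming the manifest fields from the example above and that dataset ids resolve to local paths; the comparison policy (byte equality versus tolerances on timings) is a separate choice per suite:

python
# Illustrative drift check against a pinned, versioned baseline manifest.
# Field names follow the abbreviated baseline_manifest.json example above;
# dataset ids are assumed to resolve to paths under the working directory.
import hashlib
import json
from pathlib import Path

def sha256_of(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def compare_to_baseline(manifest_path: str) -> list[str]:
    """Return drift findings relative to the named baseline snapshot."""
    manifest = json.loads(Path(manifest_path).read_text())
    findings = []
    for ds in manifest["datasets"]:
        if sha256_of(ds["id"]) != ds["sha256"]:       # input drifted: timings are not comparable
            findings.append(f"dataset changed: {ds['id']}")
    for res in manifest["results"]:
        if sha256_of(res["path"]) != res["sha256"]:   # output differs from the pinned snapshot
            findings.append(f"result differs from baseline: {res['path']}")
    return findings

print(compare_to_baseline("manifests/snapshots/snapshot-2026-02-08/baseline_manifest.json"))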


6. Indexing: making snapshots discoverable (and comparable)

An artifact set that cannot be discovered and uniquely identified is not "public". We use a minimal JSON "snapshot index" format (schema nextstat.snapshot_index.v1) that links:

  • Suite name
  • Git SHA/ref
  • Workflow metadata
  • Artifact paths and SHA-256 hashes

This index is also the anchor for third-party replication: if you cannot point to what exactly was published, you cannot reproduce it.
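
For illustration, a single index entry can be derived from an artifact in a few lines; only the path, size, and SHA-256 fields come from the description above, and the exact key names here are assumptions:

python
# Illustrative construction of one artifact entry for a snapshot index.
# Only path, size, and SHA-256 come from the text above; exact key names are assumptions.
import hashlib
from pathlib import Path

def index_entry(path: str) -> dict:
    p = Path(path)
    return {
        "path": str(p),
        "size_bytes": p.stat().st_size,
        "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
    }

print(index_entry("out/hep_simple_nll.json"))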

6.1. Publishing a snapshot (seed harness)

In the public benchmarks reference repository, a snapshot is assembled by a single publisher command that runs suites, validates JSON schemas, and writes baseline_manifest.json and snapshot_index.json:

bash
python3 benchmarks/nextstat-public-benchmarks/scripts/publish_snapshot.py \
  --snapshot-id snapshot-2026-02-08 \
  --out-root manifests/snapshots \
  --deterministic \
  --fit

This creates a directory manifests/snapshots/<snapshot_id>/ with suite artifacts, and validates suite JSON against schemas in manifests/schema/.


7. How an external participant verifies a snapshot (recipe)

A snapshot should be verifiable without trusting our blog post. The minimum verification loop:

  1. Download the published artifact set (raw results + manifests)
  2. Verify hashes (from snapshot_index.json or validation_pack_manifest.json)
  3. Validate JSON schemas (e.g., validation_report_v1)
  4. Rerun the harness on your machine (same suite + same dataset IDs)
  5. Compare artifact hashes and/or semantic deltas with the original

The benchmark program is designed so that "verification" is mostly file operations, not subjective interpretation.
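
Steps 2 and 3 reduce to a few lines of Python. A hedged sketch, assuming the index lists entries with path and sha256 fields and that the relevant JSON Schema file is available locally:

python
# Sketch of the "file operations" part of verification: hash check plus schema check.
# The "artifacts" key and the schema file path are assumptions, not the published layout.
import hashlib
import json
from pathlib import Path

import jsonschema  # third-party: pip install jsonschema

def verify_hashes(index_path: str) -> None:
    index = json.loads(Path(index_path).read_text())
    root = Path(index_path).parent
    for entry in index["artifacts"]:
        digest = hashlib.sha256((root / entry["path"]).read_bytes()).hexdigest()
        assert digest == entry["sha256"], f"hash mismatch: {entry['path']}"

def validate_report(report_path: str, schema_path: str) -> None:
    report = json.loads(Path(report_path).read_text())
    schema = json.loads(Path(schema_path).read_text())
    jsonschema.validate(instance=report, schema=schema)   # raises on any schema violation

verify_hashes("snapshot-2026-02-08/snapshot_index.json")
validate_report("snapshot-2026-02-08/validation_report.json",
                "manifests/schema/validation_report_v1.json")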


8. DOI and CITATION.cff: when benchmarks become citable evidence

When a snapshot is stable enough to reference in a paper or technical report, we publish it with a DOI (e.g., via Zenodo), add a CITATION.cff, and bind the DOI to the full artifact set — not a "screenshot of a table".

That's not paperwork — it's the difference between a marketing claim and a citable dataset.


Related reading