Third-Party Replication: Signed Reports
If you've ever been burned by an "impressive benchmark", you already know the problem: benchmarks are not just measurements — they are claims.
And the only robust way to evaluate a claim is to replicate it.
Our public benchmark program treats third-party replication as a first-class feature, not a nice-to-have.
The canonical replication protocol is documented on the Public Benchmarks page.
Abstract. The strongest trust signal for a benchmark is not "more charts" or "more machines". It is an independent rerun. We operationalize replication as a publishable artifact set: a rerun snapshot_index.json (hashed artifact inventory), a machine-readable replication_report.json comparing original vs rerun, the rerun validation pack (including validation_report.json), and optional signatures for integrity and attribution.
1. Why replication is different from "more benchmarks"
We can publish more machines, more suites, more graphs — and still fail the trust test. Replication is qualitatively different because it adds:
- Independent hardware
- Independent operator errors (the realistic ones)
- Independent scrutiny of the harness and assumptions
If a benchmark can't survive an external rerun, it shouldn't be used as evidence.
2. What we mean by "replication"
A replication is not "I ran something similar". At minimum it means:
- Same suite definition
- Same dataset IDs
- Same harness version (or documented diffs)
- The same correctness gates still pass
- Raw outputs and a baseline manifest are published
The goal is that disagreements become diagnosable:
- Environment differences (compiler, BLAS, GPU driver); a quick capture sketch follows this list
- Dataset drift
- Harness changes
- Or a bug
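When results disagree, the environment side is usually the cheapest to rule out first. The validation report already records environment details; the commands below are just a manual cross-check, and only the relevant ones apply:

uname -srm                                                    # kernel and architecture
gcc --version | head -n 1                                     # compiler
python3 -c "import numpy; numpy.show_config()"                # BLAS backend, if applicable
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # GPU driver, if applicable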
3. The replication artifact set (what gets published)
Replication only works if outsiders can identify exactly what was compared. We publish two small, machine-readable "index" documents alongside the raw results:
- snapshot_index.json (schema nextstat.snapshot_index.v1) — artifact paths + sizes + SHA-256 hashes (a sketch of building one follows this list)
- replication_report.json (schema nextstat.replication_report.v1) — structured comparison (overlap count + mismatches)
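Writing a rerun snapshot index is mostly a hashing exercise. Here is a minimal sketch with standard tools; the rerun_artifacts/ directory and the field names (schema, artifacts, path, size_bytes, sha256) are illustrative assumptions, and the authoritative layout is whatever nextstat.snapshot_index.v1 specifies:

# Sketch: hash each artifact and emit a snapshot-index-style inventory.
for f in rerun_artifacts/*; do
  jq -n --arg path "$f" \
        --arg sha256 "$(sha256sum "$f" | awk '{print $1}')" \
        --argjson size_bytes "$(wc -c < "$f" | tr -d ' ')" \
        '{path: $path, size_bytes: $size_bytes, sha256: $sha256}'
done | jq -s '{schema: "nextstat.snapshot_index.v1", artifacts: .}' > snapshot_index.json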
Additionally, each validation pack includes:
- validation_report.json — dataset fingerprint + environment + suite pass/fail summary (schema validation_report_v1)
- validation_pack_manifest.json — SHA-256 hashes + sizes for core validation pack files (a verification sketch follows this list)
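Checking a pack against its manifest is nearly a one-liner. The .files[], .path, and .sha256 selectors below are assumptions about the manifest layout; adjust them to the published schema:

# Verify every file listed in the manifest against its recorded SHA-256.
jq -r '.files[] | "\(.sha256)  \(.path)"' validation_pack_manifest.json \
  | sha256sum --check -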
Validation pack entry point: Validation Report Artifacts.
nextstat validation-report \
  --apex2 tmp/apex2_master_report.json \
  --workspace workspace.json \
  --out validation_report.json \
  --pdf validation_report.pdf \
  --deterministic

4. Why signed reports
If replication reports matter, they must be attributable and tamper-resistant. A signed report is a lightweight way to guarantee:
- Who produced the report
- What snapshot it refers to
- That the published artifact hasn't been modified
The validation_report.json produced by nextstat validation-report already includes SHA-256 hashes for both the workspace and the Apex2 master report. Adding a GPG or Sigstore signature to that JSON creates a complete chain: data hash → validation result → signer identity.
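A minimal sketch with GPG (Sigstore's cosign sign-blob offers an equivalent keyless flow); it assumes a signing key is already configured and reuses the file names from the command above:

# Sign the report (writes validation_report.json.asc).
gpg --armor --detach-sign validation_report.json

# Anyone can then check the chain end to end:
gpg --verify validation_report.json.asc validation_report.json   # signer identity + file integrity
sha256sum workspace.json tmp/apex2_master_report.json            # compare against the hashes
                                                                  # recorded inside validation_report.json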
We don't need bureaucracy. We need integrity.
5. Step-by-step: a minimal replication loop
The minimal loop is:
1. Download the original snapshot artifacts (snapshot_index.json, validation_pack_manifest.json, validation_report.json)
2. Verify original signatures (if provided)
3. Rerun the suite to produce your own validation pack
4. Write your rerun snapshot_index.json
5. Generate a replication_report.json comparing original vs rerun
6. Sign and publish your replication bundle
This is intentionally designed to be mostly file operations, not "trust our interpretation"; a sketch of those file operations follows.
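In shell terms, the loop might look like the sketch below. The snapshot URL, the original/ and rerun/ layout, and the .asc signature extension are assumptions about how a snapshot is published; substitute whatever the snapshot page actually links to.

# Steps 1-2: fetch the original artifacts and verify signatures where provided.
BASE=https://benchmarks.example.org/snapshots/latest   # hypothetical URL
mkdir -p original rerun
for f in snapshot_index.json validation_pack_manifest.json validation_report.json; do
  curl -fsSL -o "original/$f" "$BASE/$f"
  curl -fsSL -o "original/$f.asc" "$BASE/$f.asc" \
    && gpg --verify "original/$f.asc" "original/$f"
done

# Step 3: rerun the suite on your hardware and finish with nextstat validation-report
# (as shown earlier) to produce your rerun validation pack under rerun/.

# Steps 4-5: write rerun/snapshot_index.json (see the hashing sketch in section 3)
# and rerun/replication_report.json comparing original/ against rerun/.

# Step 6: sign and publish the replication bundle.
gpg --armor --detach-sign rerun/replication_report.json
gpg --armor --detach-sign rerun/snapshot_index.json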
6. What we will do with replications
Replications should not disappear in a comment thread. We plan to:
- Link replications directly from the snapshot index
- Use replicated numbers in public claims
- Prefer "rerun me" evidence over "trust us" language
7. The ask
If you care about reproducible scientific computing, the most valuable contribution is:
- Rerun a published snapshot on your hardware
- Publish your manifest + raw results
- Sign the report
That's how performance claims become community knowledge.
