
Third-Party Replication: Signed Reports

Replication · Trust · Signed Reports · Benchmarks · GPG · Sigstore
Trust Offensive series: Index · Prev: Benchmark Snapshots · Next: HEP Benchmark Harness

If you've ever been burned by an "impressive benchmark", you already know the problem: benchmarks are not just measurements — they are claims.

And the only robust way to evaluate a claim is to replicate it.

Our public benchmark program treats third-party replication as a first-class feature, not a nice-to-have.

The canonical replication protocol is documented on the Public Benchmarks page.


Abstract. The strongest trust signal for a benchmark is not "more charts" or "more machines". It is an independent rerun. We operationalize replication as a publishable artifact set: a rerun snapshot_index.json (hashed artifact inventory), a machine-readable replication_report.json comparing original vs rerun, the rerun validation pack (including validation_report.json), and optional signatures for integrity and attribution.


1. Why replication is different from "more benchmarks"

We can publish more machines, more suites, more graphs — and still fail the trust test. Replication is qualitatively different because it adds:

  • Independent hardware
  • Independent operator errors (the realistic ones)
  • Independent scrutiny of the harness and assumptions

If a benchmark can't survive an external rerun, it shouldn't be used as evidence.


2. What we mean by "replication"

A replication is not "I ran something similar". At minimum it means:

  • Same suite definition
  • Same dataset IDs
  • Same harness version (or documented diffs)
  • The same correctness gates still pass
  • Raw outputs and a baseline manifest are published

The goal is that disagreements become diagnosable:

  • Environment differences (compiler, BLAS, GPU driver)
  • Dataset drift
  • Harness changes
  • Or a bug
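
When a rerun disagrees, the fastest way to localize the cause is to diff the environment sections of the two validation reports (described in the next section). A minimal sketch with jq, assuming the original and rerun packs sit under original/ and rerun/ and that validation_report.json exposes its environment under an environment key, which is an assumption about validation_report_v1 rather than a documented field:

# Diff the environment blocks of the two validation reports (key-sorted for stable output).
# The "environment" field name is an assumption about validation_report_v1.
jq -S '.environment' original/validation_report.json > env_original.json
jq -S '.environment' rerun/validation_report.json > env_rerun.json
diff -u env_original.json env_rerun.json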

3. The replication artifact set (what gets published)

Replication only works if outsiders can identify exactly what was compared. We publish two small, machine-readable "index" documents alongside the raw results:

  • snapshot_index.json (schema nextstat.snapshot_index.v1) — artifact paths + sizes + SHA-256 hashes
  • replication_report.json (schema nextstat.replication_report.v1) — structured comparison (overlap count + mismatches)

Additionally, each validation pack includes:

  • validation_report.json — dataset fingerprint + environment + suite pass/fail summary (schema validation_report_v1)
  • validation_pack_manifest.json — SHA-256 + sizes for core validation pack files

Validation pack entry point: Validation Report Artifacts.

nextstat validation-report \
  --apex2 tmp/apex2_master_report.json \
  --workspace workspace.json \
  --out validation_report.json \
  --pdf validation_report.pdf \
  --deterministic
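
On the rerun side, the snapshot index is just an inventory of paths, sizes, and hashes, so it can be produced with standard tools. A minimal sketch, assuming the rerun artifacts sit under results/; the JSON field names are illustrative assumptions, not the normative nextstat.snapshot_index.v1 layout:

# Build a snapshot_index.json-style inventory: path, size, and SHA-256 per artifact.
# Field names are illustrative; assumes GNU stat and file names without spaces or quotes.
(
  echo '{"schema":"nextstat.snapshot_index.v1","artifacts":['
  first=1
  for f in results/*; do
    [ "$first" -eq 1 ] || echo ','
    first=0
    printf '{"path":"%s","size_bytes":%s,"sha256":"%s"}' \
      "$f" "$(stat -c%s "$f")" "$(sha256sum "$f" | cut -d' ' -f1)"
  done
  echo ']}'
) | jq . > snapshot_index.json

Piping through jq at the end both validates the JSON and pretty-prints it, so a malformed index fails loudly instead of getting published.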

4. Why signed reports

If replication reports matter, they must be attributable and tamper-resistant. A signed report is a lightweight way to guarantee:

  • Who produced the report
  • What snapshot it refers to
  • That the published artifact hasn't been modified

The validation_report.json produced by nextstat validation-report already includes SHA-256 hashes for both the workspace and the Apex2 master report. Adding a GPG or Sigstore signature to that JSON creates a complete chain: data hash → validation result → signer identity.
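
Either toolchain can sign the JSON file directly. As a sketch: the GPG commands produce and verify a detached, armored signature, and the cosign commands use Sigstore keyless signing; the identity and OIDC issuer values are placeholders for whoever actually signs:

# GPG: detached, armored signature published next to the report.
gpg --armor --detach-sign validation_report.json   # writes validation_report.json.asc
gpg --verify validation_report.json.asc validation_report.json

# Sigstore (cosign, keyless): signature plus a short-lived certificate bound to an OIDC identity.
cosign sign-blob \
  --output-signature validation_report.json.sig \
  --output-certificate validation_report.json.pem \
  validation_report.json
cosign verify-blob \
  --signature validation_report.json.sig \
  --certificate validation_report.json.pem \
  --certificate-identity replicator@example.org \
  --certificate-oidc-issuer https://github.com/login/oauth \
  validation_report.json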

We don't need bureaucracy. We need integrity.


5. Step-by-step: a minimal replication loop

The minimal loop is:

  1. Download the original snapshot artifacts (snapshot_index.json, validation_pack_manifest.json, validation_report.json)
  2. Verify original signatures (if provided)
  3. Rerun the suite to produce your own validation pack
  4. Write your rerun snapshot_index.json
  5. Generate a replication_report.json comparing original vs rerun
  6. Sign and publish your replication bundle

By design, the loop is mostly file operations, not "trust our interpretation".
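
A compressed sketch of the loop with standard tools; step 3 is suite-specific (see the Public Benchmarks page), and the jq field names below are assumptions about the published schemas rather than documented keys:

# Steps 1-2: verify the original signature (if one was published), then check
# every downloaded artifact against the published hashes.
# Assumes snapshot_index.json entries carry "path" and "sha256" fields.
gpg --verify validation_report.json.asc validation_report.json
jq -r '.artifacts[] | "\(.sha256)  \(.path)"' snapshot_index.json | sha256sum -c -

# Steps 3-4: rerun the suite, produce your own validation pack, and write your
# rerun snapshot_index.json (as sketched in section 3).

# Step 5: a first-pass comparison before filling in replication_report.json;
# the "suites" field name is an assumption about validation_report_v1.
diff <(jq -S '.suites' original/validation_report.json) \
     <(jq -S '.suites' rerun/validation_report.json)

# Step 6: sign your replication bundle before publishing it.
gpg --armor --detach-sign replication_report.json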


6. What we will do with replications

Replications should not disappear into a comment thread. We plan to:

  • Link replications directly from the snapshot index
  • Use replicated numbers in public claims
  • Prefer "rerun me" evidence over "trust us" language

7. The ask

If you care about reproducible scientific computing, the most valuable contribution is:

  • Rerun a published snapshot on your hardware
  • Publish your manifest + raw results
  • Sign the report

That's how performance claims become community knowledge.