
Third-Party Replication: Signed Reports

Replication · Trust · Signed Reports · Benchmarks · GPG · Sigstore
Trust Offensive series: Index · Prev: Benchmark Snapshots · Next: HEP Benchmark Harness

If you've ever been burned by an "impressive benchmark", you already know the problem: benchmarks are not just measurements — they are claims.

And the only robust way to evaluate a claim is to replicate it.

Our public benchmark program treats third-party replication as a first-class feature, not a nice-to-have.

The canonical replication protocol is documented on the Public Benchmarks page.


Abstract. The strongest trust signal for a benchmark is not "more charts" or "more machines". It is an independent rerun. We operationalize replication as a publishable artifact set: a rerun snapshot_index.json (hashed artifact inventory), a machine-readable replication_report.json comparing original vs rerun, the rerun validation pack (including validation_report.json), and optional signatures for integrity and attribution.


1. Why replication is different from "more benchmarks"

We can publish more machines, more suites, more graphs — and still fail the trust test. Replication is qualitatively different because it adds:

  • Independent hardware
  • Independent operator errors (the realistic ones)
  • Independent scrutiny of the harness and assumptions

If a benchmark can't survive an external rerun, it shouldn't be used as evidence.


2. What we mean by "replication"

A replication is not "I ran something similar". At minimum it means:

  • Same suite definition
  • Same dataset IDs
  • Same harness version (or documented diffs)
  • The same correctness gates still pass
  • Raw outputs and a baseline manifest are published

The goal is that disagreements become diagnosable:

  • Environment differences (compiler, BLAS, GPU driver)
  • Dataset drift
  • Harness changes
  • Or a bug
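
When a rerun disagrees, the fastest way to localize the cause is to diff the environment sections of the two validation reports (described in the next section). A minimal sketch with jq, assuming the original and rerun packs sit under original/ and rerun/ and that validation_report.json exposes its environment under an environment key, which is an assumption about validation_report_v1 rather than a documented field:

# Diff the environment blocks of the two validation reports (key-sorted for stable output).
# The "environment" field name is an assumption about validation_report_v1.
jq -S '.environment' original/validation_report.json > env_original.json
jq -S '.environment' rerun/validation_report.json > env_rerun.json
diff -u env_original.json env_rerun.json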

3. The replication artifact set (what gets published)

Replication only works if outsiders can identify exactly what was compared. We publish two small, machine-readable "index" documents alongside the raw results:

  • snapshot_index.json (schema nextstat.snapshot_index.v1) — artifact paths + sizes + SHA-256 hashes
  • replication_report.json (schema nextstat.replication_report.v1) — structured comparison (overlap count + mismatches)

Additionally, each validation pack includes:

  • validation_report.json — dataset fingerprint + environment + suite pass/fail summary (schema validation_report_v1)
  • validation_pack_manifest.json — SHA-256 + sizes for core validation pack files

Validation pack entry point: Validation Report Artifacts.

nextstat validation-report \
  --apex2 tmp/apex2_master_report.json \
  --workspace workspace.json \
  --out validation_report.json \
  --pdf validation_report.pdf \
  --deterministic
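
On the rerun side, the snapshot index is just an inventory of paths, sizes, and hashes, so it can be produced with standard tools. A minimal sketch, assuming the rerun artifacts sit under results/; the JSON field names are illustrative assumptions, not the normative nextstat.snapshot_index.v1 layout:

# Build a snapshot_index.json-style inventory: path, size, and SHA-256 per artifact.
# Field names are illustrative; assumes GNU stat and file names without spaces or quotes.
(
  echo '{"schema":"nextstat.snapshot_index.v1","artifacts":['
  first=1
  for f in results/*; do
    [ "$first" -eq 1 ] || echo ','
    first=0
    printf '{"path":"%s","size_bytes":%s,"sha256":"%s"}' \
      "$f" "$(stat -c%s "$f")" "$(sha256sum "$f" | cut -d' ' -f1)"
  done
  echo ']}'
) | jq . > snapshot_index.json

Piping through jq at the end both validates the JSON and pretty-prints it, so a malformed index fails loudly instead of getting published.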

4. Why signed reports

If replication reports matter, they must be attributable and tamper-resistant. A signed report is a lightweight way to guarantee:

  • Who produced the report
  • What snapshot it refers to
  • That the published artifact hasn't been modified

The validation_report.json produced by nextstat validation-report already includes SHA-256 hashes for both the workspace and the Apex2 master report. Adding a GPG or Sigstore signature to that JSON creates a complete chain: data hash → validation result → signer identity.
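
Either toolchain can sign the JSON file directly. As a sketch: the GPG commands produce and verify a detached, armored signature, and the cosign commands use Sigstore keyless signing; the identity and OIDC issuer values are placeholders for whoever actually signs:

# GPG: detached, armored signature published next to the report.
gpg --armor --detach-sign validation_report.json   # writes validation_report.json.asc
gpg --verify validation_report.json.asc validation_report.json

# Sigstore (cosign, keyless): signature plus a short-lived certificate bound to an OIDC identity.
cosign sign-blob \
  --output-signature validation_report.json.sig \
  --output-certificate validation_report.json.pem \
  validation_report.json
cosign verify-blob \
  --signature validation_report.json.sig \
  --certificate validation_report.json.pem \
  --certificate-identity replicator@example.org \
  --certificate-oidc-issuer https://github.com/login/oauth \
  validation_report.json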

We don't need bureaucracy. We need integrity.


5. Step-by-step: a minimal replication loop

The minimal loop is:

  1. Download the original snapshot artifacts (snapshot_index.json, validation_pack_manifest.json, validation_report.json)
  2. Verify original signatures (if provided)
  3. Rerun the suite to produce your own validation pack
  4. Write your rerun snapshot_index.json
  5. Generate a replication_report.json comparing original vs rerun
  6. Sign and publish your replication bundle

By design, the loop is mostly file operations, not "trust our interpretation".
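
A compressed sketch of the loop with standard tools; step 3 is suite-specific (see the Public Benchmarks page), and the jq field names below are assumptions about the published schemas rather than documented keys:

# Steps 1-2: verify the original signature (if one was published), then check
# every downloaded artifact against the published hashes.
# Assumes snapshot_index.json entries carry "path" and "sha256" fields.
gpg --verify validation_report.json.asc validation_report.json
jq -r '.artifacts[] | "\(.sha256)  \(.path)"' snapshot_index.json | sha256sum -c -

# Steps 3-4: rerun the suite, produce your own validation pack, and write your
# rerun snapshot_index.json (as sketched in section 3).

# Step 5: a first-pass comparison before filling in replication_report.json;
# the "suites" field name is an assumption about validation_report_v1.
diff <(jq -S '.suites' original/validation_report.json) \
     <(jq -S '.suites' rerun/validation_report.json)

# Step 6: sign your replication bundle before publishing it.
gpg --armor --detach-sign replication_report.json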


6. What we will do with replications

Replications should not disappear into a comment thread. We plan to:

  • Link replications directly from the snapshot index
  • Use replicated numbers in public claims
  • Prefer "rerun me" evidence over "trust us" language

7. The ask

If you care about reproducible scientific computing, the most valuable contribution is:

  • Rerun a published snapshot on your hardware
  • Publish your manifest + raw results
  • Sign the report

That's how performance claims become community knowledge.