For years, performance “benchmarks” in scientific software were scripts: a notebook, an ad hoc dataset download, a chart, and a single number.
This is effective for exploration, but it is structurally weak evidence: it rarely produces immutable artifacts that can be rerun, diffed, audited, or replicated by outsiders.
That era is ending — not because scripts are bad, but because the cost of trust has gone up. When your result is a number that drives decisions (limits, discoveries, claims, or compute budgets), it's not enough to be fast. You need to be reproducibly fast.
Abstract. The unit of publication is shifting from a single number to a benchmark snapshot: an immutable artifact set with pinned environments, correctness gates, manifests, and raw distributions. That shift changes how scientific software is built: determinism becomes a mode, correctness becomes a gate, and CI becomes a publisher.
Canonical specification (protocol + artifacts): Public Benchmarks.
1. The scripting failure mode: performance without a protocol
Most benchmark screenshots look convincing but are incomplete. They rarely answer:
- What exact inputs were used (and are they hash-identifiable)?
- What correctness checks were performed before timing?
- Does the metric include compilation, kernel loading, cache population?
- What warmup policy was applied?
- What toolchain versions and lockfiles were active?
- What flags, modes, and determinism settings were enabled?
Without these, a benchmark is not an experiment; it is a story.
2. Benchmarks as experiments: protocol invariants
In NextStat, we treat performance as a scientific claim. A claim requires a protocol with explicit invariants:
- Task definition — what computation is executed
- Metric definition — what is measured and what is excluded
- Correctness gates — what “correct” means (with tolerances)
- Environment pinning — toolchains, lockfiles, runtime versions
- Repeat strategy — distributions + explicit aggregation policy
This is why our benchmark program enforces correctness gates before timing, pins toolchains and dependency locks, publishes raw results, and ships baseline manifests. The canonical spec: Public Benchmarks.
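To make that concrete, here is a minimal sketch of such a protocol record in Python. The field names are illustrative assumptions, not NextStat's actual schema; the point is that every invariant is an explicit, recorded value rather than an implicit default.

```python
# Hypothetical protocol record; field names are illustrative, not NextStat's schema.
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkProtocol:
    task: str                # task definition: what computation is executed
    metric: str              # metric definition: what is measured / what is excluded
    warmup_runs: int         # warmup policy applied before timing
    repeats: int             # repeat strategy: keep the raw distribution
    aggregation: str         # explicit aggregation policy, e.g. "median"
    correctness_rtol: float  # tolerance that defines "correct"
    dataset_sha256: str      # hash-identifiable inputs
    lockfile_sha256: str     # pinned environment
    flags: tuple = ()        # modes and determinism settings


protocol = BenchmarkProtocol(
    task="hep_nll_fit",
    metric="wall_time_fit_only",  # excludes compilation, kernel loading, cache warmup
    warmup_runs=3,
    repeats=30,
    aggregation="median",
    correctness_rtol=1e-8,
    dataset_sha256="<dataset digest>",
    lockfile_sha256="<lockfile digest>",
    flags=("--deterministic",),
)
```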
3. The new unit: benchmark snapshots (artifact sets)
The key move is to make a benchmark run produce a publishable artifact set, not just a number. At minimum, a snapshot includes:
- Raw per-test/per-repeat measurements — so variance and aggregation policy are visible
- Correctness gate results — so "fast" implies "correct under a contract"
- Baseline manifest — versions, hardware, dataset hashes, flags
- Index with file hashes — so outsiders can verify the bytes they downloaded
Concretely, a snapshot directory looks like:
```
snapshots/2026-02-01/
  baseline_manifest.json    # env + versions + dataset hashes
  hep_suite_result.json     # raw per-case timings
  correctness_report.json   # NLL parity gates
  snapshot_index.json       # SHA-256 index of all files
  README_snippet.md         # human-readable summary
```

In NextStat, these artifacts are not informal. The key contracts are schema-versioned:
- Validation report pack. `validation_report.json` with `schema_version = validation_report_v1`, produced by `nextstat validation-report`.
- Snapshot index. Schema `nextstat.snapshot_index.v1` inventories files with SHA-256 digests (sketched below).
- Replication report. Schema `nextstat.replication_report.v1` records digest mismatches for external reruns.
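As an illustration of the snapshot index contract above, here is a minimal Python sketch that inventories a snapshot directory with SHA-256 digests. The schema name comes from the spec, but the file layout and field names are assumptions of this sketch, not the actual `nextstat.snapshot_index.v1` definition.

```python
# Illustrative only: build a SHA-256 inventory of a snapshot directory.
# Field names and layout are assumptions; the schema name comes from the spec.
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(snapshot_dir: str) -> dict:
    root = Path(snapshot_dir)
    files = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name != "snapshot_index.json"  # the index excludes itself
    )
    return {
        "schema": "nextstat.snapshot_index.v1",
        "files": [
            {"path": str(p.relative_to(root)), "sha256": sha256_file(p)}
            for p in files
        ],
    }


index = build_index("snapshots/2026-02-01")
Path("snapshots/2026-02-01/snapshot_index.json").write_text(json.dumps(index, indent=2))
```

Anyone who downloads the snapshot can rebuild the same inventory and compare digests byte for byte.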
Validation pack entry point: Validation Report Artifacts.
```
nextstat validation-report --apex2 tmp/apex2_master_report.json --workspace workspace.json --out validation_report.json --pdf validation_report.pdf --deterministic
```

Full publishing contract: Benchmark Snapshots as Products.
4. Why this matters in HEP-like pipelines
HEP-style inference pipelines have a property that breaks naive benchmarking:
You can be "fast" by not doing the same inference.
If the likelihood is off by a small but systematic amount (wrong interpolation, wrong constraints, wrong masks), the benchmark number is meaningless — because the computation changed.
"End of scripting era" means:
- You don't benchmark without a reference check
- You don't publish numbers without artifacts
- You don't accept "it seems close" as a contract
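In practice the reference check can be a single guard that runs before any timing. A minimal sketch, assuming an NLL parity check with explicit tolerances (the names and values are illustrative, not NextStat's gate implementation):

```python
# Illustrative correctness gate; tolerances and names are assumptions.
import math


def nll_parity_gate(candidate_nll: float, reference_nll: float,
                    rtol: float = 1e-8, atol: float = 1e-10) -> None:
    """Refuse to time anything if the candidate does not reproduce the reference NLL."""
    if not math.isclose(candidate_nll, reference_nll, rel_tol=rtol, abs_tol=atol):
        raise AssertionError(
            f"NLL parity gate failed: {candidate_nll} vs {reference_nll} "
            f"(rtol={rtol}, atol={atol}) -- refusing to benchmark a different computation"
        )


# Only after the gate passes does the timing harness run.
nll_parity_gate(candidate_nll=1234.567890123, reference_nll=1234.567890123)
```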
5. The deeper shift: software becomes a system
The shift is not about Rust vs Python, or compiled vs interpreted. It's that software becomes a system with:
- Deterministic modes (for parity and debugging)
- Fast modes (for production)
- Explicit tolerances (so correctness is measurable)
- Automation (so the same harness runs every time)
That is what replaces "a script" as your source of truth.
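A hedged sketch of what those modes can look like as configuration (the names and values are assumptions, not NextStat's actual settings):

```python
# Hypothetical "mode" record; names and values are assumptions, not NextStat's config.
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMode:
    deterministic: bool  # fixed seeds, ordered reductions, stable math
    rtol: float          # explicit tolerance used by the correctness gates


PARITY = RunMode(deterministic=True, rtol=1e-12)      # for parity checks and debugging
PRODUCTION = RunMode(deterministic=False, rtol=1e-6)  # fast mode, looser but explicit
```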
6. What to expect from our public benchmark snapshots
When we publish benchmark snapshots, the goal is that you can:
- Rerun the same suite on your machine
- See the same correctness gates
- Compare results with full context
If the number differs, you should be able to answer why: hardware/driver/compiler differences, different datasets, different reference implementations, or a bug.
That is progress: disagreement becomes diagnosable.
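One way to make that diagnosis routine is to diff the baseline manifests before arguing about the numbers. A minimal sketch, assuming the manifest is a flat JSON object (the real layout may differ):

```python
# Illustrative sketch: diff two baseline manifests, assuming a flat JSON object.
import json
from pathlib import Path


def manifest_diff(ours: str, theirs: str) -> dict:
    """Return every manifest key whose value differs between the two runs."""
    a = json.loads(Path(ours).read_text())
    b = json.loads(Path(theirs).read_text())
    return {
        key: {"ours": a.get(key), "theirs": b.get(key)}
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# A typical diff surfaces the usual suspects: driver versions, dataset hashes, flags.
print(manifest_diff("ours/baseline_manifest.json", "theirs/baseline_manifest.json"))
```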
7. Replication is the endgame
If you want the strongest possible trust signal, you want an independent rerun with a publishable report — same suite definition, same dataset IDs, published hashes/manifests, and (optionally) signatures for tamper resistance.
In practice, the replication boundary is digest-based: external reruns compare the SHA-256 inventory in `snapshot_index.json` and record mismatches in a `nextstat.replication_report.v1` document.
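A minimal sketch of that comparison, assuming the snapshot index layout used earlier in this post (the real `nextstat.replication_report.v1` schema may differ):

```python
# Illustrative digest comparison; the real replication report layout may differ.
import json
from pathlib import Path


def replication_report(published_index: str, rerun_index: str) -> dict:
    """Compare two snapshot indexes and record every path whose digest differs."""
    pub = {f["path"]: f["sha256"]
           for f in json.loads(Path(published_index).read_text())["files"]}
    new = {f["path"]: f["sha256"]
           for f in json.loads(Path(rerun_index).read_text())["files"]}
    mismatches = [
        {"path": p, "published_sha256": pub.get(p), "rerun_sha256": new.get(p)}
        for p in sorted(set(pub) | set(new))
        if pub.get(p) != new.get(p)
    ]
    return {"schema": "nextstat.replication_report.v1", "mismatches": mismatches}
```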
Runbook: Third-Party Replication: Signed Reports.
8. The point of a "trust offensive"
Publishing benchmarks is not a marketing stunt. It's a commitment:
- To show our work
- To make it reproducible
- To invite replication
If we do this right, the conversation changes from:
"Are you really that fast?"
to:
"Here's the harness. Let's measure it together."
