For years, performance “benchmarks” in scientific software were scripts: a notebook, an ad hoc dataset download, a chart, and a single number.
This is effective for exploration, but it is structurally weak evidence: it rarely produces immutable artifacts that can be rerun, diffed, audited, or replicated by outsiders.
That era is ending — not because scripts are bad, but because the cost of trust has gone up. When your result is a number that drives decisions (limits, discoveries, claims, or compute budgets), it's not enough to be fast. You need to be reproducibly fast.
Abstract. The unit of publication is shifting from a single number to a benchmark snapshot: an immutable artifact set with pinned environments, correctness gates, manifests, and raw distributions. That shift changes how scientific software is built: determinism becomes a mode, correctness becomes a gate, and CI becomes a publisher.
Canonical specification (protocol + artifacts): Public Benchmarks.
1. The scripting failure mode: performance without a protocol
Most benchmark screenshots look convincing but are incomplete. They rarely answer:
- What exact inputs were used (and are they hash-identifiable)?
- What correctness checks were performed before timing?
- Does the metric include compilation, kernel loading, cache population?
- What warmup policy was applied?
- What toolchain versions and lockfiles were active?
- What flags, modes, and determinism settings were enabled?
Without these, a benchmark is not an experiment; it is a story.
2. Benchmarks as experiments: protocol invariants
In NextStat, we treat performance as a scientific claim. A claim requires a protocol with explicit invariants:
- Task definition — what computation is executed
- Metric definition — what is measured and what is excluded
- Correctness gates — what “correct” means (with tolerances)
- Environment pinning — toolchains, lockfiles, runtime versions
- Repeat strategy — distributions + explicit aggregation policy
This is why our benchmark program enforces correctness gates before timing, pins toolchains and dependency locks, publishes raw results, and ships baseline manifests. The canonical spec: Public Benchmarks.
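To make that concrete, here is a minimal sketch of such a protocol record in Python. The field names are illustrative assumptions, not NextStat's actual schema; the point is that every invariant is an explicit, recorded value rather than an implicit default.

```python
# Hypothetical protocol record; field names are illustrative, not NextStat's schema.
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkProtocol:
    task: str                # task definition: what computation is executed
    metric: str              # metric definition: what is measured / what is excluded
    warmup_runs: int         # warmup policy applied before timing
    repeats: int             # repeat strategy: keep the raw distribution
    aggregation: str         # explicit aggregation policy, e.g. "median"
    correctness_rtol: float  # tolerance that defines "correct"
    dataset_sha256: str      # hash-identifiable inputs
    lockfile_sha256: str     # pinned environment
    flags: tuple = ()        # modes and determinism settings


protocol = BenchmarkProtocol(
    task="hep_nll_fit",
    metric="wall_time_fit_only",  # excludes compilation, kernel loading, cache warmup
    warmup_runs=3,
    repeats=30,
    aggregation="median",
    correctness_rtol=1e-8,
    dataset_sha256="<dataset digest>",
    lockfile_sha256="<lockfile digest>",
    flags=("--deterministic",),
)
```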
3. The new unit: benchmark snapshots (artifact sets)
The key move is to make a benchmark run produce a publishable artifact set, not just a number. At minimum, a snapshot includes:
- Raw per-test/per-repeat measurements — so variance and aggregation policy are visible
- Correctness gate results — so "fast" implies "correct under a contract"
- Baseline manifest — versions, hardware, dataset hashes, flags
- Index with file hashes — so outsiders can verify the bytes they downloaded
Concretely, a snapshot directory looks like:
```
snapshots/2026-02-01/
  baseline_manifest.json    # env + versions + dataset hashes
  hep_suite_result.json     # raw per-case timings
  correctness_report.json   # NLL parity gates
  snapshot_index.json       # SHA-256 index of all files
  README_snippet.md         # human-readable summary
```

In NextStat, these artifacts are not informal. The key contracts are schema-versioned:
- Validation report pack. `validation_report.json` with `schema_version = validation_report_v1`, produced by `nextstat validation-report`.
- Snapshot index. Schema `nextstat.snapshot_index.v1` inventories files with SHA-256 digests (sketched below).
- Replication report. Schema `nextstat.replication_report.v1` records digest mismatches for external reruns.
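As an illustration of the snapshot index contract above, here is a minimal Python sketch that inventories a snapshot directory with SHA-256 digests. The schema name comes from the spec, but the file layout and field names are assumptions of this sketch, not the actual `nextstat.snapshot_index.v1` definition.

```python
# Illustrative only: build a SHA-256 inventory of a snapshot directory.
# Field names and layout are assumptions; the schema name comes from the spec.
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(snapshot_dir: str) -> dict:
    root = Path(snapshot_dir)
    files = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name != "snapshot_index.json"  # the index excludes itself
    )
    return {
        "schema": "nextstat.snapshot_index.v1",
        "files": [
            {"path": str(p.relative_to(root)), "sha256": sha256_file(p)}
            for p in files
        ],
    }


index = build_index("snapshots/2026-02-01")
Path("snapshots/2026-02-01/snapshot_index.json").write_text(json.dumps(index, indent=2))
```

Anyone who downloads the snapshot can rebuild the same inventory and compare digests byte for byte.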
Validation pack entry point: Validation Report Artifacts.
```
nextstat validation-report --apex2 tmp/apex2_master_report.json --workspace workspace.json --out validation_report.json --pdf validation_report.pdf --deterministic
```

Full publishing contract: Benchmark Snapshots as Products.
4. Why this matters in HEP-like pipelines
HEP-style inference pipelines have a property that breaks naive benchmarking:
You can be "fast" by not doing the same inference.
If the likelihood is off by a small but systematic amount (wrong interpolation, wrong constraints, wrong masks), the benchmark number is meaningless — because the computation changed.
"End of scripting era" means:
- You don't benchmark without a reference check
- You don't publish numbers without artifacts
- You don't accept "it seems close" as a contract
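In practice the reference check can be a single guard that runs before any timing. A minimal sketch, assuming an NLL parity check with explicit tolerances (the names and values are illustrative, not NextStat's gate implementation):

```python
# Illustrative correctness gate; tolerances and names are assumptions.
import math


def nll_parity_gate(candidate_nll: float, reference_nll: float,
                    rtol: float = 1e-8, atol: float = 1e-10) -> None:
    """Refuse to time anything if the candidate does not reproduce the reference NLL."""
    if not math.isclose(candidate_nll, reference_nll, rel_tol=rtol, abs_tol=atol):
        raise AssertionError(
            f"NLL parity gate failed: {candidate_nll} vs {reference_nll} "
            f"(rtol={rtol}, atol={atol}) -- refusing to benchmark a different computation"
        )


# Only after the gate passes does the timing harness run.
nll_parity_gate(candidate_nll=1234.567890123, reference_nll=1234.567890123)
```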
5. The deeper shift: software becomes a system
The shift is not about Rust vs Python, or compiled vs interpreted. It's that software becomes a system with:
- Deterministic modes (for parity and debugging)
- Fast modes (for production)
- Explicit tolerances (so correctness is measurable)
- Automation (so the same harness runs every time)
That is what replaces "a script" as your source of truth.
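A hedged sketch of what those modes can look like as configuration (the names and values are assumptions, not NextStat's actual settings):

```python
# Hypothetical "mode" record; names and values are assumptions, not NextStat's config.
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMode:
    deterministic: bool  # fixed seeds, ordered reductions, stable math
    rtol: float          # explicit tolerance used by the correctness gates


PARITY = RunMode(deterministic=True, rtol=1e-12)      # for parity checks and debugging
PRODUCTION = RunMode(deterministic=False, rtol=1e-6)  # fast mode, looser but explicit
```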
6. What to expect from our public benchmark snapshots
When we publish benchmark snapshots, the goal is that you can:
- Rerun the same suite on your machine
- See the same correctness gates
- Compare results with full context
If the number differs, you should be able to answer why: hardware/driver/compiler differences, different datasets, different reference implementations, or a bug.
That is progress: disagreement becomes diagnosable.
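One way to make that diagnosis routine is to diff the baseline manifests before arguing about the numbers. A minimal sketch, assuming the manifest is a flat JSON object (the real layout may differ):

```python
# Illustrative sketch: diff two baseline manifests, assuming a flat JSON object.
import json
from pathlib import Path


def manifest_diff(ours: str, theirs: str) -> dict:
    """Return every manifest key whose value differs between the two runs."""
    a = json.loads(Path(ours).read_text())
    b = json.loads(Path(theirs).read_text())
    return {
        key: {"ours": a.get(key), "theirs": b.get(key)}
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# A typical diff surfaces the usual suspects: driver versions, dataset hashes, flags.
print(manifest_diff("ours/baseline_manifest.json", "theirs/baseline_manifest.json"))
```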
7. Replication is the endgame
If you want the strongest possible trust signal, you want an independent rerun with a publishable report — same suite definition, same dataset IDs, published hashes/manifests, and (optionally) signatures for tamper resistance.
In practice, the replication boundary is digest-based: external reruns compare the SHA-256 inventory in `snapshot_index.json` and record mismatches in a `nextstat.replication_report.v1` document.
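A minimal sketch of that comparison, assuming the snapshot index layout used earlier in this post (the real `nextstat.replication_report.v1` schema may differ):

```python
# Illustrative digest comparison; the real replication report layout may differ.
import json
from pathlib import Path


def replication_report(published_index: str, rerun_index: str) -> dict:
    """Compare two snapshot indexes and record every path whose digest differs."""
    pub = {f["path"]: f["sha256"]
           for f in json.loads(Path(published_index).read_text())["files"]}
    new = {f["path"]: f["sha256"]
           for f in json.loads(Path(rerun_index).read_text())["files"]}
    mismatches = [
        {"path": p, "published_sha256": pub.get(p), "rerun_sha256": new.get(p)}
        for p in sorted(set(pub) | set(new))
        if pub.get(p) != new.get(p)
    ]
    return {"schema": "nextstat.replication_report.v1", "mismatches": mismatches}
```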
Runbook: Third-Party Replication: Signed Reports.
8. The point of a "trust offensive"
Publishing benchmarks is not a marketing stunt. It's a commitment:
- To show our work
- To make it reproducible
- To invite replication
If we do this right, the conversation changes from:
"Are you really that fast?"
to:
"Here's the harness. Let's measure it together."
