Pharma Benchmarks: PK and NLME
Without Benchmark Theater — objective definitions, stopping rules, scaling protocols, correctness gates.
2026-02-08 · 8 min read
Pharmacometrics benchmarks are deceptively easy to do wrong. Two fitters can both return "a result" while measuring fundamentally different things: different objectives (MAP vs marginal likelihood vs FOCE-style approximations), different stopping rules and tolerances, different parameterizations, different handling of censoring / LLOQ.
Protocol and artifacts: Public Benchmarks. Validation pack: Validation Report. Suite runbook (repo path): docs/benchmarks/suites/pharma.md.
Abstract. We treat performance as evidence, not a screenshot. For PK/NLME that means: defining the objective precisely (likelihood, constraints, censoring policy), defining the fit protocol (stopping rule, bounds, initialization), publishing correctness gates (analytic checks + recovery on synthetic data), and publishing raw measurements + manifests so outsiders can rerun.
1. Threat model: how pharma benchmarks lie
Common failure modes that invalidate comparisons:
- Objective mismatch — MAP vs marginal likelihood vs FOCE/Laplace approximations
- Solver mismatch — different ODE solvers, tolerances, or step controls
- Parameterization mismatch — log-space vs linear, constrained vs unconstrained
- Censoring policy mismatch — LLOQ, censored likelihood vs imputation vs drop
- Convergence mismatch — different tolerances, line search rules, max-iter caps
- Dataset handling mismatch — preprocessing drift, unit conventions, time grids
If we can't align these, we don't call it a benchmark comparison — we call it two different experiments.
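A practical way to make alignment checkable is to serialize these choices into a small protocol manifest and refuse to compare runs whose manifests differ. A minimal sketch, with hypothetical field names rather than the actual NextStat schema:

# Illustrative protocol manifest: pins the choices that must match before two fitters
# are compared. Field names and values are hypothetical, not the published schema.
PROTOCOL = {
    "objective": "joint_map",                      # vs "marginal_laplace", "foce", ...
    "parameterization": "log",                     # log-space, unconstrained
    "censoring": {"policy": "censored_likelihood", "lloq": 0.05},
    "ode": {"solver": "analytic_1cpt", "rtol": 1e-8, "atol": 1e-10},
    "stopping": {"mode": "convergence", "gtol": 1e-6, "max_iter": 500},
    "units": {"time": "h", "conc": "mg/L", "dose": "mg"},
}

def aligned(a: dict, b: dict) -> bool:
    """Two runs are comparable only if their protocol manifests are identical."""
    return a == b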
2. What we benchmark
- NLL and gradient evaluation (time/call); an illustrative timing sketch follows this list
- Fit wall-time under an explicit protocol
- Scaling laws with subject count and observations per subject
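The time/call measurement itself should be boring and fully declared: warmup calls excluded, repeat count recorded, median reported alongside the raw samples. A minimal sketch, assuming an nll(params) callable exposed by the model under test (a hypothetical interface, not the real API):

import statistics
import time

def time_per_call(fn, params, warmup=5, repeats=50):
    """Median wall-time per call, with raw samples kept for the published artifact."""
    for _ in range(warmup):                      # warmup calls excluded from the samples
        fn(params)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(params)
        samples.append(time.perf_counter() - t0)
    return {"median_s": statistics.median(samples), "repeats": repeats, "samples_s": samples}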
3. Correctness gates: analytic checks + recovery before timing
The Apex2 pharma reference runner produces deterministic, machine-readable evidence for:
- PK analytic correctness (closed-form 1-compartment oral dosing vs predict())
- PK fit recovery on deterministic synthetic data
- NLME smoke sanity (finite NLL/grad + fit improves NLL on synthetic multi-subject data)
PYTHONPATH=bindings/ns-py/python ./.venv/bin/python \
tests/apex2_pharma_reference_report.py \
--deterministic \
--out tmp/apex2_pharma_reference_report.json

This report is included in the Apex2 master report and the validation pack produced by make validation-pack.
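The analytic PK gate leans on the standard closed-form solution for a 1-compartment model with first-order absorption: C(t) = F*D*ka / (V*(ka - ke)) * (exp(-ke*t) - exp(-ka*t)) for ka != ke. A minimal sketch of that kind of check, with a hypothetical predict(t) callable standing in for the real API:

import math

def conc_1cpt_oral(t, dose, F, ka, ke, V):
    """Closed-form concentration for a 1-compartment oral model, first-order absorption (ka != ke)."""
    return (F * dose * ka) / (V * (ka - ke)) * (math.exp(-ke * t) - math.exp(-ka * t))

def analytic_gate(predict, params, times, rtol=1e-6):
    """Fail loudly if model predictions drift from the closed form."""
    for t in times:
        expected = conc_1cpt_oral(t, **params)
        got = predict(t)                          # hypothetical model interface
        assert math.isclose(got, expected, rel_tol=rtol, abs_tol=1e-12), (t, got, expected)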
4. The core pitfall: "time to convergence" is not well-defined
Optimization time depends on stopping criteria, bound constraints, line search policies, and parameterization. Any benchmark that reports "fit time" without a protocol is not evidence.
Our rule: publish fixed-iteration protocols and/or convergence protocols with explicit tolerances, plus evaluation counts and final objective values.
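Concretely, that means the published artifact records the protocol and the outcome together, so a wall-time number can never be read without its stopping rule. A sketch of what such a record might contain (illustrative field names and placeholder values, not the published schema):

fit_record = {
    "protocol": {
        "mode": "convergence",            # or "fixed_iter"
        "gtol": 1e-6,
        "max_iter": 500,
        "parameterization": "log-space, unconstrained",
        "init": "declared starting values, seed 0",
    },
    "result": {
        "wall_time_s": None,              # filled from the run; meaningless without the protocol block
        "n_obj_evals": None,
        "n_grad_evals": None,
        "final_nll": None,
        "converged": None,
    },
}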
5. Baseline models
- Individual PK — 1-compartment oral model with first-order absorption
- NLME baseline — population parameters + independent log-normal random effects (diagonal Omega), joint MAP fit
This matters because "NLME" can mean many different approximations in production tools; benchmarks must compare like with like.
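To pin down what "joint MAP with diagonal Omega" means here: the objective is each subject's residual data likelihood plus a Gaussian penalty on that subject's log-scale random effects, minimized jointly over the population parameters and all per-subject etas. A schematic NLL, assuming additive Gaussian residual error and one eta per population parameter (an illustration, not the production implementation):

import numpy as np

def joint_map_nll(theta_pop_log, etas, omega_diag, sigma, subjects, predict):
    """Joint MAP objective: Gaussian residual likelihood + log-normal random-effect penalty.

    theta_pop_log : log of the population parameters
    etas          : (n_subjects, n_params) per-subject random effects
    omega_diag    : diagonal of Omega (random-effect variances)
    predict(theta_i, subject) -> model predictions for that subject's observations
    """
    nll = 0.0
    for i, subj in enumerate(subjects):
        theta_i = np.exp(theta_pop_log + etas[i])                    # log-normal individual parameters
        resid = subj["y"] - predict(theta_i, subj)
        nll += 0.5 * np.sum(resid**2 / sigma**2 + np.log(2 * np.pi * sigma**2))
        nll += 0.5 * np.sum(etas[i] ** 2 / omega_diag + np.log(2 * np.pi * omega_diag))
    return nll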
6. Dataset plan: synthetic first, open datasets when possible
Every published run must include:
- Dataset ID + hash
- Generation parameters (for synthetic)
- Preprocessing protocol (for real data)
- Exact model configuration (including LLOQ policy)
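A run manifest covering this list can be as small as a dictionary serialized next to the results. The field names below are illustrative, not the published schema:

run_manifest = {
    "dataset_id": "pharma_pk_1c_oral_synth",          # illustrative ID
    "dataset_sha256": "<hash of the dataset spec>",
    "generation": {"seed": 0, "n_subjects": 100, "obs_per_subject": 8},
    "preprocessing": None,                            # or the declared protocol for real data
    "model": {"structure": "1cpt_oral", "lloq": 0.05, "lloq_policy": "censored_likelihood"},
}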
7. Seed harness (public benchmarks skeleton)
For public snapshots we ship a minimal rerunnable seed harness:
# Single-case run
python benchmarks/nextstat-public-benchmarks/suites/pharma/run.py \
--deterministic \
--out benchmarks/nextstat-public-benchmarks/out/pharma_pk_1c_oral.json
# Suite runner (multiple generated cases)
python benchmarks/nextstat-public-benchmarks/suites/pharma/suite.py \
--deterministic \
--out-dir benchmarks/nextstat-public-benchmarks/out/pharma

Each generated dataset carries a stable dataset ID and a SHA-256 hash of the dataset spec.
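One way to compute such a hash is over a canonical serialization of the spec, so semantically identical specs always hash the same. A sketch; the shipped harness may canonicalize differently:

import hashlib
import json

def dataset_spec_hash(spec: dict) -> str:
    """SHA-256 over a canonical JSON serialization of the dataset spec."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()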
This seed is NextStat-only today. Baseline templates (e.g. nlmixr2, Torsten) exist in the repo but are treated as follow-up work until their full environments are pinned reproducibly.
Published JSON artifact contracts (Pharma suite):
- Per-case results: nextstat.pharma_benchmark_result.v1
- Suite index: nextstat.pharma_benchmark_suite_result.v1
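Downstream consumers can gate on these contract identifiers before trusting any other fields. A minimal loader sketch; only the two schema strings above come from the suite, and the field name holding them is an assumption:

import json

EXPECTED_SCHEMAS = {
    "nextstat.pharma_benchmark_result.v1",
    "nextstat.pharma_benchmark_suite_result.v1",
}

def load_artifact(path: str) -> dict:
    with open(path, "r", encoding="utf-8") as f:
        artifact = json.load(f)
    schema = artifact.get("schema")                   # assumed field name for the contract string
    if schema not in EXPECTED_SCHEMAS:
        raise ValueError(f"unexpected artifact schema: {schema!r}")
    return artifact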
8. Metrics we will publish
- NLL time/call (and gradient time/call if applicable)
- Fit wall-time under the declared protocol
- Scaling curves: subjects → runtime, observations/subject → runtime, random-effects dimension → runtime (an illustrative summary sketch follows this list)
- Recovery error on synthetic data (trust gate, not a speed metric)
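Scaling curves are published as raw (size, runtime) points; any summary exponent is derived from them, for example a least-squares slope in log-log space, where a slope near 1.0 indicates roughly linear scaling. A sketch:

import math

def scaling_exponent(sizes, runtimes):
    """Least-squares slope of log(runtime) vs log(size); ~1.0 means roughly linear scaling."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in runtimes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx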
9. Why this belongs in the public benchmark program
PK/NLME is exactly the kind of domain where a "fast result" can be wrong or incomparable, and reproducibility is non-negotiable. We treat benchmarks as artifacts with pinned environments, correctness gates, raw result publishing, and external reruns.
