Pharma Benchmarks: PK and NLME
Without Benchmark Theater — objective definitions, stopping rules, scaling protocols, correctness gates.
2026-02-08 · 8 min read
Pharmacometrics benchmarks are deceptively easy to do wrong. Two fitters can both return "a result" while measuring fundamentally different things: different objectives (MAP vs marginal likelihood vs FOCE-style approximations), different stopping rules and tolerances, different parameterizations, different handling of censoring / LLOQ.
Protocol and artifacts: Public Benchmarks. Validation pack: Validation Report. Suite runbook (repo path): docs/benchmarks/suites/pharma.md.
Abstract. We treat performance as evidence, not a screenshot. For PK/NLME that means: defining the objective precisely (likelihood, constraints, censoring policy), defining the fit protocol (stopping rule, bounds, initialization), publishing correctness gates (analytic checks + recovery on synthetic data), and publishing raw measurements + manifests so outsiders can rerun.
1. Threat model: how pharma benchmarks lie
Common failure modes that invalidate comparisons:
- Objective mismatch — MAP vs marginal likelihood vs FOCE/Laplace approximations
- Solver mismatch — different ODE solvers, tolerances, or step controls
- Parameterization mismatch — log-space vs linear, constrained vs unconstrained
- Censoring policy mismatch — LLOQ, censored likelihood vs imputation vs drop
- Convergence mismatch — different tolerances, line search rules, max-iter caps
- Dataset handling mismatch — preprocessing drift, unit conventions, time grids
If we can't align these, we don't call it a benchmark comparison — we call it two different experiments.
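A practical way to make alignment checkable is to serialize these choices into a small protocol manifest and refuse to compare runs whose manifests differ. A minimal sketch, with hypothetical field names rather than the actual NextStat schema:

# Illustrative protocol manifest: pins the choices that must match before two fitters
# are compared. Field names and values are hypothetical, not the published schema.
PROTOCOL = {
    "objective": "joint_map",                      # vs "marginal_laplace", "foce", ...
    "parameterization": "log",                     # log-space, unconstrained
    "censoring": {"policy": "censored_likelihood", "lloq": 0.05},
    "ode": {"solver": "analytic_1cpt", "rtol": 1e-8, "atol": 1e-10},
    "stopping": {"mode": "convergence", "gtol": 1e-6, "max_iter": 500},
    "units": {"time": "h", "conc": "mg/L", "dose": "mg"},
}

def aligned(a: dict, b: dict) -> bool:
    """Two runs are comparable only if their protocol manifests are identical."""
    return a == b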
2. What we benchmark
- NLL and gradient evaluation (time/call); an illustrative timing sketch follows this list
- Fit wall-time under an explicit protocol
- Scaling laws with subject count and observations per subject
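The time/call measurement itself should be boring and fully declared: warmup calls excluded, repeat count recorded, median reported alongside the raw samples. A minimal sketch, assuming an nll(params) callable exposed by the model under test (a hypothetical interface, not the real API):

import statistics
import time

def time_per_call(fn, params, warmup=5, repeats=50):
    """Median wall-time per call, with raw samples kept for the published artifact."""
    for _ in range(warmup):                      # warmup calls excluded from the samples
        fn(params)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(params)
        samples.append(time.perf_counter() - t0)
    return {"median_s": statistics.median(samples), "repeats": repeats, "samples_s": samples}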
3. Correctness gates: analytic checks + recovery before timing
The Apex2 pharma reference runner produces deterministic, machine-readable evidence for:
- PK analytic correctness (closed-form 1-compartment oral dosing vs predict())
- PK fit recovery on deterministic synthetic data
- NLME smoke sanity (finite NLL/grad + fit improves NLL on synthetic multi-subject data)
PYTHONPATH=bindings/ns-py/python ./.venv/bin/python \
tests/apex2_pharma_reference_report.py \
--deterministic \
--out tmp/apex2_pharma_reference_report.json

This report is included in the Apex2 master report and the validation pack produced by make validation-pack.
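The analytic PK gate leans on the standard closed-form solution for a 1-compartment model with first-order absorption: C(t) = F*D*ka / (V*(ka - ke)) * (exp(-ke*t) - exp(-ka*t)) for ka != ke. A minimal sketch of that kind of check, with a hypothetical predict(t) callable standing in for the real API:

import math

def conc_1cpt_oral(t, dose, F, ka, ke, V):
    """Closed-form concentration for a 1-compartment oral model, first-order absorption (ka != ke)."""
    return (F * dose * ka) / (V * (ka - ke)) * (math.exp(-ke * t) - math.exp(-ka * t))

def analytic_gate(predict, params, times, rtol=1e-6):
    """Fail loudly if model predictions drift from the closed form."""
    for t in times:
        expected = conc_1cpt_oral(t, **params)
        got = predict(t)                          # hypothetical model interface
        assert math.isclose(got, expected, rel_tol=rtol, abs_tol=1e-12), (t, got, expected)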
4. The core pitfall: "time to convergence" is not well-defined
Optimization time depends on stopping criteria, bound constraints, line search policies, and parameterization. Any benchmark that reports "fit time" without a protocol is not evidence.
Our rule: publish fixed-iteration protocols and/or convergence protocols with explicit tolerances, plus evaluation counts and final objective values.
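Concretely, that means the published artifact records the protocol and the outcome together, so a wall-time number can never be read without its stopping rule. A sketch of what such a record might contain (illustrative field names and placeholder values, not the published schema):

fit_record = {
    "protocol": {
        "mode": "convergence",            # or "fixed_iter"
        "gtol": 1e-6,
        "max_iter": 500,
        "parameterization": "log-space, unconstrained",
        "init": "declared starting values, seed 0",
    },
    "result": {
        "wall_time_s": None,              # filled from the run; meaningless without the protocol block
        "n_obj_evals": None,
        "n_grad_evals": None,
        "final_nll": None,
        "converged": None,
    },
}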
5. Baseline models
- Individual PK — 1-compartment oral model with first-order absorption
- NLME baseline — population parameters + independent log-normal random effects (diagonal Omega), joint MAP fit
This matters because "NLME" can mean many different approximations in production tools; benchmarks must compare like with like.
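To pin down what "joint MAP with diagonal Omega" means here: the objective is each subject's residual data likelihood plus a Gaussian penalty on that subject's log-scale random effects, minimized jointly over the population parameters and all per-subject etas. A schematic NLL, assuming additive Gaussian residual error and one eta per population parameter (an illustration, not the production implementation):

import numpy as np

def joint_map_nll(theta_pop_log, etas, omega_diag, sigma, subjects, predict):
    """Joint MAP objective: Gaussian residual likelihood + log-normal random-effect penalty.

    theta_pop_log : log of the population parameters
    etas          : (n_subjects, n_params) per-subject random effects
    omega_diag    : diagonal of Omega (random-effect variances)
    predict(theta_i, subject) -> model predictions for that subject's observations
    """
    nll = 0.0
    for i, subj in enumerate(subjects):
        theta_i = np.exp(theta_pop_log + etas[i])                    # log-normal individual parameters
        resid = subj["y"] - predict(theta_i, subj)
        nll += 0.5 * np.sum(resid**2 / sigma**2 + np.log(2 * np.pi * sigma**2))
        nll += 0.5 * np.sum(etas[i] ** 2 / omega_diag + np.log(2 * np.pi * omega_diag))
    return nll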
6. Dataset plan: synthetic first, open datasets when possible
Every published run must include:
- Dataset ID + hash
- Generation parameters (for synthetic)
- Preprocessing protocol (for real data)
- Exact model configuration (including LLOQ policy)
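A run manifest covering this list can be as small as a dictionary serialized next to the results. The field names below are illustrative, not the published schema:

run_manifest = {
    "dataset_id": "pharma_pk_1c_oral_synth",          # illustrative ID
    "dataset_sha256": "<hash of the dataset spec>",
    "generation": {"seed": 0, "n_subjects": 100, "obs_per_subject": 8},
    "preprocessing": None,                            # or the declared protocol for real data
    "model": {"structure": "1cpt_oral", "lloq": 0.05, "lloq_policy": "censored_likelihood"},
}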
7. Seed harness (public benchmarks skeleton)
For public snapshots we ship a minimal rerunnable seed harness:
# Single-case run
python benchmarks/nextstat-public-benchmarks/suites/pharma/run.py \
--deterministic \
--out benchmarks/nextstat-public-benchmarks/out/pharma_pk_1c_oral.json
# Suite runner (multiple generated cases)
python benchmarks/nextstat-public-benchmarks/suites/pharma/suite.py \
--deterministic \
--out-dir benchmarks/nextstat-public-benchmarks/out/pharma

Each generated dataset carries a stable dataset ID and a SHA-256 hash of the dataset spec.
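One way to compute such a hash is over a canonical serialization of the spec, so semantically identical specs always hash the same. A sketch; the shipped harness may canonicalize differently:

import hashlib
import json

def dataset_spec_hash(spec: dict) -> str:
    """SHA-256 over a canonical JSON serialization of the dataset spec."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()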
This seed is NextStat-only today. Baseline templates (e.g. nlmixr2, Torsten) exist in the repo but are treated as follow-up work until their full environments are pinned reproducibly.
Published JSON artifact contracts (Pharma suite):
- Per-case results: nextstat.pharma_benchmark_result.v1
- Suite index: nextstat.pharma_benchmark_suite_result.v1
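Downstream consumers can gate on these contract identifiers before trusting any other fields. A minimal loader sketch; only the two schema strings above come from the suite, and the field name holding them is an assumption:

import json

EXPECTED_SCHEMAS = {
    "nextstat.pharma_benchmark_result.v1",
    "nextstat.pharma_benchmark_suite_result.v1",
}

def load_artifact(path: str) -> dict:
    with open(path, "r", encoding="utf-8") as f:
        artifact = json.load(f)
    schema = artifact.get("schema")                   # assumed field name for the contract string
    if schema not in EXPECTED_SCHEMAS:
        raise ValueError(f"unexpected artifact schema: {schema!r}")
    return artifact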
8. Metrics we will publish
- NLL time/call (and gradient time/call if applicable)
- Fit wall-time under the declared protocol
- Scaling curves: subjects → runtime, observations/subject → runtime, random-effects dimension → runtime (an illustrative summary sketch follows this list)
- Recovery error on synthetic data (trust gate, not a speed metric)
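Scaling curves are published as raw (size, runtime) points; any summary exponent is derived from them, for example a least-squares slope in log-log space, where a slope near 1.0 indicates roughly linear scaling. A sketch:

import math

def scaling_exponent(sizes, runtimes):
    """Least-squares slope of log(runtime) vs log(size); ~1.0 means roughly linear scaling."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in runtimes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx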
9. Why this belongs in the public benchmark program
PK/NLME is exactly the kind of domain where a "fast result" can be wrong or incomparable, and reproducibility is non-negotiable. We treat benchmarks as artifacts with pinned environments, correctness gates, raw result publishing, and external reruns.
