Arrow / Polars Integration
Zero-Copy Columnar Data Interchange
NextStat speaks Apache Arrow natively. Ingest histogram data from PyArrow, Polars, DuckDB, or Spark — and export model results back — with zero Python-side deserialization overhead. The bridge uses Arrow IPC under the hood, backed by Rust's arrow 57.3 and parquet 57.3 crates.
Data Flow
```
PyArrow Table / Polars DataFrame / DuckDB Result
        │
        ▼  .to_arrow() or native
Arrow RecordBatch
        │
        ▼  IPC serialize (~1 memcpy)
nextstat.from_arrow(table) ──► Rust arrow crate
        │                            │
        ▼                            ▼
HistFactoryModel               Arrow RecordBatch
        │                            │
        ▼                            ▼  IPC deserialize
nextstat.to_arrow(model)       PyArrow Table
```

Quick Start
```python
import pyarrow as pa
import nextstat

# Define histogram data as an Arrow table
table = pa.table({
    "channel": ["SR", "SR", "CR"],
    "sample": ["signal", "background", "background"],
    "yields": [[5., 10., 15.], [100., 200., 150.], [500., 600.]],
    "stat_error": [[1., 2., 3.], [10., 14., 12.], None],
})

# Create model and fit
model = nextstat.from_arrow(table, poi="mu")
result = nextstat.fit(model)
print(result)
```

Table Schema
The input Arrow table must follow this schema. Each row represents one (channel, sample) pair.
| Column | Arrow Type | Required | Description |
|---|---|---|---|
| channel | Utf8 | yes | Channel (region) name |
| sample | Utf8 | yes | Sample (process) name |
| yields | List<Float64> | yes | Expected event counts per bin |
| stat_error | List<Float64> | no | Per-bin statistical uncertainties |
Polars
```python
import polars as pl
import nextstat

# Read histogram data from Parquet via Polars
df = pl.read_parquet("histograms.parquet")
model = nextstat.from_arrow(df.to_arrow(), poi="mu")

# Or read Parquet directly (Rust-native, no Python overhead)
model = nextstat.from_parquet("histograms.parquet", poi="mu")
```

DuckDB
```python
import duckdb
import nextstat

con = duckdb.connect()
table = con.sql("""
    SELECT channel, sample, yields
    FROM 'histograms.parquet'
""").arrow()
model = nextstat.from_arrow(table)
```

Export
Export model data back to Arrow for downstream analysis, dashboards, or ML pipelines.
```python
model = nextstat.from_pyhf(workspace_json)

# Expected yields per channel
yields = nextstat.to_arrow(model, what="yields")
print(yields.to_pandas())
#   channel sample                 yields
# 0      CR  total         [500.0, 600.0]
# 1      SR  total  [105.0, 210.0, 165.0]

# Parameter metadata
params = nextstat.to_arrow(model, what="params")
print(params.to_pandas())
#               name  index  value  bound_lo  bound_hi  init
# 0               mu      0    1.0       0.0      10.0   1.0
# 1  staterror_SR[0]      1    1.0     1e-10      10.0   1.0
```

Custom Observations
```python
# By default, Asimov data (sum of yields) is used.
# Pass observed data explicitly:
model = nextstat.from_arrow(
    table,
    poi="mu",
    observations={
        "SR": [110., 215., 170.],
        "CR": [510., 590.],
    },
)
```

Low-Level IPC API
For maximum control, use the raw IPC bytes interface directly. This is what from_arrow() and to_arrow() use internally.
```python
import pyarrow as pa

# Serialize table to IPC bytes
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, table.schema)
for batch in table.to_batches():
    writer.write_batch(batch)
writer.close()
ipc_bytes = sink.getvalue().to_pybytes()

# Ingest IPC bytes directly
model = nextstat.from_arrow_ipc(ipc_bytes, poi="mu")

# Export as IPC bytes
yields_bytes = nextstat.to_arrow_yields_ipc(model)
params_bytes = nextstat.to_arrow_params_ipc(model)

# Deserialize in Python
yields_table = pa.ipc.open_stream(yields_bytes).read_all()
```

API Reference
- `nextstat.from_arrow(table, *, poi, observations)` — PyArrow Table/RecordBatch → HistFactoryModel.
- `nextstat.to_arrow(model, *, params, what)` — HistFactoryModel → PyArrow Table. `what="yields"` or `"params"`.
- `nextstat.from_parquet(path, *, poi, observations)` — Parquet file → HistFactoryModel (Rust-native reader).
- `nextstat.from_arrow_ipc(bytes, poi, observations)` — raw IPC stream bytes → HistFactoryModel.
- `nextstat.to_arrow_yields_ipc(model, params)` — HistFactoryModel → IPC bytes (yields).
- `nextstat.to_arrow_params_ipc(model, params)` — HistFactoryModel → IPC bytes (parameters).
