Arrow / Polars Integration
Zero-Copy Columnar Data Interchange
NextStat speaks Apache Arrow natively. Ingest histogram data from PyArrow, Polars, DuckDB, or Spark — and export model results back — with zero Python-side deserialization overhead. The bridge uses Arrow IPC under the hood, backed by Rust's arrow 57.3 and parquet 57.3 crates.
Data Flow
```
PyArrow Table / Polars DataFrame / DuckDB Result
        │
        ▼  .to_arrow() or native
Arrow RecordBatch
        │
        ▼  IPC serialize (~1 memcpy)
nextstat.from_arrow(table) ──► Rust arrow crate
        │                            │
        ▼                            ▼
HistFactoryModel               Arrow RecordBatch
        │                            │
        ▼                            ▼  IPC deserialize
nextstat.to_arrow(model)       PyArrow Table
```

Quick Start
```python
import pyarrow as pa
import nextstat

# Define histogram data as an Arrow table
table = pa.table({
    "channel": ["SR", "SR", "CR"],
    "sample": ["signal", "background", "background"],
    "yields": [[5., 10., 15.], [100., 200., 150.], [500., 600.]],
    "stat_error": [[1., 2., 3.], [10., 14., 12.], None],
})

# Create model and fit
model = nextstat.from_arrow(table, poi="mu")
result = nextstat.fit(model)
print(result)
```

Table Schema
The input Arrow table must follow this schema. Each row represents one (channel, sample) pair.
| Column | Arrow Type | Required | Description |
|---|---|---|---|
| channel | Utf8 | yes | Channel (region) name |
| sample | Utf8 | yes | Sample (process) name |
| yields | List<Float64> | yes | Expected event counts per bin |
| stat_error | List<Float64> | no | Per-bin statistical uncertainties |
Polars
```python
import polars as pl
import nextstat

# Read histogram data from Parquet via Polars
df = pl.read_parquet("histograms.parquet")
model = nextstat.from_arrow(df.to_arrow(), poi="mu")

# Or read Parquet directly (Rust-native, no Python overhead)
model = nextstat.from_parquet("histograms.parquet", poi="mu")
```

DuckDB
```python
import duckdb
import nextstat

con = duckdb.connect()
table = con.sql("""
    SELECT channel, sample, yields
    FROM 'histograms.parquet'
""").arrow()
model = nextstat.from_arrow(table)
```

Export
Export model data back to Arrow for downstream analysis, dashboards, or ML pipelines.
```python
model = nextstat.from_pyhf(workspace_json)

# Expected yields per channel
yields = nextstat.to_arrow(model, what="yields")
print(yields.to_pandas())
#   channel sample                 yields
# 0      CR  total         [500.0, 600.0]
# 1      SR  total  [105.0, 210.0, 165.0]

# Parameter metadata
params = nextstat.to_arrow(model, what="params")
print(params.to_pandas())
#               name  index  value  bound_lo  bound_hi  init
# 0               mu      0    1.0       0.0      10.0   1.0
# 1  staterror_SR[0]      1    1.0     1e-10      10.0   1.0
```

Custom Observations
```python
# By default, Asimov data (sum of yields) is used.
# Pass observed data explicitly:
model = nextstat.from_arrow(
    table,
    poi="mu",
    observations={
        "SR": [110., 215., 170.],
        "CR": [510., 590.],
    },
)
```

Low-Level IPC API
For maximum control, use the raw IPC bytes interface directly. This is what from_arrow() and to_arrow() use internally.
```python
import pyarrow as pa

# Serialize table to IPC bytes
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, table.schema)
for batch in table.to_batches():
    writer.write_batch(batch)
writer.close()
ipc_bytes = sink.getvalue().to_pybytes()

# Ingest IPC bytes directly
model = nextstat.from_arrow_ipc(ipc_bytes, poi="mu")

# Export as IPC bytes
yields_bytes = nextstat.to_arrow_yields_ipc(model)
params_bytes = nextstat.to_arrow_params_ipc(model)

# Deserialize in Python
yields_table = pa.ipc.open_stream(yields_bytes).read_all()
```

API Reference
- `nextstat.from_arrow(table, *, poi, observations)` — PyArrow Table/RecordBatch → HistFactoryModel.
- `nextstat.to_arrow(model, *, params, what)` — HistFactoryModel → PyArrow Table. `what="yields"` or `"params"`.
- `nextstat.from_parquet(path, *, poi, observations)` — Parquet file → HistFactoryModel (Rust-native reader).
- `nextstat.from_arrow_ipc(bytes, poi, observations)` — raw IPC stream bytes → HistFactoryModel.
- `nextstat.to_arrow_yields_ipc(model, params)` — HistFactoryModel → IPC bytes (yields).
- `nextstat.to_arrow_params_ipc(model, params)` — HistFactoryModel → IPC bytes (parameters).
