Getting Started
Installation
Install from PyPI (includes pre-built C extension for all major platforms):
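```shell
pip install tafra
```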
Or from conda-forge:
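Assuming the package is published on conda-forge under the same name:

```shell
conda install -c conda-forge tafra
```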
Both methods install pre-built wheels with the optional C extension already compiled. No compiler needed.
Your First Tafra
A Tafra is a set of named columns, each a typed numpy array of the same
length -- like a stripped-down dataframe backed directly by numpy.
Construct from a dict
```python
import numpy as np
from tafra import Tafra

t = Tafra({
    'x': np.array([1, 2, 3, 4]),
    'y': np.array(['one', 'two', 'one', 'two']),
})
```
Access columns
Column access returns the underlying numpy array directly -- no wrapper objects:
Inspect the contents
Output
Properties
Iterate rows
```python
# As named tuples
for row in t.itertuples():
    print(row.x, row.y)

# As single-row Tafra objects
for row in t.iterrows():
    print(row['x'], row['y'])
```
Select and slice
```python
# Select specific columns (returns a new Tafra, no copy)
t.select(['x'])

# Slice rows with numpy indexing
t._slice(slice(0, 2))        # first two rows
t._index(np.array([0, 3]))   # rows 0 and 3
```
Read from CSV
Convert to and from pandas
```python
import pandas as pd

# pandas -> Tafra
df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
t = Tafra.from_dataframe(df)

# Tafra -> pandas
df = pd.DataFrame(t.data)
```
Basic Operations
GroupBy -- aggregate to one row per group
group_by reduces rows by applying aggregation functions to each group, like
SQL GROUP BY:
```python
t = Tafra({
    'region': np.array(['east', 'east', 'west', 'west']),
    'sales': np.array([100, 200, 150, 250]),
    'units': np.array([10, 20, 15, 25]),
})

result = t.group_by(
    ['region'],
    {'sales': np.sum, 'units': np.sum},
)

print(result['region'])
print(result['sales'])
print(result['units'])
```
You can also rename columns during aggregation:
```python
result = t.group_by(
    ['region'],
    {'total_sales': (np.sum, 'sales'), 'avg_sales': (np.mean, 'sales')},
)
```
GroupBy detects known numpy reducers (np.sum, np.mean, np.std,
np.min, np.max, np.median, np.prod, len, etc.) and uses vectorized
np.bincount/ufunc.reduceat instead of per-group Python loops.
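To see why this is fast, here is a pure-numpy sketch of the technique (illustrative only, not tafra's actual implementation):

```python
import numpy as np

region = np.array(['east', 'east', 'west', 'west'])
sales = np.array([100, 200, 150, 250])

# Map each group key to a dense integer id: ids = [0, 0, 1, 1]
keys, ids = np.unique(region, return_inverse=True)

# One vectorized call computes every group's sum at once
sums = np.bincount(ids, weights=sales).astype(sales.dtype)
print(keys, sums)  # ['east' 'west'] [300 400]

# For reducers without a bincount form, sort by group and use reduceat
order = np.argsort(ids, kind='stable')
starts = np.searchsorted(ids[order], np.arange(len(keys)))
maxima = np.maximum.reduceat(sales[order], starts)
print(maxima)  # [200 250]
```

No per-group Python loop runs: the grouping work happens inside numpy's C code.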
Transform -- aggregate and broadcast back
transform groups like group_by, but broadcasts the result back to the
original row count (like pandas.groupby().transform()):
```python
t = Tafra({
    'region': np.array(['east', 'east', 'west', 'west']),
    'sales': np.array([100, 200, 150, 250]),
})

result = t.transform(
    ['region'],
    {'sales': np.sum},
)

print(result['sales'])
```
Inner Join
Joins use SQL-style (left_col, right_col, operator) tuples:
```python
left = Tafra({
    'id': np.array([1, 2, 3]),
    'name': np.array(['Alice', 'Bob', 'Carol']),
})

right = Tafra({
    'id': np.array([2, 3, 4]),
    'score': np.array([85, 92, 78]),
})

joined = left.inner_join(
    right,
    on=[('id', 'id', '==')],
)

print(joined['id'])
print(joined['name'])
print(joined['score'])
```
left_join and cross_join follow the same pattern.
IterateBy -- grouped iteration
iterate_by yields each group as a sub-Tafra for custom processing:
```python
for keys, indices, sub_tafra in t.iterate_by(['region']):
    print(f"Region: {keys}, rows: {sub_tafra.rows}")
    # process sub_tafra however you like
```
Partition -- split for parallel processing
partition splits a Tafra into sub-Tafras by group values, designed for
multiprocessing dispatch. Unlike group_by, it preserves all original rows:
```python
from concurrent.futures import ProcessPoolExecutor

parts = t.partition(['region'], sort_by=['sales'])

with ProcessPoolExecutor(max_workers=4) as pool:
    # process_fn is any function of yours that takes a Tafra and returns one
    results = list(pool.map(
        process_fn, [sub for _, sub in parts]
    ))

combined = Tafra.concat(results)
```
Building from Source
Only needed for development -- pip install tafra already includes
pre-built wheels with the C extension.
Requirements
- Python >= 3.9
- numpy >= 2.1
- A C compiler (for the _accel extension):
  - Windows: Visual Studio Build Tools (with Windows SDK) or MinGW-w64
  - Linux: gcc (usually pre-installed, or apt install build-essential)
  - macOS: Xcode Command Line Tools (xcode-select --install)
Build the C extension
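For example, assuming a setup.py-based build (the exact command may differ in your checkout):

```shell
# From the repository root: compile _accel in place for development
python setup.py build_ext --inplace
```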
On Windows with MinGW:
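Assuming the same setup.py-based build, distutils can be pointed at MinGW explicitly:

```shell
python setup.py build_ext --inplace --compiler=mingw32
```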
Verify the C extension is active
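One way to check, assuming the compiled module is importable as tafra._accel (the module path here is an assumption based on the _accel extension named above):

```shell
python -c "import tafra._accel; print('C extension loaded')"
```

If the import fails, the pure-Python fallback is in use.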
Build a distributable wheel
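Using the build frontend mentioned in the Windows notes below (install it first if needed):

```shell
pip install build
python -m build
```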
Windows build notes
The C extension requires the MSVC compiler to find the Windows SDK headers. If
you get fatal error C1083: Cannot open include file: 'io.h', the Windows SDK
include/lib paths are not set. Two options:
- Use a Developer Command Prompt (recommended): Open "Developer Command Prompt for VS" or "Developer PowerShell for VS" from the Start menu. This runs vcvarsall.bat automatically and sets all required paths.
- Use MinGW-w64 instead of MSVC: MinGW-w64 can be installed via conda (conda install m2w64-gcc -c conda-forge) or from winlibs.com.
If building with python -m build (which creates an isolated environment), use
--no-isolation to inherit your shell's environment variables, or run from a
Developer Command Prompt:
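```shell
python -m build --no-isolation
```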
Next Steps
- API Reference -- full documentation of all classes and methods
- Benchmarks -- performance comparisons against pandas and polars
- Changelog -- version history and release notes
- GitHub -- source code, issues, and contributions