Version History

2.2.1

  • Fix: Join and Union dtype validation now compares _dtypes metadata (user intent) instead of raw numpy dtypes. Because _format_dtype collapses StringDType, <U8, <U12, etc. to 'str', string columns with different underlying representations are no longer rejected as mismatched.
  • Fix: update_dtypes_inplace preserves raw numpy dtypes for casting and maps the 'str' label to StringDType(na_object=None) -- an explicit update_dtypes_inplace({'x': 'str'}) now converts <U columns to nullable StringDType. Construction leaves the original dtype unchanged.
  • Fix: Left join null-fill preserves column dtypes -- string columns use StringDType(na_object=None), float columns use NaN, datetime/timedelta columns use NaT. Only int/bool/bytes fall back to object (with a warning). Note: left join results with null-filled string columns carry StringDType(na_object=None), which differs from plain StringDType().
  • Warn when left join casts a column to object dtype due to unmatched rows
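
The label collapsing behind the first fix can be pictured with a small sketch. `dtype_label` is a hypothetical helper written for illustration, not Tafra's `_format_dtype`; it assumes only that fixed-width unicode dtypes (kind 'U') and StringDType (kind 'T') should share the label 'str'.

```python
import numpy as np


def dtype_label(dtype) -> str:
    """Hypothetical helper: map a numpy dtype to a user-facing label."""
    dt = np.dtype(dtype)
    # Fixed-width unicode ('U') and variable-width StringDType ('T')
    # both collapse to the same label.
    if dt.kind in ("U", "T"):
        return "str"
    return dt.name


left = np.array(["a", "bb"], dtype="<U8")
right = np.array(["a", "cc"], dtype="<U12")

# The raw dtypes differ, but the collapsed labels agree.
raw_dtypes_match = left.dtype == right.dtype
labels_match = dtype_label(left.dtype) == dtype_label(right.dtype)
```

Comparing labels rather than raw dtypes is what lets a `<U8` key column validate against a `<U12` one in this sketch.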

2.2.0

  • Performance: Add C-accelerated composite_key for single-pass multi-column key encoding -- eliminates Python loop and temporaries
  • Performance: Add C-accelerated group_indices with O(n) hash-based group construction -- replaces np.unique + argsort + split pipeline
  • Performance: Add C-accelerated encode_strings with O(n) hash-based string-to-integer encoding -- replaces O(n log n) np.unique; 2-2.5x faster on string-column groupby
  • Performance: Multi-column GroupBy 1.9x faster (100k 2-col: 8.72 → 4.00 ms; 1M 2-col: 97 → 48 ms)
  • Performance: Add tuple_map(name=None) fast path -- skips NamedTuple construction, uses zip(*values) directly; now beats pandas itertuples
  • Modernize: Replace Dict, List, Tuple, Optional, Union with Python 3.10+ built-in generics and | union syntax; add from __future__ import annotations
  • Docs: New MkDocs Material website with custom landing page, benchmark charts, tutorials, and API reference with structured parameter tables
  • Remove RST cross-reference syntax from all docstrings
  • Fix docstring parameter name mismatches (dfs, column → columns, group_by → columns)
  • Normalize numpy-style docstring indentation for griffe compatibility
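
The O(n) hash-based encoding and group construction mentioned above can be sketched in pure Python. This is an illustration of the general technique, not the C implementation; the function names `encode_first_seen` and `indices_by_code` are hypothetical.

```python
def encode_first_seen(values):
    """Map each value to an integer code in first-seen order, in one pass."""
    codebook = {}
    codes = []
    for value in values:
        # setdefault assigns the next free code the first time a value is seen.
        codes.append(codebook.setdefault(value, len(codebook)))
    # Dicts preserve insertion order, so the keys list doubles as the code table.
    return codes, list(codebook)


def indices_by_code(codes, n_groups):
    """Collect the row positions belonging to each group code."""
    groups = [[] for _ in range(n_groups)]
    for row, code in enumerate(codes):
        groups[code].append(row)
    return groups


codes, uniques = encode_first_seen(["b", "a", "b", "c", "a"])
groups = indices_by_code(codes, len(uniques))
```

Both passes touch each row once and rely on hash lookups, which is why no sort is needed, in contrast to an `np.unique`-based pipeline.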

2.1.0

  • Performance: Rewrite GroupBy, Transform, IterateBy to use index-based grouping (_build_group_indices) instead of per-group boolean masks -- up to 23x faster on multi-column groups
  • Performance: Add numpy-native sort-merge join for equi-joins in InnerJoin and LeftJoin -- up to 20x faster than pre-2.1, now faster than pandas (1.4--7.6x) across all tested sizes
  • Performance: Fix _validate_columns to use O(1) dict lookup instead of O(n) .keys() view
  • Performance: Skip redundant np.dtype() call in _format_dtype when input is already np.dtype
  • Performance: Use dtype.kind check instead of np.dtype(object) allocation in ObjectFormatter.parse_dtype
  • Breaking: Adopt numpy StringDType for string columns -- string data previously stored as dtype=object now uses np.dtypes.StringDType() (dtype kind 'T'); internal dtype name is 'str' instead of 'object'
  • Auto-convert object arrays of Python strings to StringDType during construction and column assignment
  • CSVReader string columns now produce StringDType instead of object
  • Fix Transform.apply correctness: group-by column data was being copied before the boolean mask was fully constructed
  • Fix update_dtypes_inplace to handle StringDType columns when replacing empty strings for numeric conversion
  • Performance: Add vectorized fast path for GroupBy when aggregation functions are recognized numpy reducers -- uses np.bincount / ufunc.reduceat instead of per-group Python loops; now faster than pandas at small-to-medium scales. Recognized functions: np.sum, np.mean, np.std, np.var, np.min, np.max, np.ptp, np.prod, np.median, np.any, np.all, np.count_nonzero, len, sum, plus all nan-variants
  • Add percentile(q) aggregation factory for use in group_by -- creates vectorized fast-path callables for arbitrary percentiles
  • Add geomean and harmean aggregation functions (geometric and harmonic means) with vectorized fast path
  • Performance: Replace np.unique (O(n log n) sort) with direct array mapping (O(n)) for group label assignment -- GroupBy is now 4--8x faster than both pandas and polars at <=10k rows
  • Performance: String/object columns auto-encoded to integer codes for grouping -- eliminates the Python dict fallback for multi-column groups
  • Add chunks(n, sort_by=): split into n equal-sized Tafra chunks with optional pre-sort
  • Add chunk_rows(size, sort_by=): split by maximum row count
  • Add partition(columns, sort_by=): split by group values (like GroupBy but returns sub-Tafra instances for parallel dispatch); supports sorting within each partition
  • Add Tafra.concat(tafras): concatenate multiple Tafra row-wise
  • Add tail(n): return last n rows (complement to head)
  • Add sort(columns, reverse=): public sort API with multi-column and descending support
  • Add sample(n, seed=): random row sampling with reproducible seed
  • Add drop_duplicates(columns): deduplicate rows by column values
  • Add value_counts(column): count occurrences of each unique value
  • Add describe(): summary statistics (count, mean, std, min, quartiles, max) for all numeric columns
  • Add shift(n): lag/lead rows by n positions, filling with NaN/None
  • Optional C extension (tafra/_accel.c) for single-pass grouped aggregation (Welford variance, sum, mean, min, max, count) and O(n) hash-based equi-joins -- falls back to pure Python + numpy if not compiled
  • Fix join codebook bug: string-key joins now use shared codebook across left/right tables via _encode_columns_paired
  • Tafra.concat() validates that all tafras have matching column sets
  • Add 40 new tests (88 total)
  • Add test/bench_tafra.py performance benchmark suite
  • Add test/bench_vs_pandas_vs_polars.py -- 5-way comparison (Tafra+C, Tafra pure, pandas 2.3, pandas 3.0, polars 1.39)
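
For readers unfamiliar with the sort-merge equi-join named above, here is a minimal pure-Python sketch of the idea. It is not the numpy-native code in Tafra; `sort_merge_inner_join` is a hypothetical name, and the function returns matching row-index pairs rather than joined columns.

```python
def sort_merge_inner_join(left_keys, right_keys):
    """Return (left_row, right_row) index pairs whose keys are equal."""
    # Sort row positions by key on each side.
    left_order = sorted(range(len(left_keys)), key=left_keys.__getitem__)
    right_order = sorted(range(len(right_keys)), key=right_keys.__getitem__)

    pairs = []
    i = j = 0
    while i < len(left_order) and j < len(right_order):
        lkey = left_keys[left_order[i]]
        rkey = right_keys[right_order[j]]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left_order) and left_keys[left_order[i_end]] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right_order) and right_keys[right_order[j_end]] == lkey:
                j_end += 1
            # Emit the cross product of the two runs.
            for li in left_order[i:i_end]:
                for rj in right_order[j:j_end]:
                    pairs.append((li, rj))
            i, j = i_end, j_end
    return pairs


pairs = sort_merge_inner_join([3, 1, 2, 1], [1, 3, 3, 5])
```

After the two sorts, the merge step advances through each side once, so duplicate keys cost only the size of their cross product.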

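The single-pass Welford variance used for grouped aggregation can be sketched as follows. This is a pure-Python illustration, not the contents of tafra/_accel.c; the name `grouped_variance` and the `ddof=1` default are assumptions made for the example.

```python
def grouped_variance(codes, values, n_groups, ddof=1):
    """Per-group variance in one pass over the rows, using Welford's update."""
    count = [0] * n_groups
    mean = [0.0] * n_groups
    m2 = [0.0] * n_groups  # running sum of squared deviations from the mean

    for code, x in zip(codes, values):
        count[code] += 1
        delta = x - mean[code]
        mean[code] += delta / count[code]
        # Uses the updated mean, which keeps the accumulation numerically stable.
        m2[code] += delta * (x - mean[code])

    # Groups with too few rows for the chosen ddof yield NaN (an assumption).
    return [
        m2[g] / (count[g] - ddof) if count[g] > ddof else float("nan")
        for g in range(n_groups)
    ]


variances = grouped_variance([0, 1, 0, 1, 0], [1.0, 10.0, 2.0, 14.0, 3.0], 2)
```

Because each row updates only its own group's running count, mean, and squared-deviation sum, no per-group slicing or second pass over the data is required.
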
2.0.0

  • Breaking: Require Python >=3.9 (was >=3.7)
  • Breaking: Require numpy >=2.1 (was >=1.17)
  • Replace Travis CI with GitHub Actions (lint, test matrix 3.9-3.13, docs)
  • Modernize .readthedocs.yml for current RTD build system
  • Bump Sphinx >=7.0, sphinx-rtd-theme >=2.0
  • Fix deprecated Sphinx html_context CSS configuration
  • Update README badges (remove Travis/Coveralls, add Python versions)

1.1.0

  • Fix LeftJoin dtype merge order (right was overwriting left)
  • Fix _parse_iterable re-iterating consumed iterable
  • Fix to_csv UnboundLocalError for unsupported file types
  • Fix CSVReader file handle leak on empty files
  • Fix IterateBy yielding inconsistent types (always tuple now)
  • Fix ObjectFormatter.__setitem__ catching its own ValueError
  • Fix ndim returning column count instead of 2
  • Fix _parse_sequence/_parse_iterable/_parse_iterator mutating caller's dicts
  • Replace bare except clauses with specific exception types
  • Remove warnings.resetwarnings() global side effect
  • Fix mutable default arguments in function signatures
  • Migrate to pyproject.toml (remove setup.cfg, .coveragerc; setup.py retained for C extension builds in 2.1.0)
  • Version now in pyproject.toml, read via importlib.metadata
  • Replace flake8 with ruff
  • Fix all 68 mypy errors (strict mode)
  • Parameterize all np.ndarray type annotations
  • Add 9 new tests covering all bug fixes (48 tests, 99% coverage)
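
The mutable-default-argument fix refers to a general Python pitfall. The sketch below uses hypothetical functions, not Tafra's actual signatures, to show the problem and the usual `None`-sentinel remedy.

```python
def append_bad(item, bucket=[]):
    # The default list is created once, at function definition time,
    # so every call without `bucket` shares and mutates the same list.
    bucket.append(item)
    return bucket


def append_good(item, bucket=None):
    # Use None as a sentinel and build a fresh list per call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


# Copy the shared list after each call to record its state at that moment.
bad_first = list(append_bad(1))
bad_second = list(append_bad(2))
good_first = append_good(1)
good_second = append_good(2)
```

The second call to `append_bad` still sees the item from the first call, while each `append_good` call starts from an empty list.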

1.0.10

  • Add pipe and overload >> operator for Tafra objects
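
A pipe method paired with an overloaded >> operator typically looks like the sketch below. `Pipeable` is a hypothetical toy class, not the Tafra implementation, and it assumes that `obj >> fn` simply forwards to `obj.pipe(fn)`.

```python
class Pipeable:
    """Hypothetical wrapper that supports pipe() and the >> operator."""

    def __init__(self, data):
        self.data = data

    def pipe(self, fn, *args, **kwargs):
        # Call fn with this object as the first argument.
        return fn(self, *args, **kwargs)

    def __rshift__(self, fn):
        # obj >> fn is shorthand for obj.pipe(fn).
        return self.pipe(fn)


def double(obj):
    return Pipeable([x * 2 for x in obj.data])


def drop_first(obj):
    return Pipeable(obj.data[1:])


# >> is left-associative, so the functions run left to right.
result = Pipeable([1, 2, 3]) >> double >> drop_first
```

Chaining with >> reads in execution order, which is the main appeal over nesting calls as `drop_first(double(obj))`.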

1.0.9

  • Add test files to build

1.0.8

  • Check rows in constructor to ensure equal data length

1.0.7

  • Handle missing or NULL values in read_csv()
  • Cast empty elements to None when updating dtypes to avoid failure of np.astype()
  • Update some typing, minor refactoring for performance

1.0.6

  • Additional validations in constructor, primarily to evaluate Iterables of values
  • Split col_map into col_map and key_map, because the original function's return signature depended on an argument
  • Fix some documentation typos

1.0.5

  • Add tuple_map method
  • Refactor all iterators and ..._map functions to improve performance
  • Unpack np.ndarray if given as keys to constructor
  • Add validate=False in __post_init__ if inputs are known to be valid to improve performance

1.0.4

  • Add read_csv, to_csv
  • Various refactoring and improvement in data validation
  • Add typing_extensions to dependencies
  • Change method of dtype storage, extract str representation from np.dtype()

1.0.3

  • Add read_sql and read_sql_chunks
  • Add to_tuple and to_pandas
  • Cleanup constructor data validation

1.0.2

  • Add object_formatter to expose user formatting for dtype=object
  • Improvements to indexing and slicing

1.0.1

  • Add iter functions
  • Add map functions
  • Various constructor improvements

1.0.0

  • Initial Release