Joins

Tafra supports SQL-style joins: inner, left, and cross. The on parameter takes a list of tuples specifying left column, right column, and comparison operator.

Join syntax

result = left.inner_join(right, on=[('left_col', 'right_col', '==')])

The on parameter is a list of (left_column, right_column, operator) tuples. Valid operators:

Operator	Meaning
`'=='`	Equal to
`'!='`	Not equal to
`'<'`	Less than
`'<='`	Less than or equal to
`'>'`	Greater than
`'>='`	Greater than or equal to

The optional select parameter limits which columns appear in the output. If omitted, all unique column names are returned, preferring the left table when a column exists in both.

Inner Join

Returns only rows where the join condition matches in both tables. Analogous to SQL INNER JOIN.

import numpy as np
from tafra import Tafra

orders = Tafra({
    'order_id': np.array([1, 2, 3, 4]),
    'customer_id': np.array([10, 20, 30, 40]),
    'amount': np.array([100.0, 200.0, 150.0, 300.0]),
})

customers = Tafra({
    'customer_id': np.array([10, 20, 50]),
    'name': np.array(['Alice', 'Bob', 'Eve']),
})

result = orders.inner_join(
    customers,
    on=[('customer_id', 'customer_id', '==')],
)

Output

order_id  customer_id  amount  name
1         10           100.0   'Alice'
2         20           200.0   'Bob'

customer_id 30 and 40 have no match -- excluded. customer_id 50 has no match in orders -- excluded.

SQL equivalent

SELECT *
FROM orders
INNER JOIN customers
  ON orders.customer_id = customers.customer_id

Selecting specific columns

result = orders.inner_join(
    customers,
    on=[('customer_id', 'customer_id', '==')],
    select=['order_id', 'name', 'amount'],
)
print(result.columns)

Output

('order_id', 'name', 'amount')

Left Join

Returns all rows from the left table, with matching rows from the right. Unmatched right-side columns are filled with native null values where possible. Analogous to SQL LEFT JOIN. See Null Handling & Dtype Behavior for details on how each dtype is handled.

result = orders.left_join(
    customers,
    on=[('customer_id', 'customer_id', '==')],
)

Output

order_id  customer_id  amount  name
1         10           100.0   'Alice'
2         20           200.0   'Bob'
3         30           150.0   None
4         40           300.0   None

SQL equivalent

SELECT *
FROM orders
LEFT JOIN customers
  ON orders.customer_id = customers.customer_id

Cross Join

Returns the Cartesian product of both tables -- every row from the left paired with every row from the right. No on parameter is needed.

sizes = Tafra({
    'size': np.array(['S', 'M', 'L']),
})

colors = Tafra({
    'color': np.array(['red', 'blue']),
})

result = sizes.cross_join(colors)

Output

size  color
'S'   'red'
'S'   'blue'
'M'   'red'
'M'   'blue'
'L'   'red'
'L'   'blue'

Result has 3 * 2 = 6 rows.

SQL equivalent

SELECT *
FROM sizes
CROSS JOIN colors

Selecting columns

result = sizes.cross_join(colors, select=['size', 'color'])

Multi-column Joins

Join on multiple columns by adding more tuples to on:

left = Tafra({
    'year': np.array([2023, 2023, 2024]),
    'region': np.array(['east', 'west', 'east']),
    'value': np.array([10, 20, 30]),
})

right = Tafra({
    'year': np.array([2023, 2024]),
    'region': np.array(['east', 'east']),
    'budget': np.array([100, 200]),
})

result = left.inner_join(
    right,
    on=[
        ('year', 'year', '=='),
        ('region', 'region', '=='),
    ],
)

Output

year  region  value  budget
2023  'east'  10     100
2024  'east'  30     200

Non-equality Joins

Operators other than '==' enable range joins and inequality conditions:

events = Tafra({
    'event_id': np.array([1, 2]),
    'start': np.array([5, 15]),
})

windows = Tafra({
    'window_id': np.array([1, 2, 3]),
    'threshold': np.array([3, 10, 20]),
})

# Find events where start >= threshold
result = events.inner_join(
    windows,
    on=[('start', 'threshold', '>=')],
)

Output

event_id  start  window_id  threshold
1         5      1          3
2         15     1          3
2         15     2          10

Note

Non-equality joins use a row-by-row loop and do not benefit from the sort-merge or hash-based fast paths. For large datasets, prefer equality joins when possible.

C-accelerated Hash Join

For equality joins ('==' operator), Tafra uses an optimized join algorithm. If the optional C extension (_accel.c) is compiled, joins use an O(n) hash-based implementation. Otherwise, a pure-Python sort-merge algorithm is used. Both produce identical results -- the C extension is purely a performance optimization.

Build the C extension with:

python setup.py build_ext --inplace

Null Handling & Dtype Behavior

Dtype validation on join keys

Join keys are validated using the _dtypes metadata — the user's declared intent for column types, not the raw numpy array dtype. This means:

String variants match: StringDType, StringDType(na_object=None), <U8, <U12 all reduce to 'str' and can join freely.
Same-family numeric types are promoted: int32 vs int64, or float32 vs float64, are promoted to the higher precision type using np.result_type. The narrower column is cast automatically.
Different families are rejected: int64 vs float64, or int64 vs str, raise TypeError.

Left key dtype	Right key dtype	Allowed?	Result dtype
`StringDType`	`<U10`	Yes	both keep original dtypes
`<U8`	`<U12`	Yes	both keep original dtypes
`int32`	`int64`	Yes (promoted)	`int64`
`float32`	`float64`	Yes (promoted)	`float64`
`int32`	`int32`	Yes	`int32`
`int64`	`float64`	No — `TypeError`	—
`int64`	`StringDType`	No — `TypeError`	—
`datetime64[ns]`	`datetime64[us]`	Yes	both keep original dtypes

Type promotion uses np.result_type under the hood — the same function numpy uses to determine output dtypes for binary operations. Promotion applies only to key matching — the original input tables are never modified, and the output column dtype depends on which table the column came from (left takes precedence for shared column names).

# int32 key joins with int64 key — promoted for matching, not in output
left = Tafra({'k': np.array([1, 2], dtype=np.int32), 'v': np.array([10, 20])})
right = Tafra({'k': np.array([1, 2], dtype=np.int64), 'info': np.array(['a', 'b'])})
result = left.inner_join(right, on=[('k', 'k', '==')])
# Works — k promoted to int64 internally for matching
# result['k'].dtype is int32 (left table wins)

Null values in join keys

Null values in join keys (None, NaN, NaT) follow SQL semantics: NULL never equals NULL. Rows with null keys are excluded from matching.

Key dtype	Null value	Behavior
Float	`NaN`	Excluded from matching
String (`StringDType(na_object=None)`)	`None`	Excluded from matching
Datetime (`datetime64`)	`NaT`	Excluded from matching
Timedelta (`timedelta64`)	`NaT`	Excluded from matching

For inner joins, rows with null keys are dropped from both sides — they produce no output rows.

For left joins, left-side rows with null keys are preserved in the output with null-filled right columns (since they can't match anything).

left = Tafra({
    'k': np.array([1.0, np.nan, 3.0]),
    'v': np.array([10, 20, 30]),
})
right = Tafra({
    'k': np.array([1.0, np.nan, 3.0]),
    'info': np.array(['a', 'b', 'c']),
})

# Inner join: NaN rows excluded from both sides
result = left.inner_join(right, on=[('k', 'k', '==')])
# Result: k=[1.0, 3.0], v=[10, 30], info=['a', 'c']
# NaN row excluded — NaN != NaN

# Left join: NaN row preserved with null right columns
result = left.left_join(right, on=[('k', 'k', '==')])
# Result: k=[1.0, NaN, 3.0], v=[10, 20, 30], info=['a', None, 'c']

Tip

Avoid using float columns as join keys. If you must, filter out NaN values first, or cast to integer IDs.

Left join null-fill by dtype

When a left join has unmatched rows, right-side columns are filled with native null values where the dtype supports them. Tafra preserves the original column precision — float32 stays float32, not promoted to float64.

Column dtype	Null value	Result dtype	Warning?
`StringDType`	`None`	`StringDType(na_object=None)`	No
`<U` (fixed unicode)	`None`	`StringDType(na_object=None)`	No
`float32`	`NaN`	`float32` (preserved)	No
`float64`	`NaN`	`float64` (preserved)	No
`datetime64`	`NaT`	original dtype preserved	No
`timedelta64`	`NaT`	original dtype preserved	No
`int8`/`int16`/`int32`/`int64`	`None`	`object`	Yes
`bool`	`None`	`object`	Yes
`bytes`	`None`	`object`	Yes

For integer, boolean, and bytes columns, numpy has no native null representation. These columns fall back to object dtype and a warning is emitted:

UserWarning: Left join: column 'count' (dtype int64) has unmatched rows
and no native null representation. Dtype has been cast to object.
Use .astype(float) if NaN semantics are needed.

To recover a numeric dtype after the join:

result = left.left_join(right, on=[('k', 'k', '==')])

# Convert object column with None to float with NaN
result['count'] = np.where(
    result['count'] == None, np.nan,  # noqa: E711
    result['count'],
).astype(float)

Type promotion in join results

Shared column names: When both tables have a column with the same name, the left table's version takes precedence. No type promotion occurs — the right column is simply excluded.

Null-fill preserves precision: float32 columns stay float32 after null-fill (not widened to float64). The null value (NaN) fits in any float width.

String promotion: <U (fixed-width unicode) columns are promoted to StringDType(na_object=None) when null values are needed, since fixed-width unicode cannot represent None.

Empty table joins

Joining with an empty table (zero rows) is allowed but emits a warning, since it usually indicates unexpected data:

UserWarning: Join: one or both tables have zero rows.
Returning shortcut result.

Scenario	Result
Inner join, either table empty	Empty Tafra with correct columns/dtypes
Left join, left empty	Empty Tafra with correct columns/dtypes
Left join, right empty	Left table with null-filled right columns
Cross join, either table empty	Empty Tafra with correct columns/dtypes

Column selection

The select parameter controls which columns appear in the output.

If omitted, all unique column names are returned
Column order follows left-table-then-right-table order (not select order)
select acts as a filter, not an ordering specification
Nonexistent column names are silently ignored
When a column name exists in both tables, the left table's version is used

result = left.inner_join(
    right,
    on=[('k', 'k', '==')],
    select=['k', 'info', 'v'],  # acts as filter, order follows table order
)
print(result.columns)  # ('k', 'v', 'info') — left columns first, then right

Non-equality join performance

Non-equality operators (!=, <, <=, >, >=) use an O(n × m) row-by-row loop. They do not benefit from the C extension or sort-merge optimization. This applies even when mixed with equality conditions — if any condition uses a non-equality operator, the entire join falls back to the row-by-row path.

For large datasets, consider restructuring the query to use an equality join followed by a filter:

# Instead of: left.inner_join(right, on=[('date', 'cutoff', '>=')])
# Do:
result = left.cross_join(right)
mask = result['date'] >= result['cutoff']
result = result._slice(np.where(mask)[0])

Row ordering

Join results are not guaranteed to be in any particular order. The C extension (hash-based) and Python fallback (sort-merge) may produce rows in different orders. If you need a specific sort order, call result.sort('column') after the join. This matches SQL semantics — SELECT without ORDER BY has no guaranteed ordering.