Joins
Tafra supports SQL-style joins: inner, left, and cross. The on parameter
takes a list of tuples specifying left column, right column, and comparison
operator.
Join syntax
The on parameter is a list of (left_column, right_column, operator) tuples.
Valid operators:
| Operator | Meaning |
|---|---|
'==' |
Equal to |
'!=' |
Not equal to |
'<' |
Less than |
'<=' |
Less than or equal to |
'>' |
Greater than |
'>=' |
Greater than or equal to |
The optional select parameter limits which columns appear in the output.
If omitted, all unique column names are returned, preferring the left table
when a column exists in both.
Inner Join
Returns only rows where the join condition matches in both tables. Analogous
to SQL INNER JOIN.
import numpy as np
from tafra import Tafra
orders = Tafra({
'order_id': np.array([1, 2, 3, 4]),
'customer_id': np.array([10, 20, 30, 40]),
'amount': np.array([100.0, 200.0, 150.0, 300.0]),
})
customers = Tafra({
'customer_id': np.array([10, 20, 50]),
'name': np.array(['Alice', 'Bob', 'Eve']),
})
result = orders.inner_join(
customers,
on=[('customer_id', 'customer_id', '==')],
)
Output
customer_id 30 and 40 have no match -- excluded.
customer_id 50 has no match in orders -- excluded.
SQL equivalent
Selecting specific columns
result = orders.inner_join(
customers,
on=[('customer_id', 'customer_id', '==')],
select=['order_id', 'name', 'amount'],
)
print(result.columns)
Left Join
Returns all rows from the left table, with matching rows from the right.
Unmatched right-side columns are filled with native null values where possible.
Analogous to SQL LEFT JOIN. See Null Handling & Dtype Behavior
for details on how each dtype is handled.
Output
SQL equivalent
Cross Join
Returns the Cartesian product of both tables -- every row from the left paired
with every row from the right. No on parameter is needed.
sizes = Tafra({
'size': np.array(['S', 'M', 'L']),
})
colors = Tafra({
'color': np.array(['red', 'blue']),
})
result = sizes.cross_join(colors)
Output
Result has 3 * 2 = 6 rows.
SQL equivalent
Selecting columns
Multi-column Joins
Join on multiple columns by adding more tuples to on:
left = Tafra({
'year': np.array([2023, 2023, 2024]),
'region': np.array(['east', 'west', 'east']),
'value': np.array([10, 20, 30]),
})
right = Tafra({
'year': np.array([2023, 2024]),
'region': np.array(['east', 'east']),
'budget': np.array([100, 200]),
})
result = left.inner_join(
right,
on=[
('year', 'year', '=='),
('region', 'region', '=='),
],
)
Non-equality Joins
Operators other than '==' enable range joins and inequality conditions:
events = Tafra({
'event_id': np.array([1, 2]),
'start': np.array([5, 15]),
})
windows = Tafra({
'window_id': np.array([1, 2, 3]),
'threshold': np.array([3, 10, 20]),
})
# Find events where start >= threshold
result = events.inner_join(
windows,
on=[('start', 'threshold', '>=')],
)
Note
Non-equality joins use a row-by-row loop and do not benefit from the sort-merge or hash-based fast paths. For large datasets, prefer equality joins when possible.
C-accelerated Hash Join
For equality joins ('==' operator), Tafra uses an optimized join algorithm.
If the optional C extension (_accel.c) is compiled, joins use an O(n)
hash-based implementation. Otherwise, a pure-Python sort-merge algorithm is
used. Both produce identical results -- the C extension is purely a performance
optimization.
Build the C extension with:
Null Handling & Dtype Behavior
Dtype validation on join keys
Join keys are validated using the _dtypes metadata — the user's declared
intent for column types, not the raw numpy array dtype. This means:
- String variants match:
StringDType,StringDType(na_object=None),<U8,<U12all reduce to'str'and can join freely. - Same-family numeric types are promoted:
int32vsint64, orfloat32vsfloat64, are promoted to the higher precision type usingnp.result_type. The narrower column is cast automatically. - Different families are rejected:
int64vsfloat64, orint64vsstr, raiseTypeError.
| Left key dtype | Right key dtype | Allowed? | Result dtype |
|---|---|---|---|
StringDType |
<U10 |
Yes | both keep original dtypes |
<U8 |
<U12 |
Yes | both keep original dtypes |
int32 |
int64 |
Yes (promoted) | int64 |
float32 |
float64 |
Yes (promoted) | float64 |
int32 |
int32 |
Yes | int32 |
int64 |
float64 |
No — TypeError |
— |
int64 |
StringDType |
No — TypeError |
— |
datetime64[ns] |
datetime64[us] |
Yes | both keep original dtypes |
Type promotion uses np.result_type under the hood — the same function numpy
uses to determine output dtypes for binary operations. Promotion applies only
to key matching — the original input tables are never modified, and the output
column dtype depends on which table the column came from (left takes precedence
for shared column names).
# int32 key joins with int64 key — promoted for matching, not in output
left = Tafra({'k': np.array([1, 2], dtype=np.int32), 'v': np.array([10, 20])})
right = Tafra({'k': np.array([1, 2], dtype=np.int64), 'info': np.array(['a', 'b'])})
result = left.inner_join(right, on=[('k', 'k', '==')])
# Works — k promoted to int64 internally for matching
# result['k'].dtype is int32 (left table wins)
Null values in join keys
Null values in join keys (None, NaN, NaT) follow SQL semantics:
NULL never equals NULL. Rows with null keys are excluded from matching.
| Key dtype | Null value | Behavior |
|---|---|---|
| Float | NaN |
Excluded from matching |
String (StringDType(na_object=None)) |
None |
Excluded from matching |
Datetime (datetime64) |
NaT |
Excluded from matching |
Timedelta (timedelta64) |
NaT |
Excluded from matching |
For inner joins, rows with null keys are dropped from both sides — they produce no output rows.
For left joins, left-side rows with null keys are preserved in the output with null-filled right columns (since they can't match anything).
left = Tafra({
'k': np.array([1.0, np.nan, 3.0]),
'v': np.array([10, 20, 30]),
})
right = Tafra({
'k': np.array([1.0, np.nan, 3.0]),
'info': np.array(['a', 'b', 'c']),
})
# Inner join: NaN rows excluded from both sides
result = left.inner_join(right, on=[('k', 'k', '==')])
# Result: k=[1.0, 3.0], v=[10, 30], info=['a', 'c']
# NaN row excluded — NaN != NaN
# Left join: NaN row preserved with null right columns
result = left.left_join(right, on=[('k', 'k', '==')])
# Result: k=[1.0, NaN, 3.0], v=[10, 20, 30], info=['a', None, 'c']
Tip
Avoid using float columns as join keys. If you must, filter out NaN values first, or cast to integer IDs.
Left join null-fill by dtype
When a left join has unmatched rows, right-side columns are filled with
native null values where the dtype supports them. Tafra preserves the
original column precision — float32 stays float32, not promoted to
float64.
| Column dtype | Null value | Result dtype | Warning? |
|---|---|---|---|
StringDType |
None |
StringDType(na_object=None) |
No |
<U (fixed unicode) |
None |
StringDType(na_object=None) |
No |
float32 |
NaN |
float32 (preserved) |
No |
float64 |
NaN |
float64 (preserved) |
No |
datetime64 |
NaT |
original dtype preserved | No |
timedelta64 |
NaT |
original dtype preserved | No |
int8/int16/int32/int64 |
None |
object |
Yes |
bool |
None |
object |
Yes |
bytes |
None |
object |
Yes |
For integer, boolean, and bytes columns, numpy has no native null
representation. These columns fall back to object dtype and a warning is
emitted:
UserWarning: Left join: column 'count' (dtype int64) has unmatched rows
and no native null representation. Dtype has been cast to object.
Use .astype(float) if NaN semantics are needed.
To recover a numeric dtype after the join:
result = left.left_join(right, on=[('k', 'k', '==')])
# Convert object column with None to float with NaN
result['count'] = np.where(
result['count'] == None, np.nan, # noqa: E711
result['count'],
).astype(float)
Type promotion in join results
Shared column names: When both tables have a column with the same name, the left table's version takes precedence. No type promotion occurs — the right column is simply excluded.
Null-fill preserves precision: float32 columns stay float32 after
null-fill (not widened to float64). The null value (NaN) fits in any
float width.
String promotion: <U (fixed-width unicode) columns are promoted to
StringDType(na_object=None) when null values are needed, since fixed-width
unicode cannot represent None.
Empty table joins
Joining with an empty table (zero rows) is allowed but emits a warning, since it usually indicates unexpected data:
| Scenario | Result |
|---|---|
| Inner join, either table empty | Empty Tafra with correct columns/dtypes |
| Left join, left empty | Empty Tafra with correct columns/dtypes |
| Left join, right empty | Left table with null-filled right columns |
| Cross join, either table empty | Empty Tafra with correct columns/dtypes |
Column selection
The select parameter controls which columns appear in the output.
- If omitted, all unique column names are returned
- Column order follows left-table-then-right-table order (not
selectorder) selectacts as a filter, not an ordering specification- Nonexistent column names are silently ignored
- When a column name exists in both tables, the left table's version is used
result = left.inner_join(
right,
on=[('k', 'k', '==')],
select=['k', 'info', 'v'], # acts as filter, order follows table order
)
print(result.columns) # ('k', 'v', 'info') — left columns first, then right
Non-equality join performance
Non-equality operators (!=, <, <=, >, >=) use an O(n × m)
row-by-row loop. They do not benefit from the C extension or sort-merge
optimization. This applies even when mixed with equality conditions —
if any condition uses a non-equality operator, the entire join falls back
to the row-by-row path.
For large datasets, consider restructuring the query to use an equality join followed by a filter:
# Instead of: left.inner_join(right, on=[('date', 'cutoff', '>=')])
# Do:
result = left.cross_join(right)
mask = result['date'] >= result['cutoff']
result = result._slice(np.where(mask)[0])
Row ordering
Join results are not guaranteed to be in any particular order. The C
extension (hash-based) and Python fallback (sort-merge) may produce
rows in different orders. If you need a specific sort order, call
result.sort('column') after the join. This matches SQL semantics —
SELECT without ORDER BY has no guaranteed ordering.