Joins
Tafra supports SQL-style joins: inner, left, and cross. The on parameter
takes a list of tuples specifying left column, right column, and comparison
operator.
Join syntax
The on parameter is a list of (left_column, right_column, operator) tuples.
Valid operators:
| Operator | Meaning |
|---|---|
'==' |
Equal to |
'!=' |
Not equal to |
'<' |
Less than |
'<=' |
Less than or equal to |
'>' |
Greater than |
'>=' |
Greater than or equal to |
The optional select parameter limits which columns appear in the output.
If omitted, all unique column names are returned, preferring the left table
when a column exists in both.
Inner Join
Returns only rows where the join condition matches in both tables. Analogous
to SQL INNER JOIN.
import numpy as np
from tafra import Tafra
orders = Tafra({
'order_id': np.array([1, 2, 3, 4]),
'customer_id': np.array([10, 20, 30, 40]),
'amount': np.array([100.0, 200.0, 150.0, 300.0]),
})
customers = Tafra({
'customer_id': np.array([10, 20, 50]),
'name': np.array(['Alice', 'Bob', 'Eve']),
})
result = orders.inner_join(
customers,
on=[('customer_id', 'customer_id', '==')],
)
Output
customer_id 30 and 40 have no match -- excluded.
customer_id 50 has no match in orders -- excluded.
SQL equivalent
Selecting specific columns
result = orders.inner_join(
customers,
on=[('customer_id', 'customer_id', '==')],
select=['order_id', 'name', 'amount'],
)
print(result.columns)
Left Join
Returns all rows from the left table, with matching rows from the right.
Unmatched right-side columns are filled with native null values where possible.
Analogous to SQL LEFT JOIN.
Null handling in left joins
Unmatched right-side columns preserve their dtype when a native null exists:
| Column dtype | Null value | Result dtype |
|---|---|---|
String (StringDType, <U) |
None |
StringDType(na_object=None) |
Float (float32, float64) |
NaN |
original dtype preserved |
Datetime (datetime64) |
NaT |
original dtype preserved |
Timedelta (timedelta64) |
NaT |
original dtype preserved |
| Int, bool, bytes | None |
object (with warning) |
Output
SQL equivalent
Cross Join
Returns the Cartesian product of both tables -- every row from the left paired
with every row from the right. No on parameter is needed.
sizes = Tafra({
'size': np.array(['S', 'M', 'L']),
})
colors = Tafra({
'color': np.array(['red', 'blue']),
})
result = sizes.cross_join(colors)
Output
Result has 3 * 2 = 6 rows.
SQL equivalent
Selecting columns
Multi-column Joins
Join on multiple columns by adding more tuples to on:
left = Tafra({
'year': np.array([2023, 2023, 2024]),
'region': np.array(['east', 'west', 'east']),
'value': np.array([10, 20, 30]),
})
right = Tafra({
'year': np.array([2023, 2024]),
'region': np.array(['east', 'east']),
'budget': np.array([100, 200]),
})
result = left.inner_join(
right,
on=[
('year', 'year', '=='),
('region', 'region', '=='),
],
)
Non-equality Joins
Operators other than '==' enable range joins and inequality conditions:
events = Tafra({
'event_id': np.array([1, 2]),
'start': np.array([5, 15]),
})
windows = Tafra({
'window_id': np.array([1, 2, 3]),
'threshold': np.array([3, 10, 20]),
})
# Find events where start >= threshold
result = events.inner_join(
windows,
on=[('start', 'threshold', '>=')],
)
Note
Non-equality joins use a row-by-row loop and do not benefit from the sort-merge or hash-based fast paths. For large datasets, prefer equality joins when possible.
C-accelerated Hash Join
For equality joins ('==' operator), Tafra uses an optimized join algorithm.
If the optional C extension (_accel.c) is compiled, joins use an O(n)
hash-based implementation. Otherwise, a pure-Python sort-merge algorithm is
used. Both produce identical results -- the C extension is purely a performance
optimization.
Build the C extension with: