Iris Dataset
This tutorial walks through a complete analysis of the classic Iris flower
dataset using Tafra. We construct the data from inline arrays, explore columns,
compute basic statistics, and use group_by to aggregate by species.
Load the data
We use a 15-row subset of the Iris dataset (5 per species) to keep output readable.
import numpy as np
from tafra import Tafra

iris = Tafra({
    'sepal_length': np.array([5.1, 4.9, 4.7, 5.0, 5.4,
                              7.0, 6.4, 6.9, 5.5, 6.5,
                              6.3, 5.8, 7.1, 6.3, 6.5]),
    'sepal_width': np.array([3.5, 3.0, 3.2, 3.6, 3.9,
                             3.2, 3.2, 3.1, 2.3, 2.8,
                             3.3, 2.7, 3.0, 2.9, 3.0]),
    'petal_length': np.array([1.4, 1.4, 1.3, 1.4, 1.7,
                              4.7, 4.5, 4.9, 4.0, 4.6,
                              6.0, 5.1, 5.9, 5.6, 5.8]),
    'petal_width': np.array([0.2, 0.2, 0.2, 0.2, 0.4,
                             1.4, 1.5, 1.5, 1.3, 1.5,
                             2.5, 1.9, 2.1, 1.8, 1.8]),
    'species': np.array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa',
                         'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor',
                         'virginica', 'virginica', 'virginica', 'virginica', 'virginica']),
})

print(f'Rows: {iris.rows}')
print(f'Columns: {list(iris.columns)}')
Data sample (first 3 rows per species)
sepal_length | sepal_width | petal_length | petal_width | species
-------------+-------------+--------------+-------------+-----------
5.1 | 3.5 | 1.4 | 0.2 | setosa
4.9 | 3.0 | 1.4 | 0.2 | setosa
4.7 | 3.2 | 1.3 | 0.2 | setosa
... | ... | ... | ... | ...
7.0 | 3.2 | 4.7 | 1.4 | versicolor
6.4 | 3.2 | 4.5 | 1.5 | versicolor
6.9 | 3.1 | 4.9 | 1.5 | versicolor
... | ... | ... | ... | ...
6.3 | 3.3 | 6.0 | 2.5 | virginica
5.8 | 2.7 | 5.1 | 1.9 | virginica
7.1 | 3.0 | 5.9 | 2.1 | virginica
Column access
Accessing a column returns the underlying np.ndarray directly -- no wrapper
objects, no copies:
lengths = iris['sepal_length']
print(type(lengths)) # <class 'numpy.ndarray'>
print(lengths[:5]) # [5.1 4.9 4.7 5. 5.4]
You can use any numpy operation on the result:
print(f'Mean sepal length: {np.mean(iris["sepal_length"]):.2f}')
print(f'Std sepal width: {np.std(iris["sepal_width"]):.2f}')
print(f'Min petal length: {np.min(iris["petal_length"]):.2f}')
print(f'Max petal width: {np.max(iris["petal_width"]):.2f}')
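One detail worth knowing when reading these statistics: np.std defaults to the population standard deviation (ddof=0, dividing by n). For a small sample such as this 15-row subset, the sample standard deviation (ddof=1, dividing by n - 1) is noticeably larger. The sketch below uses only numpy and the sepal_width values from above; the variable names are illustrative.

```python
import numpy as np

# Same sepal_width values as in the tutorial data
sepal_width = np.array([3.5, 3.0, 3.2, 3.6, 3.9,
                        3.2, 3.2, 3.1, 2.3, 2.8,
                        3.3, 2.7, 3.0, 2.9, 3.0])

population_std = np.std(sepal_width)       # ddof=0: divide by n
sample_std = np.std(sepal_width, ddof=1)   # ddof=1: divide by n - 1

print(f'Population std (ddof=0): {population_std:.3f}')
print(f'Sample std (ddof=1):     {sample_std:.3f}')
```

Which one is appropriate depends on whether the rows are treated as the whole population or as a sample drawn from it.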
Group by species -- basic aggregation
group_by produces one row per unique combination of values in the group columns.
The aggregation dict maps output column names to functions. When the output
name matches an existing column, the function is applied to that column:
summary = iris.group_by(
    ['species'],
    {
        'sepal_length': np.mean,
        'sepal_width': np.mean,
        'petal_length': np.mean,
        'petal_width': np.mean,
    },
)

for col in summary.columns:
    vals = summary[col]
    if vals.dtype.kind == 'f':
        print(f'{col:14s} {vals[0]:6.2f} {vals[1]:6.2f} {vals[2]:6.2f}')
    else:
        print(f'{col:14s} {str(vals[0]):>6s} {str(vals[1]):>6s} {str(vals[2]):>6s}')
Output
Mean measurements by species
| Species | Sepal Length | Sepal Width | Petal Length | Petal Width |
|---|---|---|---|---|
| setosa | 5.02 | 3.44 | 1.44 | 0.24 |
| versicolor | 6.46 | 2.92 | 4.54 | 1.44 |
| virginica | 6.40 | 2.98 | 5.68 | 2.02 |
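To see what a group-by mean computes, the same aggregation can be sketched with numpy alone. This is an illustration of the idea, not a description of how Tafra implements group_by: np.unique assigns each row an integer group code, and np.bincount sums and counts rows per code. Only sepal_length is shown, and the variable names are illustrative.

```python
import numpy as np

species = np.repeat(['setosa', 'versicolor', 'virginica'], 5)
sepal_length = np.array([5.1, 4.9, 4.7, 5.0, 5.4,
                         7.0, 6.4, 6.9, 5.5, 6.5,
                         6.3, 5.8, 7.1, 6.3, 6.5])

# labels: sorted unique group keys; codes: group index of every row
labels, codes = np.unique(species, return_inverse=True)

sums = np.bincount(codes, weights=sepal_length)  # per-group sum
counts = np.bincount(codes)                      # per-group row count
group_means = sums / counts

for label, mean in zip(labels, group_means):
    print(f'{label:12s} {mean:.2f}')
```

Running this is a quick way to cross-check any row of a summary table against the raw data.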
Renaming output columns
Use the (function, source_column) tuple form to give aggregated columns
new names:
stats = iris.group_by(
    ['species'],
    {
        'mean_sepal_len': (np.mean, 'sepal_length'),
        'std_sepal_len': (np.std, 'sepal_length'),
        'mean_petal_len': (np.mean, 'petal_length'),
        'std_petal_len': (np.std, 'petal_length'),
        'count': (len, 'sepal_length'),
    },
)

print(f'Columns: {list(stats.columns)}')
print(f'Rows: {stats.rows}')
print()
print(f'{"species":>12s} {"mean_sl":>8s} {"std_sl":>8s} {"mean_pl":>8s} {"std_pl":>8s} {"n":>3s}')
for i in range(stats.rows):
    print(f'{str(stats["species"][i]):>12s}'
          f' {stats["mean_sepal_len"][i]:8.2f}'
          f' {stats["std_sepal_len"][i]:8.3f}'
          f' {stats["mean_petal_len"][i]:8.2f}'
          f' {stats["std_petal_len"][i]:8.3f}'
          f' {stats["count"][i]:3.0f}')
Output
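The two forms of aggregation entry (a bare function, or a (function, source_column) tuple) can be made concrete with a small sketch. The helper resolve_aggregation below is hypothetical and is not part of Tafra. It only illustrates the rule described above: a bare function reads the column named by the output key, while a tuple names its source column explicitly. A single group (the five setosa rows) stands in for the per-group data.

```python
import numpy as np

def resolve_aggregation(name, spec):
    """Hypothetical helper: return (function, source_column) for one entry."""
    if isinstance(spec, tuple):
        func, source = spec      # explicit source column
        return func, source
    return spec, name            # bare function: source is the output name

# One group's worth of data (the five setosa rows)
columns = {'sepal_length': np.array([5.1, 4.9, 4.7, 5.0, 5.4])}

aggregations = {
    'sepal_length': np.mean,                    # same-name form
    'std_sepal_len': (np.std, 'sepal_length'),  # renamed output
    'count': (len, 'sepal_length'),
}

results = {}
for name, spec in aggregations.items():
    func, source = resolve_aggregation(name, spec)
    results[name] = func(columns[source])

for name, value in results.items():
    print(f'{name}: {value:.3f}')
```

Because several output names can point at the same source column, the tuple form is what allows both a mean and a standard deviation of sepal_length in one call.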
Iterating by species
iterate_by yields (keys, indices, sub_tafra) tuples -- useful when you
need full control over per-group processing:
for keys, indices, sub in iris.iterate_by(['species']):
    species = keys[0]
    sl = sub['sepal_length']
    print(f'{species}: n={sub.rows}, sepal_length range=[{sl.min():.1f}, {sl.max():.1f}]')
Output
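The shape of this iteration pattern can be sketched without Tafra. The generator iterate_groups below is a hypothetical stand-in that works on a plain dict of numpy arrays and a single group column. It is meant only to show what the key, the row indices, and the per-group subset each contain, not how Tafra produces them.

```python
import numpy as np

def iterate_groups(columns, by):
    """Hypothetical helper: yield (key, indices, sub_columns) for each group."""
    for key in np.unique(columns[by]):
        # Row positions belonging to this group
        indices = np.flatnonzero(columns[by] == key)
        # Slice every column down to those rows
        sub = {name: values[indices] for name, values in columns.items()}
        yield key, indices, sub

data = {
    'species': np.repeat(['setosa', 'versicolor', 'virginica'], 5),
    'sepal_length': np.array([5.1, 4.9, 4.7, 5.0, 5.4,
                              7.0, 6.4, 6.9, 5.5, 6.5,
                              6.3, 5.8, 7.1, 6.3, 6.5]),
}

ranges = {}
for key, indices, sub in iterate_groups(data, 'species'):
    sl = sub['sepal_length']
    ranges[str(key)] = (sl.min(), sl.max())
    print(f'{key}: n={len(indices)}, sepal_length range=[{sl.min():.1f}, {sl.max():.1f}]')
```

The indices are useful when results need to be written back into the original rows, for example to store a per-group normalized value.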
Multiple group columns
You can group by more than one column. Here we add a size category and group by both species and size:
iris['size'] = np.where(iris['sepal_length'] > 6.0, 'large', 'small')

multi = iris.group_by(
    ['species', 'size'],
    {
        'mean_petal_len': (np.mean, 'petal_length'),
        'count': (len, 'petal_length'),
    },
)

for i in range(multi.rows):
    print(f'{str(multi["species"][i]):>12s} {str(multi["size"][i]):>5s}'
          f' mean_petal={multi["mean_petal_len"][i]:.2f}'
          f' n={multi["count"][i]:.0f}')
Output
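Grouping by two columns means grouping by each distinct (species, size) pair that actually occurs in the data, not by every possible pair. The sketch below reproduces that with a plain dictionary keyed by tuples. It is an independent cross-check, not Tafra's implementation, and the sorted print order is a choice made here; the row order Tafra returns may differ.

```python
from collections import defaultdict

import numpy as np

species = np.repeat(['setosa', 'versicolor', 'virginica'], 5)
sepal_length = np.array([5.1, 4.9, 4.7, 5.0, 5.4,
                         7.0, 6.4, 6.9, 5.5, 6.5,
                         6.3, 5.8, 7.1, 6.3, 6.5])
petal_length = np.array([1.4, 1.4, 1.3, 1.4, 1.7,
                         4.7, 4.5, 4.9, 4.0, 4.6,
                         6.0, 5.1, 5.9, 5.6, 5.8])

# Same size rule as in the tutorial
size = np.where(sepal_length > 6.0, 'large', 'small')

# Collect petal lengths under each (species, size) pair that occurs
groups = defaultdict(list)
for sp, sz, pl in zip(species, size, petal_length):
    groups[(str(sp), str(sz))].append(float(pl))

for (sp, sz), values in sorted(groups.items()):
    print(f'{sp:>12s} {sz:>5s} mean_petal={np.mean(values):.2f} n={len(values)}')
```

With this subset, no setosa row has a sepal length above 6.0, so the (setosa, large) pair never appears and the result has five groups rather than six.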
Summary
This tutorial covered:
- Constructing a Tafra from a dict of numpy arrays
- Accessing columns (returns ndarray directly)
- Computing statistics with standard numpy functions
- group_by with same-name and renamed output columns
- iterate_by for per-group processing
- Grouping by multiple columns