Iris Dataset

This tutorial walks through a complete analysis of the classic Iris flower dataset using Tafra. We construct the data from inline arrays, explore columns, compute basic statistics, and use group_by to aggregate by species.

Load the data

We use a 15-row subset of the Iris dataset (5 per species) to keep output readable.

import numpy as np
from tafra import Tafra

iris = Tafra({
    'sepal_length': np.array([5.1, 4.9, 4.7, 5.0, 5.4,
                               7.0, 6.4, 6.9, 5.5, 6.5,
                               6.3, 5.8, 7.1, 6.3, 6.5]),
    'sepal_width':  np.array([3.5, 3.0, 3.2, 3.6, 3.9,
                               3.2, 3.2, 3.1, 2.3, 2.8,
                               3.3, 2.7, 3.0, 2.9, 3.0]),
    'petal_length': np.array([1.4, 1.4, 1.3, 1.4, 1.7,
                               4.7, 4.5, 4.9, 4.0, 4.6,
                               6.0, 5.1, 5.9, 5.6, 5.8]),
    'petal_width':  np.array([0.2, 0.2, 0.2, 0.2, 0.4,
                               1.4, 1.5, 1.5, 1.3, 1.5,
                               2.5, 1.9, 2.1, 1.8, 1.8]),
    'species':      np.array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa',
                               'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor',
                               'virginica', 'virginica', 'virginica', 'virginica', 'virginica']),
})

print(f'Rows: {iris.rows}')
print(f'Columns: {list(iris.columns)}')

Output

Rows: 15
Columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Data sample (first 3 rows per species)

sepal_length | sepal_width | petal_length | petal_width | species
-------------+-------------+--------------+-------------+-----------
       5.1   |       3.5   |        1.4   |       0.2   | setosa
       4.9   |       3.0   |        1.4   |       0.2   | setosa
       4.7   |       3.2   |        1.3   |       0.2   | setosa
       ...   |       ...   |        ...   |       ...   | ...
       7.0   |       3.2   |        4.7   |       1.4   | versicolor
       6.4   |       3.2   |        4.5   |       1.5   | versicolor
       6.9   |       3.1   |        4.9   |       1.5   | versicolor
       ...   |       ...   |        ...   |       ...   | ...
       6.3   |       3.3   |        6.0   |       2.5   | virginica
       5.8   |       2.7   |        5.1   |       1.9   | virginica
       7.1   |       3.0   |        5.9   |       2.1   | virginica

Column access

Accessing a column returns the underlying np.ndarray directly -- no wrapper objects, no copies:

lengths = iris['sepal_length']
print(type(lengths))  # <class 'numpy.ndarray'>
print(lengths[:5])    # [5.1 4.9 4.7 5.  5.4]

You can use any numpy operation on the result:

print(f'Mean sepal length: {np.mean(iris["sepal_length"]):.2f}')
print(f'Std sepal width:   {np.std(iris["sepal_width"]):.2f}')
print(f'Min petal length:  {np.min(iris["petal_length"]):.2f}')
print(f'Max petal width:   {np.max(iris["petal_width"]):.2f}')

Output

Mean sepal length: 5.96
Std sepal width:   0.37
Min petal length:  1.30
Max petal width:   2.50
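
One detail worth knowing when reading the standard deviation above: np.std defaults to ddof=0, which is the population standard deviation. As a cross-check that needs neither numpy nor Tafra, the sketch below recomputes two of the statistics with Python's statistics module. Here pstdev is the population form and stdev is the sample form (ddof=1 in numpy terms); the variable names are only illustrative.

```python
import statistics

# Same 15 values as the sepal_length and sepal_width columns above.
sepal_length = [5.1, 4.9, 4.7, 5.0, 5.4,
                7.0, 6.4, 6.9, 5.5, 6.5,
                6.3, 5.8, 7.1, 6.3, 6.5]
sepal_width = [3.5, 3.0, 3.2, 3.6, 3.9,
               3.2, 3.2, 3.1, 2.3, 2.8,
               3.3, 2.7, 3.0, 2.9, 3.0]

mean_sl = statistics.fmean(sepal_length)
pop_std_sw = statistics.pstdev(sepal_width)    # population form, like np.std(x)
sample_std_sw = statistics.stdev(sepal_width)  # sample form, like np.std(x, ddof=1)

print(f'Mean sepal length:       {mean_sl:.2f}')
print(f'Population std (ddof=0): {pop_std_sw:.2f}')
print(f'Sample std (ddof=1):     {sample_std_sw:.2f}')
```

If you need the sample standard deviation inside a Tafra workflow, pass ddof=1 to np.std explicitly.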

Group by species -- basic aggregation

group_by produces one row per unique value in the group columns. The aggregation dict maps output column names to functions. When the output name matches an existing column, the function is applied to that column:

summary = iris.group_by(
    ['species'],
    {
        'sepal_length': np.mean,
        'sepal_width': np.mean,
        'petal_length': np.mean,
        'petal_width': np.mean,
    },
)

for col in summary.columns:
    vals = summary[col]
    if vals.dtype.kind == 'f':
        print(f'{col:14s}  {vals[0]:6.2f}  {vals[1]:6.2f}  {vals[2]:6.2f}')
    else:
        print(f'{col:14s}  {str(vals[0]):>6s}  {str(vals[1]):>6s}  {str(vals[2]):>6s}')

Output

species         setosa  versicolor  virginica
sepal_length      5.02    6.46    6.40
sepal_width       3.44    2.92    2.98
petal_length      1.44    4.54    5.68
petal_width       0.24    1.44    2.02

Mean measurements by species

              | setosa | versicolor | virginica
--------------+--------+------------+----------
sepal_length  |   5.02 |       6.46 |      6.40
sepal_width   |   3.44 |       2.92 |      2.98
petal_length  |   1.44 |       4.54 |      5.68
petal_width   |   0.24 |       1.44 |      2.02
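
To make the semantics concrete, here is a rough plain-Python sketch of what a single-key group-by with a mean amounts to: collect the row indices for each unique key, then apply the function to that group's slice of every aggregated column. This is not Tafra's implementation. The helper name group_mean is made up for illustration, the sketch keeps keys in order of first appearance as its own choice, and only two columns are included to keep it short.

```python
from statistics import fmean

species = ['setosa'] * 5 + ['versicolor'] * 5 + ['virginica'] * 5
columns = {
    'sepal_length': [5.1, 4.9, 4.7, 5.0, 5.4,
                     7.0, 6.4, 6.9, 5.5, 6.5,
                     6.3, 5.8, 7.1, 6.3, 6.5],
    'petal_length': [1.4, 1.4, 1.3, 1.4, 1.7,
                     4.7, 4.5, 4.9, 4.0, 4.6,
                     6.0, 5.1, 5.9, 5.6, 5.8],
}


def group_mean(keys, cols):
    """Hypothetical helper: mean of each column per unique key."""
    # Map each key to the row indices where it occurs (dicts keep insertion order).
    groups = {}
    for i, key in enumerate(keys):
        groups.setdefault(key, []).append(i)
    # One output row per key: average that group's values in every column.
    return {
        key: {name: fmean(values[i] for i in idx) for name, values in cols.items()}
        for key, idx in groups.items()
    }


summary = group_mean(species, columns)
for key, row in summary.items():
    print(f'{key:>10s}  ' + '  '.join(f'{name}={val:.2f}' for name, val in row.items()))
```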

Mean Measurements by Species

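Species    | Sepal Length | Sepal Width | Petal Length | Petal Width
-----------+--------------+-------------+--------------+------------
setosa     |         5.02 |        3.44 |         1.44 |        0.24
versicolor |         6.46 |        2.92 |         4.54 |        1.44
virginica  |         6.40 |        2.98 |         5.68 |        2.02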

Mean Sepal Length by Species (cm): setosa 5.02, versicolor 6.46, virginica 6.40

Mean Petal Length by Species (cm): setosa 1.44, versicolor 4.54, virginica 5.68

Renaming output columns

Use the (function, source_column) tuple form to give aggregated columns new names:

stats = iris.group_by(
    ['species'],
    {
        'mean_sepal_len': (np.mean, 'sepal_length'),
        'std_sepal_len':  (np.std, 'sepal_length'),
        'mean_petal_len': (np.mean, 'petal_length'),
        'std_petal_len':  (np.std, 'petal_length'),
        'count':          (len, 'sepal_length'),
    },
)

print(f'Columns: {list(stats.columns)}')
print(f'Rows:    {stats.rows}')
print()
print(f'{"species":>12s}  {"mean_sl":>8s}  {"std_sl":>8s}  {"mean_pl":>8s}  {"std_pl":>8s}  {"n":>3s}')
for i in range(stats.rows):
    print(f'{str(stats["species"][i]):>12s}'
          f'  {stats["mean_sepal_len"][i]:8.2f}'
          f'  {stats["std_sepal_len"][i]:8.3f}'
          f'  {stats["mean_petal_len"][i]:8.2f}'
          f'  {stats["std_petal_len"][i]:8.3f}'
          f'  {stats["count"][i]:3.0f}')

Output

Columns: ['species', 'mean_sepal_len', 'std_sepal_len', 'mean_petal_len', 'std_petal_len', 'count']
Rows:    3

     species   mean_sl    std_sl   mean_pl    std_pl    n
      setosa      5.02     0.232      1.44     0.136    5
  versicolor      6.46     0.531      4.54     0.301    5
   virginica      6.40     0.420      5.68     0.319    5
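
The two spellings of an aggregation entry can be read as one rule: a bare function reads from the column named by the key, while a (function, source_column) tuple reads from source_column and writes to the key. The sketch below spells that rule out in plain Python for a single group (the five setosa rows). The helper resolve_spec is hypothetical and not part of Tafra, and statistics.pstdev stands in for np.std with its default ddof=0.

```python
from statistics import fmean, pstdev

# The five setosa rows from the data above.
setosa = {
    'sepal_length': [5.1, 4.9, 4.7, 5.0, 5.4],
    'petal_length': [1.4, 1.4, 1.3, 1.4, 1.7],
}


def resolve_spec(aggs):
    """Hypothetical helper: normalize entries to (output_name, function, source_column)."""
    resolved = []
    for out_name, spec in aggs.items():
        if isinstance(spec, tuple):
            fn, source = spec            # tuple form: read from the named source column
        else:
            fn, source = spec, out_name  # same-name form: the key is also the source
        resolved.append((out_name, fn, source))
    return resolved


aggs = {
    'petal_length': fmean,                      # same-name form
    'mean_sepal_len': (fmean, 'sepal_length'),  # tuple form, renamed output
    'std_sepal_len': (pstdev, 'sepal_length'),
    'count': (len, 'sepal_length'),
}

# Apply every resolved entry to this one group to build its output row.
row = {out: fn(setosa[src]) for out, fn, src in resolve_spec(aggs)}
print(row)
```

Because several outputs can point at the same source column, the tuple form is also the way to compute more than one statistic per column in a single pass.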

Iterating by species

iterate_by yields (keys, indices, sub_tafra) tuples -- useful when you need full control over per-group processing:

for keys, indices, sub in iris.iterate_by(['species']):
    species = keys[0]
    sl = sub['sepal_length']
    print(f'{species}: n={sub.rows}, sepal_length range=[{sl.min():.1f}, {sl.max():.1f}]')

Output

setosa: n=5, sepal_length range=[4.7, 5.4]
versicolor: n=5, sepal_length range=[5.5, 7.0]
virginica: n=5, sepal_length range=[5.8, 7.1]
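
The indices element is what lets per-group results be written back into full-length arrays. As an illustration of that pattern without depending on Tafra, the sketch below uses a hypothetical generator, iterate_groups, that yields (key, indices, values) for one column, then uses the indices to center each sepal length on its species mean.

```python
from statistics import fmean

species = ['setosa'] * 5 + ['versicolor'] * 5 + ['virginica'] * 5
sepal_length = [5.1, 4.9, 4.7, 5.0, 5.4,
                7.0, 6.4, 6.9, 5.5, 6.5,
                6.3, 5.8, 7.1, 6.3, 6.5]


def iterate_groups(keys, values):
    """Hypothetical stand-in for a per-group iterator: yields (key, indices, sub_values)."""
    groups = {}
    for i, key in enumerate(keys):
        groups.setdefault(key, []).append(i)
    for key, idx in groups.items():
        yield key, idx, [values[i] for i in idx]


# Subtract each group's mean and write the result back to the original row positions.
centered = [0.0] * len(sepal_length)
for key, idx, sub in iterate_groups(species, sepal_length):
    group_mean = fmean(sub)
    for i, v in zip(idx, sub):
        centered[i] = v - group_mean

print([round(c, 2) for c in centered[:5]])
```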

Multiple group columns

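You can group by more than one column. Here we add a size category ('large' when sepal_length exceeds 6.0, otherwise 'small') and group by both species and size. A species with rows on both sides of the cutoff produces one output row per size: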

iris['size'] = np.where(iris['sepal_length'] > 6.0, 'large', 'small')

multi = iris.group_by(
    ['species', 'size'],
    {
        'mean_petal_len': (np.mean, 'petal_length'),
        'count': (len, 'petal_length'),
    },
)

for i in range(multi.rows):
    print(f'{str(multi["species"][i]):>12s}  {str(multi["size"][i]):>5s}'
          f'  mean_petal={multi["mean_petal_len"][i]:.2f}'
          f'  n={multi["count"][i]:.0f}')

Output

      setosa  small  mean_petal=1.44  n=5
  versicolor  large  mean_petal=4.67  n=4
  versicolor  small  mean_petal=4.00  n=1
   virginica  large  mean_petal=5.83  n=4
   virginica  small  mean_petal=5.10  n=1
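
With several group columns, each group is identified by the tuple of values across those columns. A quick way to see which combinations occur in this data, sketched in plain Python with collections.Counter rather than Tafra:

```python
from collections import Counter

species = ['setosa'] * 5 + ['versicolor'] * 5 + ['virginica'] * 5
sepal_length = [5.1, 4.9, 4.7, 5.0, 5.4,
                7.0, 6.4, 6.9, 5.5, 6.5,
                6.3, 5.8, 7.1, 6.3, 6.5]

# Same rule as the np.where call above: strictly greater than 6.0 is 'large'.
size = ['large' if x > 6.0 else 'small' for x in sepal_length]

# Each distinct (species, size) pair is one group.
counts = Counter(zip(species, size))
for (sp, sz), n in sorted(counts.items()):
    print(f'{sp:>12s}  {sz:>5s}  n={n}')
```

Counting the key tuples first is a cheap sanity check on how many rows a multi-column aggregation should return.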

Summary

This tutorial covered:

Constructing a Tafra from a dict of numpy arrays
Accessing columns (returns ndarray directly)
Computing statistics with standard numpy functions
group_by with same-name and renamed output columns
iterate_by for per-group processing
Grouping by multiple columns