Titanic Analysis

This tutorial demonstrates survival analysis on a Titanic-style dataset. We cover missing-value imputation with per-class medians, grouped aggregation with group_by, transform to broadcast group statistics back to every row, partition for parallel dispatch, and iterate_by for custom per-group logic.

Build the dataset

We construct an inline dataset with 20 passengers. Four age values are missing and are stored as np.nan.

import numpy as np
from tafra import Tafra

titanic = Tafra({
    'pclass':   np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]),
    'sex':      np.array(['male', 'female', 'female', 'male', 'male', 'female',
                           'male', 'female', 'male', 'female', 'male', 'male', 'female',
                           'male', 'male', 'female', 'male', 'male', 'female', 'male']),
    'age':      np.array([38.0, 26.0, 35.0, 54.0, np.nan, 58.0,
                           34.0, 28.0, np.nan, 14.0, 36.0, 45.0, 30.0,
                           22.0, np.nan, np.nan, 19.0, 32.0, 16.0, 25.0]),
    'survived': np.array([0, 1, 1, 0, 0, 1,
                           0, 1, 0, 1, 0, 0, 1,
                           0, 0, 1, 0, 0, 1, 0]),
    'fare':     np.array([71.3, 71.3, 53.1, 51.9, 30.5, 26.6,
                           13.0, 13.0, 13.0, 13.0, 26.0, 26.0, 13.0,
                           7.9, 8.1, 7.7, 7.9, 7.9, 7.7, 8.1]),
})

print(f'Rows: {titanic.rows}')
print(f'Missing ages: {np.sum(np.isnan(titanic["age"]))}')

Output

Rows: 20
Missing ages: 4

Data sample (first 10 rows)

pclass | sex    |  age  | survived |  fare
-------+--------+-------+----------+------
     1 | male   | 38.0  |        0 | 71.3
     1 | female | 26.0  |        1 | 71.3
     1 | female | 35.0  |        1 | 53.1
     1 | male   | 54.0  |        0 | 51.9
     1 | male   |  NaN  |        0 | 30.5
     1 | female | 58.0  |        1 | 26.6
     2 | male   | 34.0  |        0 | 13.0
     2 | female | 28.0  |        1 | 13.0
     2 | male   |  NaN  |        0 | 13.0
     2 | female | 14.0  |        1 | 13.0
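
A side note on how missing values reach NumPy: the dataset above writes np.nan directly. If a raw source used None instead, NumPy would only convert it to nan when a float dtype is requested. The following is a minimal sketch in plain NumPy; raw_ages and the other variable names are illustrative and not part of tafra.

```python
import numpy as np

raw_ages = [38.0, None, 26.0]  # hypothetical raw values with one missing entry

# Without an explicit dtype, NumPy falls back to an object array and keeps None.
as_object = np.array(raw_ages)
print(as_object.dtype)  # object

# With dtype=float, None is converted to nan, so np.isnan can detect it.
as_float = np.array(raw_ages, dtype=float)
print(np.isnan(as_float))  # [False  True False]
```

Requesting a float dtype up front keeps the column numeric, which is what nan-aware reductions such as np.nanmedian expect.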

Handle missing values

Replace nan ages with the class-specific median age. We use transform with np.nanmedian, which skips nan entries, to compute the per-class median and broadcast it to every row, then fill the missing values:

class_medians = titanic.transform(
    ['pclass'],
    {
        'median_age': (np.nanmedian, 'age'),
    },
)

# class_medians has the same 20 rows as titanic, with median_age filled per class
missing = np.isnan(titanic['age'])
titanic['age'][missing] = class_medians['median_age'][missing]

print(f'Missing ages after fill: {np.sum(np.isnan(titanic["age"]))}')
print(f'Filled ages: {titanic["age"][np.array([4, 8, 14, 15])]}')

Output

Missing ages after fill: 0
Filled ages: [38. 32. 22. 22.]

The missing values were filled with their class median: 38.0 for class 1, 32.0 for class 2 (the midpoint of 30.0 and 34.0, because six known ages remain), and 22.0 for class 3.
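
The per-class medians can be cross-checked without tafra. The sketch below repeats the imputation in plain NumPy with boolean masks; it is an independent check under the same data, not a description of how transform works internally, and the names filled and medians are illustrative.

```python
import numpy as np

pclass = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3])
age = np.array([38.0, 26.0, 35.0, 54.0, np.nan, 58.0,
                34.0, 28.0, np.nan, 14.0, 36.0, 45.0, 30.0,
                22.0, np.nan, np.nan, 19.0, 32.0, 16.0, 25.0])

filled = age.copy()
medians = {}
for cls in np.unique(pclass):
    in_class = pclass == cls
    # nanmedian ignores the missing entries within this class.
    medians[int(cls)] = float(np.nanmedian(age[in_class]))
    # Fill only the missing entries that belong to this class.
    filled[in_class & np.isnan(age)] = medians[int(cls)]

print(medians)  # {1: 38.0, 2: 32.0, 3: 22.0}
```

Class 2 has an even number of known ages (14, 28, 30, 34, 36, 45), so its median is the average of the two middle values.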

Survival rate by class

Use group_by with np.mean on the survived column (0/1 encoding) to compute survival rates:

by_class = titanic.group_by(
    ['pclass'],
    {
        'survival_rate': (np.mean, 'survived'),
        'mean_age':      (np.mean, 'age'),
        'mean_fare':     (np.mean, 'fare'),
        'count':         (len, 'survived'),
    },
)

for i in range(by_class.rows):
    print(f'Class {by_class["pclass"][i]}:'
          f'  survival={by_class["survival_rate"][i]:.2%}'
          f'  age={by_class["mean_age"][i]:.1f}'
          f'  fare={by_class["mean_fare"][i]:.1f}'
          f'  n={by_class["count"][i]:.0f}')

Output

Class 1:  survival=50.00%  age=41.5  fare=50.8  n=6
Class 2:  survival=42.86%  age=31.3  fare=16.7  n=7
Class 3:  survival=28.57%  age=22.6  fare=7.9  n=7

Survival by class -- table

pclass | survival_rate | mean_age | mean_fare | count
-------+---------------+----------+-----------+------
     1 |        50.00% |     41.5 |      50.8 |     6
     2 |        42.86% |     31.3 |      16.7 |     7
     3 |        28.57% |     22.6 |       7.9 |     7

[Bar chart: Survival Rate by Passenger Class (%), one bar each for Class 1, Class 2, and Class 3]
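
To make the grouped mean concrete, here is a minimal plain-NumPy sketch of the same survival-rate computation. It uses np.unique and np.bincount as a stand-in for grouping; this is an assumption chosen for illustration and says nothing about how group_by is implemented.

```python
import numpy as np

pclass = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3])
survived = np.array([0, 1, 1, 0, 0, 1,
                     0, 1, 0, 1, 0, 0, 1,
                     0, 0, 1, 0, 0, 1, 0])

# inverse maps every row to the index of its class in `classes`.
classes, inverse, counts = np.unique(pclass, return_inverse=True, return_counts=True)

# bincount with weights sums `survived` within each class.
survivors = np.bincount(inverse, weights=survived)
rates = survivors / counts

for cls, rate, n in zip(classes, rates, counts):
    print(f'Class {cls}: survival={rate:.2%}  n={n}')
```

Because survived is encoded as 0/1, the group mean is simply the number of survivors divided by the group size.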

Survival rate by class and sex

Group by two columns to get a cross-tabulation:

cross = titanic.group_by(
    ['pclass', 'sex'],
    {
        'survival_rate': (np.mean, 'survived'),
        'count':         (len, 'survived'),
    },
)

for i in range(cross.rows):
    print(f'Class {cross["pclass"][i]} / {str(cross["sex"][i]):>6s}:'
          f'  survival={cross["survival_rate"][i]:.0%}'
          f'  n={cross["count"][i]:.0f}')

Output

Class 1 /   male:  survival=0%  n=3
Class 1 / female:  survival=100%  n=3
Class 2 /   male:  survival=0%  n=4
Class 2 / female:  survival=100%  n=3
Class 3 /   male:  survival=0%  n=5
Class 3 / female:  survival=100%  n=2

Cross-tabulation: survival rate by class and sex

pclass | sex    | survival_rate | count
-------+--------+---------------+------
     1 | male   |            0% |     3
     1 | female |          100% |     3
     2 | male   |            0% |     4
     2 | female |          100% |     3
     3 | male   |            0% |     5
     3 | female |          100% |     2

[Bar chart: Survival Rate by Class and Sex (%), bars for 1/M, 1/F, 2/M, 2/F, 3/M, 3/F]
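
Grouping on two columns amounts to grouping on the tuple of both values. The sketch below shows that idea with the standard library only; the groups and cross_check names are illustrative, and the approach is a conceptual stand-in rather than tafra's implementation.

```python
from collections import defaultdict

pclass = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]
sex = ['male', 'female', 'female', 'male', 'male', 'female',
       'male', 'female', 'male', 'female', 'male', 'male', 'female',
       'male', 'male', 'female', 'male', 'male', 'female', 'male']
survived = [0, 1, 1, 0, 0, 1,
            0, 1, 0, 1, 0, 0, 1,
            0, 0, 1, 0, 0, 1, 0]

# Collect the survival flags under a composite (pclass, sex) key.
groups = defaultdict(list)
for key, outcome in zip(zip(pclass, sex), survived):
    groups[key].append(outcome)

# Map each key to (survival rate, group size).
cross_check = {key: (sum(vals) / len(vals), len(vals)) for key, vals in groups.items()}

print(cross_check[(2, 'male')])    # (0.0, 4)
print(cross_check[(3, 'female')])  # (1.0, 2)
```

In this small sample sex separates survivors perfectly within every class, which is why each cell is either 0% or 100%.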

Transform: add group statistics to rows

transform computes group-level aggregates and broadcasts them back to every row, so the output has the same row count as the input. This is useful for computing relative metrics (e.g. "how does this passenger's fare compare to the class average?"):

enriched = titanic.transform(
    ['pclass'],
    {
        'class_mean_fare': (np.mean, 'fare'),
        'class_survival':  (np.mean, 'survived'),
    },
)

# enriched has 20 rows, same as titanic
print(f'Rows: {enriched.rows}')

# Compute fare ratio for first 5 passengers
fare_ratio = titanic['fare'][:5] / enriched['class_mean_fare'][:5]
print(f'Fare ratios (first 5): {np.round(fare_ratio, 2)}')

Output

Rows: 20
Fare ratios (first 5): [1.4  1.4  1.05 1.02 0.6 ]
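
The broadcast step can be sketched in plain NumPy: compute one mean per class, then index that short array with each row's group index to get one value per row. This is an illustrative equivalent under the same fare data, not tafra's internal code.

```python
import numpy as np

pclass = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3])
fare = np.array([71.3, 71.3, 53.1, 51.9, 30.5, 26.6,
                 13.0, 13.0, 13.0, 13.0, 26.0, 26.0, 13.0,
                 7.9, 8.1, 7.7, 7.9, 7.9, 7.7, 8.1])

classes, inverse = np.unique(pclass, return_inverse=True)

# One mean per class: sum of fares divided by the number of rows in the class.
group_mean = np.bincount(inverse, weights=fare) / np.bincount(inverse)

# Broadcast back: every row picks up the mean of its own class.
class_mean_fare = group_mean[inverse]
print(class_mean_fare.shape)  # (20,)

fare_ratio = fare / class_mean_fare
print(np.round(fare_ratio[:5], 2))
```

A ratio above 1 means the passenger paid more than the average for their class; the fifth passenger, at about 0.6, paid well below the class 1 average.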

Partition by class for parallel analysis

partition splits the data into sub-Tafras by group, preserving all rows. This is designed for dispatching to multiprocessing.Pool.map():

parts = titanic.partition(['pclass'])

for key, sub in parts:
    surv = np.mean(sub['survived'])
    print(f'Class {key[0]}: {sub.rows} rows, survival={surv:.2%}')

Output

Class 1: 6 rows, survival=50.00%
Class 2: 7 rows, survival=42.86%
Class 3: 7 rows, survival=28.57%

In a real pipeline you would pass these partitions to worker processes:

from multiprocessing import Pool

def analyze_class(args):
    key, sub = args
    return Tafra({
        'pclass':        np.array([key[0]]),
        'survival_rate': np.array([np.mean(sub['survived'])]),
        'mean_fare':     np.array([np.mean(sub['fare'])]),
    })

# Serial execution here for demonstration. With a Pool, the list
# comprehension would be replaced by pool.map(analyze_class, parts).
results = [analyze_class(part) for part in parts]
combined = Tafra.concat(results)

for i in range(combined.rows):
    print(f'Class {combined["pclass"][i]}: '
          f'survival={combined["survival_rate"][i]:.2%}, '
          f'fare={combined["mean_fare"][i]:.1f}')

Output

Class 1: survival=50.00%, fare=50.8
Class 2: survival=42.86%, fare=16.7
Class 3: survival=28.57%, fare=7.9
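
The map-style dispatch itself can be sketched without tafra. The example below uses multiprocessing.pool.ThreadPool, which exposes the same map() interface as multiprocessing.Pool, so that it runs anywhere without pickling concerns. The parts list is a hypothetical stand-in for the (key, sub) pairs, holding plain lists of survived flags instead of sub-Tafras.

```python
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for partition output: ((pclass,), survived flags).
parts = [
    ((1,), [0, 1, 1, 0, 0, 1]),
    ((2,), [0, 1, 0, 1, 0, 0, 1]),
    ((3,), [0, 0, 1, 0, 0, 1, 0]),
]

def survival_rate(part):
    """Return (class, survival rate) for one partition."""
    key, survived = part
    return key[0], sum(survived) / len(survived)

# map() hands one partition to each worker and returns results in input order.
with ThreadPool(processes=3) as pool:
    results = pool.map(survival_rate, parts)

for cls, rate in results:
    print(f'Class {cls}: survival={rate:.2%}')
```

With a process-based Pool, the worker function must be defined at module level so it can be pickled, and on platforms that spawn new interpreters the pool should be created under an if __name__ == '__main__': guard.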

iterate_by for custom per-group logic

When you need full control over what happens per group, use iterate_by:

for keys, indices, sub in titanic.iterate_by(['pclass']):
    survivors = sub['age'][sub['survived'] == 1]
    non_surv  = sub['age'][sub['survived'] == 0]
    print(f'Class {keys[0]}:')
    print(f'  Survivor mean age:     {np.mean(survivors):.1f} (n={len(survivors)})')
    print(f'  Non-survivor mean age: {np.mean(non_surv):.1f} (n={len(non_surv)})')

Output

Class 1:
  Survivor mean age:     39.7 (n=3)
  Non-survivor mean age: 43.3 (n=3)
Class 2:
  Survivor mean age:     24.0 (n=3)
  Non-survivor mean age: 36.8 (n=4)
Class 3:
  Survivor mean age:     19.0 (n=2)
  Non-survivor mean age: 24.0 (n=5)

[Bar chart: Mean Age: Survivors vs Non-Survivors by Class, bars for survivors and non-survivors in Class 1, Class 2, and Class 3]
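
The per-group split into survivors and non-survivors can be cross-checked with boolean masks in plain NumPy. The sketch assumes the ages have already been imputed with the np.nanmedian value of each class (38.0, 32.0, and 22.0); the means dictionary is an illustrative name.

```python
import numpy as np

pclass = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3])
# Ages after imputation: rows 4, 8, 14, 15 hold their class median (assumed).
age = np.array([38.0, 26.0, 35.0, 54.0, 38.0, 58.0,
                34.0, 28.0, 32.0, 14.0, 36.0, 45.0, 30.0,
                22.0, 22.0, 22.0, 19.0, 32.0, 16.0, 25.0])
survived = np.array([0, 1, 1, 0, 0, 1,
                     0, 1, 0, 1, 0, 0, 1,
                     0, 0, 1, 0, 0, 1, 0])

means = {}
for cls in np.unique(pclass):
    in_class = pclass == cls
    # Combine the class mask with the outcome mask to select each subgroup.
    surv_age = age[in_class & (survived == 1)]
    non_age = age[in_class & (survived == 0)]
    means[int(cls)] = (float(surv_age.mean()), float(non_age.mean()))
    print(f'Class {cls}: survivors {surv_age.mean():.1f}, '
          f'non-survivors {non_age.mean():.1f}')
```

In every class the survivors are younger on average than the non-survivors, although with groups this small the gap should not be over-interpreted.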

Summary

This tutorial covered:

Building a dataset with missing values (nan)
Using transform to impute missing values with per-group medians
group_by for survival rate computation
Cross-tabulation by grouping on multiple columns
transform to broadcast group statistics back to every row
partition for parallel dispatch
iterate_by for custom per-group logic