Random Forests - OneR Classifier (Part 1)

Building a OneR classifier model from scratch

machinelearning

Author: Tony Phung
Published: April 24, 2024

1. Introduction

In order to build Random Forests, we need to build Decision Trees.
In order to build Decision Trees, we need to build Binary Splits.

This post shows how to find the best binary split per column, also known as a OneR classifier.

2. Data Cleaning

from fastai.imports import *
import torch, numpy as np, pandas as pd
import kaggle, zipfile
from pathlib import Path
path = Path("titanic")
if not path.exists():
    print(f"{path} folder doesn't exist, downloading...")
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f"{path}.zip").extractall(path)
else:
    print(f"{path} exists!")
!ls {path}
titanic exists!
gender_submission.csv  test.csv  train.csv
def proc_data_1(df):
    modes           = df.mode().iloc[0]            # most frequent value of each column
    df['Fare']      = df.Fare.fillna(0)            # missing fares become 0
    df.fillna(modes, inplace=True)                 # fill remaining NAs with the column modes
    df['LogFare']   = np.log1p(df['Fare'])         # log(1+Fare) tames the skewed fare distribution
    df['Embarked']  = pd.Categorical(df.Embarked)  # convert to categorical dtype
    df['Sex']       = pd.Categorical(df.Sex)

def convert_cats_to_codes_2(trn_df, val_df, cat_list):
    trn_df[cat_list] = trn_df[cat_list].apply(lambda dfcol: dfcol.cat.codes) # replace categories with integer codes
    val_df[cat_list] = val_df[cat_list].apply(lambda dfcol: dfcol.cat.codes)
    return trn_df, val_df
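As a quick illustration (not part of the original notebook) of what .cat.codes produces: categories are coded alphabetically, so Sex becomes 0/1, while a column like Embarked with three categories gets codes 0, 1 and 2.

sex      = pd.Categorical(["male", "female", "female"])
embarked = pd.Categorical(["S", "C", "Q", "S"])
print(sex.codes)       # [1 0 0]   -- 'female' -> 0, 'male' -> 1
print(embarked.codes)  # [2 0 1 2] -- 'C' -> 0, 'Q' -> 1, 'S' -> 2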
from numpy import random
from sklearn.model_selection import train_test_split
random.seed(42)

# 0. get raw data
df              = pd.read_csv(path/'train.csv')
tst_df          = pd.read_csv(path/'test.csv')
# 1. clean data ([replace nas with mode], [logfare], [sex/embarked to cat])
proc_data_1(df)
proc_data_1(tst_df)

# 2. split training data: training and validation set
trn_df,val_df   = train_test_split(df, test_size=0.25)

# 3. convert cats to codes
cat_list        = ["Sex","Embarked"]
trn_df, val_df  = convert_cats_to_codes_2(trn_df, val_df, cat_list)

# 4. get independent (idep) and dependent (dep) variables
dep_col         = "Survived"
cont_list       = ['Age', 'SibSp', 'Parch', 'LogFare',"Pclass"]
def get_trn_and_val_idep_dep(df):
    idep    = df[ cat_list + cont_list ].copy()
    dep     = df[dep_col]
    return idep, dep

trn_idep,trn_dep = get_trn_and_val_idep_dep(trn_df)
val_idep,val_dep = get_trn_and_val_idep_dep(val_df)
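As a quick sanity check (not in the original notebook), the shapes should reflect the 75/25 split:

# expect roughly a 75/25 row split between training and validation
print(trn_idep.shape, val_idep.shape)
print(trn_dep.shape, val_dep.shape)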

3. Binary Splits

A binary split places every row into one of two groups, based on whether its value in some column is above or below a chosen threshold.
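For example, here is a minimal sketch (using the training data prepared above) of a binary split on the Sex column, comparing the survival rate on each side:

splits = trn_idep["Sex"] <= 0   # True = female (code 0), False = male
print(trn_dep[splits].mean())   # survival rate among females
print(trn_dep[~splits].mean())  # survival rate among males

A good split is one where the two groups differ as much as possible on the dependent variable.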

4. 1R Classifier model

In layman's terms:
1. Get all unique values of each independent variable (column).
2. Split the rows on each value, i.e. a binary split.
3. Make predictions on survivability using the above split.
4. Calculate the standard deviation of the dependent variable within each side of the split, weight each by group size, and add them (see the sketch after this list).
5. A high standard deviation means a bad split, since both survivors and non-survivors are mixed within each group; a good split results in low variability.
6. Find the split point with the lowest standard deviation for each column.
7. The best of these splits is the 1R model.
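Here is a toy numeric sketch of steps 4 and 5, the size-weighted standard-deviation score, using hypothetical values rather than the Titanic data:

y     = np.array([1, 1, 1, 0, 0, 0, 0, 1])  # 1 = survived, 0 = perished
split = np.array([True]*4 + [False]*4)      # a candidate binary split

# std dev of each side, weighted by its size, normalised by total rows
# (NumPy's .std() uses ddof=0; the pandas .std() used below uses ddof=1)
def side(mask): return y[mask].std() * mask.sum()
print((side(split) + side(~split)) / len(y))

A lower score means each side is purer, i.e. closer to all-survived or all-perished.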

5. Code

def _side_score(side, y):
    tot = side.sum()            # number of rows on this side of the split
    if tot<=1: return 0         # a side with 0 or 1 rows has no spread to measure
    return y[side].std()*tot    # std dev of the dependent, weighted by group size

def score(idep_col, dep, split_val):
    lhs_bool_list = idep_col <= split_val # True for rows at or below the split value
    return (_side_score(lhs_bool_list, dep) + _side_score(~lhs_bool_list, dep)) / len(dep)

def min_col(df, idep_col_name):
    idep_col    = df[idep_col_name]
    dep         = df[dep_col]

    col_uniques = idep_col.dropna().unique() # all unique non-NaN values of the idep col

    scores = np.array( # get the score for each unique split value
        [score(idep_col, dep, col_val)
         for col_val in col_uniques])
    
    idx = scores.argmin() # get index of min score
    return col_uniques[idx],scores[idx]
all_cols = cat_list+cont_list 
{col:min_col(trn_df, col) for col in all_cols}
{'Sex': (0, 0.40787530982063946),
 'Embarked': (0, 0.47883342573147836),
 'Age': (6.0, 0.478316717508991),
 'SibSp': (4, 0.4783740258817434),
 'Parch': (0, 0.4805296527841601),
 'LogFare': (2.4390808375825834, 0.4620823937736597),
 'Pclass': (2, 0.46048261885806596)}

6. The Best Binary-Split

Thus, Sex <= 0 is the best single binary split, since it has the lowest score of all the columns.
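As a quick check (a minimal sketch, not part of the original notebook), we can treat this rule as a classifier and score it on the validation set: predict survived for Sex <= 0 (female), perished otherwise.

preds    = (val_idep["Sex"] <= 0).astype(int)  # 1 = predict survived (female), 0 = perished
accuracy = (preds == val_dep).mean()
print(f"validation accuracy: {accuracy:.3f}")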