Random Forests - Random Forest Classifier (Part 3)

Building a Random Forest Classifier with the Sklearn framework
machinelearning
Author

Tony Phung

Published

April 26, 2024

1. Introduction

In Part 1, a simple model was built using a single binary split, called a OneR classifier.

In Part 2, sklearn's DecisionTreeClassifier was used, and the loss was reduced by setting a sample limit per node.

In this post:

  • Create a lot of bigger trees
  • Take the average of their predictions: this averaged ensemble of trees, known as bagging, is a random forest
  • Compare the results with sklearn’s RandomForestClassifier

In the next few posts, the topics will be:

  • Feature Importance Plot
  • Gradient Boosted Decision Trees, also known as Gradient Boosting Machines (a sum of trees rather than an average)

2. Training and Validation Sets

from fastai.imports import *
import torch, numpy as np, pandas as pd
import kaggle, zipfile
from pathlib import Path
path = Path("titanic")
if not path.exists():
    print(f"{path} folder doesn't exist, downloading...")
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f"{path}.zip").extractall(path)
else:
    print(f"{path} exists!")
!ls {path}

def proc_data_1(df):
    modes           = df.mode().iloc[0]
    df['Fare']      = df.Fare.fillna(0)           # missing fares become 0
    df.fillna(modes, inplace=True)                # fill remaining NAs with each column's mode
    df['LogFare']   = np.log1p(df['Fare'])        # log(1+Fare), which handles Fare=0 cleanly
    df['Embarked']  = pd.Categorical(df.Embarked)
    df['Sex']       = pd.Categorical(df.Sex)

def convert_cats_to_codes_2(trn_df, val_df, cat_list):
    trn_df[cat_list] = trn_df[cat_list].apply(lambda dfcol: dfcol.cat.codes) # replace categories with integer codes
    val_df[cat_list] = val_df[cat_list].apply(lambda dfcol: dfcol.cat.codes)
    return trn_df, val_df
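
As a quick illustration of what .cat.codes produces (a minimal sketch; pandas codes categories in sorted order, so Sex yields 0/1 while Embarked yields three codes):

pd.Categorical(["male", "female", "male"]).codes  # [1, 0, 1]: 'female'=0, 'male'=1
pd.Categorical(["S", "C", "Q"]).codes             # [2, 0, 1]: 'C'=0, 'Q'=1, 'S'=2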

from numpy import random
from sklearn.model_selection import train_test_split
random.seed(42)

# 0 get raw data
df              = pd.read_csv(path/'train.csv')
tst_df          = pd.read_csv(path/'test.csv')
# 1. clean data ([replace nas with mode], [logfare], [sex/embarked to cat])
proc_data_1(df)
proc_data_1(tst_df)

# 2. split training data: training and validation set
trn_df,val_df   = train_test_split(df, test_size=0.25)

# 3. convert cats to codes
cat_list        = ["Sex","Embarked"]
trn_df, val_df  = convert_cats_to_codes_2(trn_df, val_df, cat_list)

# 4. get idep and deps
dep_col         = "Survived"
cont_list       = ['Age', 'SibSp', 'Parch', 'LogFare',"Pclass"]
def get_trn_and_val_idep_dep(df):
    idep    = df[ cat_list + cont_list ].copy()
    dep     = df[dep_col]
    return idep, dep

trn_idep,trn_dep = get_trn_and_val_idep_dep(trn_df)
val_idep,val_dep = get_trn_and_val_idep_dep(val_df)
titanic folder doesn't exist, downloading...
Downloading titanic.zip to /home/tonydevs/github/blog/posts/2024-04-26-random_forest
100%|██████████| 34.1k/34.1k [00:00<00:00, 55.8kB/s]

gender_submission.csv  test.csv  train.csv

3. Using DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier, export_graphviz

def get_tree(prop=0.75):
    "Fit a DecisionTreeClassifier on a random sample of `prop` of the training rows."
    n = len(trn_dep)
    # np.random.choice samples with replacement by default
    idxs = random.choice(n, int(n*prop))
    return DecisionTreeClassifier(min_samples_leaf=5).fit(trn_idep.iloc[idxs], trn_dep.iloc[idxs])
# create as many trees as we want
trees = [get_tree() for _ in range(100)]
# average the per-tree predictions: each tree predicts 0 or 1, so the mean
# is the fraction of trees voting "survived"
all_probs = [t.predict(val_idep) for t in trees]
avg_probs = np.stack(all_probs).mean(0)

from sklearn.metrics import mean_absolute_error

mean_absolute_error(val_dep, avg_probs)
0.2272645739910314
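
Since avg_probs is just the fraction of trees voting "survived", it can also be thresholded at 0.5 to get hard class predictions and an accuracy (a minimal sketch using sklearn's accuracy_score):

from sklearn.metrics import accuracy_score

preds = (avg_probs >= 0.5).astype(int)  # majority vote across the 100 trees
accuracy_score(val_dep, preds)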

4. Using RandomForestClassifier

This is nearly identical to what sklearn's RandomForestClassifier does.
The main extra piece in a "real" random forest is that, as well as choosing a random sample of data for each tree:

  • it also picks a random subset of columns for each split (sketched below).
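
The hand-rolled get_tree above could mimic this per-split column sampling via DecisionTreeClassifier's max_features parameter (a minimal sketch; get_tree_rf is a hypothetical name, and "sqrt" is a common choice rather than anything used above):

def get_tree_rf(prop=0.75):
    n = len(trn_dep)
    idxs = random.choice(n, int(n*prop))
    # max_features="sqrt": each split considers only a random subset
    # of sqrt(n_features) columns, as a random forest does
    return DecisionTreeClassifier(min_samples_leaf=5, max_features="sqrt").fit(
        trn_idep.iloc[idxs], trn_dep.iloc[idxs])

sklearn's RandomForestClassifier handles both kinds of randomness directly:
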
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(100, min_samples_leaf=5)
rf.fit(trn_idep, trn_dep);
mean_absolute_error(val_dep, rf.predict(val_idep))
0.18834080717488788
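
The forest's MAE (0.188) beats the hand-rolled ensemble's (0.227). Note that rf.predict returns hard 0/1 labels; for a closer analogue of avg_probs above, the forest's averaged per-tree probabilities could be compared instead (a minimal sketch):

mean_absolute_error(val_dep, rf.predict_proba(val_idep)[:, 1])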