1. Introduction
In Part 1, a simple model was built using a single binary split, called a OneR classifier.
In Part 2, sklearn's DecisionTreeClassifier was used, and by setting a minimum number of samples per node, loss was reduced.
In Part 3, we used the concept of bagging, averaging the predictions of many big trees to create a random forest.
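As a quick refresher, that bagging step is just averaging. Here is a minimal sketch of it, assuming `trees` is a list of already-fitted classifiers and `val_idep` is a validation feature table (both names are illustrative, not from the earlier posts):

import numpy as np

def bagged_predict(trees, val_idep):
    # each tree estimates the probability of survival; averaging those
    # estimates across the ensemble is the bagged prediction
    all_probs = [t.predict_proba(val_idep)[:, 1] for t in trees]
    return np.mean(all_probs, axis=0)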
Today, we’ll create a Feature Importance Plot quickly and easily.
In the next post I’ll go into Gradient Boosting Machines (GBMs), which build a model as a sum of trees.
2. Training and Validation Sets
from fastai.imports import *
import torch, numpy as np, pandas as pd
import kaggle, zipfile
from pathlib import Path

path = Path("titanic")
if not path.exists():
    print(f"{path} folder doesn't exist, downloading...")
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f"{path}.zip").extractall(path)
else:
    print(f"{path} exists!")

!ls {path}
def proc_data_1(df):
    # fill missing values (Fare with 0, everything else with the column mode),
    # add LogFare, and convert Sex/Embarked to pandas categoricals
    modes = df.mode().iloc[0]
    df['Fare'] = df.Fare.fillna(0)
    df.fillna(modes, inplace=True)
    df['LogFare'] = np.log1p(df['Fare'])
    df['Embarked'] = pd.Categorical(df.Embarked)
    df['Sex'] = pd.Categorical(df.Sex)
def convert_cats_to_codes_2(trn_df, val_df, cat_list):
    # replace each category with its integer code (e.g. Sex -> 0/1, Embarked -> 0/1/2)
    trn_df[cat_list] = trn_df[cat_list].apply(lambda dfcol: dfcol.cat.codes)
    val_df[cat_list] = val_df[cat_list].apply(lambda dfcol: dfcol.cat.codes)
    return trn_df, val_df
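For reference, `cat.codes` assigns each category an integer in sorted category order; a tiny standalone illustration (the toy values below are just for demonstration):

s = pd.Series(pd.Categorical(['male', 'female', 'male']))
print(s.cat.codes.tolist())   # [1, 0, 1] -- 'female' -> 0, 'male' -> 1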
from numpy import random
from sklearn.model_selection import train_test_split

random.seed(42)

# 0. get raw data
df = pd.read_csv(path/'train.csv')
tst_df = pd.read_csv(path/'test.csv')

# 1. clean data ([replace NAs with mode], [LogFare], [Sex/Embarked to categorical])
proc_data_1(df)
proc_data_1(tst_df)

# 2. split training data into training and validation sets
trn_df, val_df = train_test_split(df, test_size=0.25)

# 3. convert categories to codes
cat_list = ["Sex", "Embarked"]
trn_df, val_df = convert_cats_to_codes_2(trn_df, val_df, cat_list)

# 4. get independent (idep) and dependent (dep) variables
dep_col = "Survived"
cont_list = ['Age', 'SibSp', 'Parch', 'LogFare', "Pclass"]

def get_trn_and_val_idep_dep(df):
    idep = df[cat_list + cont_list].copy()
    dep = df[dep_col]
    return idep, dep

trn_idep, trn_dep = get_trn_and_val_idep_dep(trn_df)
val_idep, val_dep = get_trn_and_val_idep_dep(val_df)
titanic exists!
gender_submission.csv test.csv train.csv
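Before building the tree, an optional sanity check that the split and columns look right (the shapes below assume the standard 891-row Titanic training file):

print(trn_idep.shape, val_idep.shape)   # roughly (668, 7) and (223, 7)
print(trn_idep.columns.tolist())        # ['Sex', 'Embarked', 'Age', 'SibSp', 'Parch', 'LogFare', 'Pclass']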
3. Build Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

# each leaf must contain at least 50 passengers, which keeps the tree small
dtc_min50 = DecisionTreeClassifier(min_samples_leaf=50)
4. Fit Decision Tree to our Training Data
dtc_min50.fit(trn_idep, trn_dep)
DecisionTreeClassifier(min_samples_leaf=50)
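As an optional check before plotting (not required for the feature importance plot itself), we can see how this tree does on the validation set using sklearn's accuracy_score:

from sklearn.metrics import accuracy_score

# fraction of validation passengers whose survival the tree predicts correctly
print(accuracy_score(val_dep, dtc_min50.predict(val_idep)))

# number of leaves in the fitted tree (min_samples_leaf=50 keeps this small)
print(dtc_min50.get_n_leaves())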
5. Create Feature Importance Plot
pd.DataFrame(dict(cols=trn_idep.columns, imp=dtc_min50.feature_importances_)).plot('cols', 'imp', 'barh')
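If you prefer numbers to bars, the same importances can be sorted and printed (a small optional addition):

fi = pd.DataFrame(dict(cols=trn_idep.columns, imp=dtc_min50.feature_importances_))
print(fi.sort_values('imp', ascending=False))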
6. Completed
As expected, Sex and Pclass are the most important features for predicting survival on the Titanic.