def _side_score(side, y):
    tot = side.sum()
    if tot <= 1:
        return 0
    return y[side].std() * tot

def score(idep_col, dep, split_val):
    lhs_bool_list = idep_col <= split_val
    return (_side_score(lhs_bool_list, dep) + _side_score(~lhs_bool_list, dep)) / len(dep)

def calc_best_bin_split_per_col(df, idep_col_name):
    idep_col = df[idep_col_name]
    dep = df[dep_col]
    col_uniques = idep_col.dropna().unique()  # all unique values of the independent column
    scores = np.array([  # get score for each unique value in idep_col
        score(idep_col, dep, col_val)
        for col_val in col_uniques
        if not np.isnan(col_val)
    ])
    idx = scores.argmin()  # index of the minimum (best) score
    return col_uniques[idx], scores[idx]
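To sanity-check the scoring helpers, here is a minimal sketch on a toy DataFrame (the six-row `df`, its `Age`/`Survived` columns, and the values are made up for illustration; the notebook's real data is the Titanic training set, and `dep_col` is assumed to be a module-level variable naming the dependent column):

```python
import numpy as np
import pandas as pd

dep_col = "Survived"  # assumed module-level name used by calc_best_bin_split_per_col

def _side_score(side, y):
    tot = side.sum()
    if tot <= 1:
        return 0
    return y[side].std() * tot

def score(idep_col, dep, split_val):
    lhs_bool_list = idep_col <= split_val
    return (_side_score(lhs_bool_list, dep) + _side_score(~lhs_bool_list, dep)) / len(dep)

def calc_best_bin_split_per_col(df, idep_col_name):
    idep_col = df[idep_col_name]
    dep = df[dep_col]
    col_uniques = idep_col.dropna().unique()
    scores = np.array([
        score(idep_col, dep, col_val)
        for col_val in col_uniques
        if not np.isnan(col_val)
    ])
    idx = scores.argmin()
    return col_uniques[idx], scores[idx]

# Toy data: everyone aged <= 6 survived, everyone older perished.
df = pd.DataFrame({
    "Age": [2, 4, 6, 30, 40, 50],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# Splitting at Age<=6 puts each outcome entirely on one side, so both
# sides have zero standard deviation and the score is the minimum, 0.
print(score(df["Age"], df["Survived"], 6))        # → 0.0
print(calc_best_bin_split_per_col(df, "Age"))     # → (6, 0.0)
```

A lower score means the split produces purer groups, which is why `argmin` picks the best split.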
3. Run 1R Classifier (best binary split) on each training set
3.1 Male Training Set
{col_name:calc_best_bin_split_per_col(trn_males, col_name) for col_name in all_cols}
The results above show that the best split for males is:
- Age<=6
and the best split for females is:
- Pclass<=2
These extra rules mean a decision tree model has been created.
This model will:
1. Check Sex is male or female then
2. Check Age<=6 or Pclass<=2 where appropriate, then make the prediction.
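The two-level rule set can be sketched as a plain function (a hand-rolled predictor for illustration; the function name and the `"male"`/`"female"` string labels are assumptions, with the thresholds taken from the splits above):

```python
def predict_survival(sex, age, pclass):
    """Two-level 1R tree: split on Sex first, then the per-group best split."""
    if sex == "male":
        # Males: the best binary split was Age <= 6
        return 1 if age <= 6 else 0
    else:
        # Females: the best binary split was Pclass <= 2
        return 1 if pclass <= 2 else 0

print(predict_survival("male", 4, 3))     # young boy → predicted to survive (1)
print(predict_survival("female", 30, 3))  # 3rd-class woman → predicted to perish (0)
```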
By repeating the process, additional rules can be created for each of the four groups just created.
This can be done manually, or we can use a model from the sklearn framework called DecisionTreeClassifier.
5. sklearn.tree: DecisionTreeClassifier
Using the sklearn module, we can build our decision tree with the DecisionTreeClassifier class.
5.1 Max 4 nodes
Start with maximum 4 nodes.
from sklearn.tree import DecisionTreeClassifier, export_graphviz

m_DecTree_max4nodes = DecisionTreeClassifier(max_leaf_nodes=4).fit(trn_idep, trn_dep)
The model applied 1R to two levels and determined the same splits I did. Graph terminology:
- colour: blue is a high survival rate, orange is a low survival rate.
- samples: the number of rows matching the node's set of rules.
- values: how many survived or perished, hence two values.
- gini: a measure of impurity; 0 means every row in the group has the same outcome, while 0.5 means an even mix of the two outcomes.
draw_tree(m_DecTree_max4nodes, trn_idep, size=10)
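The gini value shown in each node can be reproduced from the class counts. A minimal sketch (the `gini` helper is my own, not part of sklearn):

```python
def gini(counts):
    """Gini impurity from per-class counts: 1 - sum(p_i ** 2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([50, 0]))   # pure node → 0.0
print(gini([25, 25]))  # even binary mix → 0.5
```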
5.2.1 Loss (DTree, max 4 leaf nodes)
MAE 22.4%
from sklearn.metrics import mean_absolute_error

mean_absolute_error(val_dep, m_DecTree_max4nodes.predict(val_idep))
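Since both labels and predictions are 0 or 1, the mean absolute error here is simply the fraction of wrong predictions, so 22.4% means roughly 22% of validation passengers were misclassified. A quick check with plain numpy (toy arrays, for illustration only):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0])  # two wrong out of five

# For binary labels, |y - yhat| is 1 exactly on the wrong predictions,
# so the mean absolute error equals the misclassification rate.
mae = np.abs(y_true - y_pred).mean()
print(mae)  # → 0.4
```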