CSCI 335: Fall 2023
Splitting Decision Tree Nodes
Goal: Find partitions that yield homogeneous sets
Requires a metric for homogeneity
What is the set of possible feature splits?
Boolean features:
Only one possible split
Features with n discrete values:
Create a tree with n branches rather than two branches
Numerically-valued features:
For each concrete value in the training set:
Perform a binary split around that value
From all of the possible splits, we select the split with the highest gain.
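The split enumeration above can be sketched in Python. This is only an illustration: the function names `discrete_split` and `numeric_splits` are hypothetical, and reading "a binary split around that value" as `<=` versus `>` is an assumption, as is skipping any split that would leave one branch empty.

```python
from collections import defaultdict


def discrete_split(values):
    """Single split with one branch per distinct value.

    Covers Boolean features (two branches) and features with
    n discrete values (n branches). Each branch holds row indices.
    """
    branches = defaultdict(list)
    for index, value in enumerate(values):
        branches[value].append(index)
    return dict(branches)


def numeric_splits(values):
    """One binary split per distinct value seen in the training set.

    Assumption: the split around threshold t sends rows with
    value <= t to the left branch and rows with value > t to the right.
    Splits that would leave a branch empty are skipped.
    """
    splits = {}
    for threshold in sorted(set(values)):
        left = [i for i, v in enumerate(values) if v <= threshold]
        right = [i for i, v in enumerate(values) if v > threshold]
        if left and right:
            splits[threshold] = (left, right)
    return splits


# A Boolean feature yields exactly one split with two branches.
print(discrete_split([True, False, True]))
# A numeric feature yields one candidate split per usable threshold.
print(numeric_splits([3.0, 1.0, 2.0]))
```

Each candidate split would then be scored, and the one with the highest gain kept.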
Calculating Homogeneity
For each label $i$, calculate the portion $p_i$.
$p_i$ is the probability that a member of the set has label $i$.
Find the total number of elements with label $i$.
Divide by the total number of elements.
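Gini Coefficient (more commonly called Gini impurity): $1 - \sum_i p_i^2$, summed over all labels $i$
A value of 0 means every element of the set has the same label; larger values indicate a more mixed set.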
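A minimal sketch of this calculation in Python follows. The function name `gini` is chosen for illustration, and returning 0.0 for an empty set is an assumption, since the notes do not say how to treat that case.

```python
from collections import Counter


def gini(labels):
    """Return 1 - sum(p_i ** 2) over the labels present in the set."""
    total = len(labels)
    if total == 0:
        return 0.0  # assumption: treat an empty set as perfectly homogeneous
    counts = Counter(labels)  # number of elements with each label
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


print(gini(["a", "a", "a", "a"]))  # 0.0: every element has the same label
print(gini(["a", "a", "b", "b"]))  # 0.5: two labels, evenly mixed
```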
Calculating the Gain
Calculate the homogeneity for each branch $b$ of the split ($h_b$), as well as for the set prior to the split ($h_{\text{parent}}$).
The gain is $h_{\text{parent}} - \sum_b h_b$, across all branches $b$.
High values for the gain indicate that the split creates branches that are relatively more homogeneous than the parent.
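The gain formula, exactly as stated in these notes, can be sketched in Python as follows. The names `gini` and `gain` and the toy labels are illustrative assumptions, not course code.

```python
from collections import Counter


def gini(labels):
    """Return 1 - sum(p_i ** 2) over the labels present in the set."""
    total = len(labels)
    if total == 0:
        return 0.0  # assumption: an empty set counts as homogeneous
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


def gain(parent_labels, branches):
    """h_parent minus the sum of h_b over all branches b."""
    return gini(parent_labels) - sum(gini(branch) for branch in branches)


parent = ["y", "y", "n", "n"]
candidates = {
    "A": [["y", "y"], ["n", "n"]],  # separates the labels perfectly
    "B": [["y", "n"], ["y", "n"]],  # leaves both branches fully mixed
}

for name, branches in candidates.items():
    print(name, gain(parent, branches))  # A 0.5, then B -0.5

# Select the split with the highest gain.
best = max(candidates, key=lambda name: gain(parent, candidates[name]))
print(best)  # A
```

Because the branch terms are summed without weights, the gain can be negative, as split B shows. Many implementations instead weight each branch's term by that branch's share of the parent's elements. That variant is not what these notes define, so the sketch keeps the unweighted sum.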