SL1 Decision Tree


Decision Tree

1. Describe the Representation and Algorithm of DT

1. Representation:

| node | edge | leaf |
| --- | --- | --- |
| attribute | attribute value | output |
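A toy illustration of this representation as a nested dict; the attribute names and values below are made up for illustration, not taken from the notes.

```python
# Toy tree: nodes are attributes, edges are attribute values, leaves are outputs.
play_outside = {
    "Outlook": {                       # node: attribute
        "Sunny": {"Humidity": {        # edge: attribute value -> child node
            "High": "No",              # leaf: output
            "Normal": "Yes",
        }},
        "Overcast": "Yes",
        "Rain": "No",
    }
}
```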

2. Algorithm:

  1. Pick the best attribute — ideally one that splits the data roughly in half
  2. Ask a question about that attribute
  3. Follow the path corresponding to the answer
  4. Repeat from 1 until only one answer remains (sketched below)
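A minimal recursive sketch of this loop, assuming examples are (feature-dict, label) pairs and a `best_attribute` scoring callback; both of those are my assumptions, not part of the notes.

```python
from collections import Counter

def majority_label(examples):
    # Most common label among the examples (used when attributes run out).
    return Counter(label for _, label in examples).most_common(1)[0][0]

def build_tree(examples, attributes, best_attribute):
    """examples: list of (feature_dict, label); attributes: set of attribute names;
    best_attribute(examples, attributes) -> name of the attribute to split on."""
    labels = {label for _, label in examples}
    if len(labels) == 1:                  # only one answer left -> leaf
        return labels.pop()
    if not attributes:                    # nothing left to ask -> majority leaf
        return majority_label(examples)
    a = best_attribute(examples, attributes)           # 1. pick the best attribute
    tree = {a: {}}
    for value in {feats[a] for feats, _ in examples}:  # 2. ask about it, 3. follow the answer
        subset = [(f, y) for f, y in examples if f[a] == value]
        tree[a][value] = build_tree(subset, attributes - {a}, best_attribute)  # 4. repeat
    return tree
```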

2. Appropriate Problems for DT

  1. Inputs are attribute-value pairs
  2. Outputs are discrete

3. Describe each step of ID3 algorithm, its hypothesis space, and its bias

  1. ID3 Algorithm:
    1. Pick the best attribute A — the one with the largest information gain (reduction of entropy): Gain(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) · Entropy(S_v), where Entropy(S) = −Σ_i p_i · log2(p_i). When both class probabilities equal 0.5, entropy reaches its maximum of 1. (See the sketch after this list.)
    2. Split the data using different A values, assign the A values as the decision attribute for the node
    3. For each node (value of A), create a descendant node
    4. Sort training examples to leaves
    5. If examples perfectly classified, stop, else, iterate over leaves
  2. Hypothesis space:
    1. ID3 is a greedy algorithm that only looks one step ahead, so it maintains a single tree hypothesis at a time; with no backtracking, it never knows about alternative trees that are also consistent with the data.
    2. Because there is no backtracking, it risks converging to a local optimum rather than the global one.
  3. Bias:
    1. No restriction bias — the hypothesis space it considers is the complete space of decision trees over the given attributes.
    2. Preference bias — within that space it prefers: (1) good splits at the top of the tree, (2) correct trees over incorrect ones, (3) shorter trees over longer ones (a consequence of putting good splits at the top).
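A minimal sketch of the entropy and information-gain computations from step 1, assuming the same (feature-dict, label) data format as the earlier sketch.

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum_i p_i * log2(p_i); maximal (1 for two classes) at a 50/50 split.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    # Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
    labels = [label for _, label in examples]
    total = entropy(labels)
    n = len(examples)
    remainder = 0.0
    for value in {feats[attribute] for feats, _ in examples}:
        subset = [label for feats, label in examples if feats[attribute] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return total - remainder

# Sanity check: a 50/50 split has entropy 1.
assert abs(entropy(["+", "+", "-", "-"]) - 1.0) < 1e-9
```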

4. Other considerations

  1. What about continuous attributes (or discrete attributes with many values)? — penalize them
    1. An attribute with many distinct values splits the data into many tiny subsets whose entropies are ~0, so its information gain approaches the maximum: Gain(S, A) ≈ Entropy(S).
    2. To penalize such attributes, we replace information gain with the gain ratio: GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A), where SplitInformation(S, A) = −Σ_i (|S_i| / |S|) · log2(|S_i| / |S|) measures how broadly and evenly the attribute partitions the data. (A sketch of this, and of the regression variant below, follows this list.)
  2. Does it make sense to repeat an attribute along a path?
    1. No, if the attribute has a finite set of values — it has already been used up on that path
    2. But a continuous attribute can be tested again with different questions (ranges)
  3. Stop criteria?
    1. All examples have been perfectly classified
    2. No more available attributes
    3. When the data has errors / noise, it can never be perfectly classified
      1. Use a validation set and a threshold of error tolerance
  4. What if it is a regression problem?
    1. Information gain is not defined over continuous outputs; variance can be used as an alternative splitting criterion.
    2. We also need to decide how to report the leaves; the mean of a leaf's examples is one option.
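A minimal sketch of the gain-ratio penalty and the regression-variance alternative; the `gain` argument would come from the `information_gain` sketch in section 3, and the (feature-dict, label) format is the same assumption as before.

```python
import math
from collections import Counter

def split_information(examples, attribute):
    # SplitInformation(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|)
    # Large when A partitions S into many small, even subsets (e.g. an ID-like attribute).
    n = len(examples)
    counts = Counter(feats[attribute] for feats, _ in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(gain, split_info):
    # GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
    return gain / split_info if split_info > 0 else 0.0

def variance(values):
    # Regression alternative to entropy: prefer splits that reduce variance,
    # and report the mean of a leaf's examples as its prediction.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```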

5. Decision tree Complexity

| Construct | Query |
| --- | --- |
| O(mn log n) | O(log n) |

6. Pros and Cons

  1. Pros:
    1. Simple to understand and interpret — it is a white-box model: every decision is observable and can be explained with Boolean logic. By contrast, a black-box model like an ANN can be hard to interpret.
    2. Requires little data preparation — no normalization, no need to deal with blank values.
    3. Query time is logarithmic in the number of training examples
    4. Able to deal with both regression and classification, including multi-output problems.
  2. Cons:
    1. Easy to overfit.
    2. A DT cannot guarantee a globally optimal tree structure; the greedy construction may end up in a local optimum.
    3. Some concepts are hard to learn and not easy for a DT to express, such as XOR and parity.
    4. It is biased toward the dominant class, so it is better to balance the dataset before feeding it to the tree.

7. How to overcome overfitting?

  1. Pruning — pre-pruning (set stopping criteria so the tree stops growing early) and post-pruning (grow the full tree, then check whether subtrees can be collapsed); see the sketch after this list
  2. Ensemble learning
  3. Cross validation
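A hedged scikit-learn sketch of both pruning styles plus cross-validation, assuming scikit-learn is the tool of choice; the concrete parameter values (max_depth=3, min_samples_leaf=5, cv=5) are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growing early via depth / leaf-size limits (the stop criteria).
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Post-pruning: compute the cost-complexity pruning path, then pick the
# pruning strength (ccp_alpha) that cross-validates best.
path = DecisionTreeClassifier().cost_complexity_pruning_path(X, y)
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: cross_val_score(DecisionTreeClassifier(ccp_alpha=a), X, y, cv=5).mean(),
)
post_pruned = DecisionTreeClassifier(ccp_alpha=best_alpha).fit(X, y)
```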

8. Within Supervised Learning:

| Classification | Regression |
| --- | --- |
| maps input to discrete values (most of the time boolean) | maps input to continuous values |

9. Decision Tree for AND, OR, XOR

For AND and OR, the trees are similar in shape but with the branch roles flipped.

Tree size (number of nodes): n-OR needs O(n) nodes — linear. n-XOR (parity — true iff an odd number of attributes are true) needs O(2^n) nodes — exponential.

Extreme case: size of the hypothesis space. With N binary attributes, there are on the order of N! orderings of splits and corresponding tree structures; the truth table has 2^N rows, each row representing a path to a leaf, so the output column can take 2^(2^N) different bit patterns, i.e. 2^(2^N) distinct boolean functions (see the sketch below).
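A tiny sketch that counts the truth-table rows and possible output columns for N = 3 binary attributes, with parity as the hard case.

```python
from itertools import product

N = 3
rows = list(product([False, True], repeat=N))   # truth table: 2^N rows
print(len(rows))                                # 8 rows, each a path to a leaf
print(2 ** (2 ** N))                            # 256 possible output columns (boolean functions)

# n-XOR / parity: true when an odd number of attributes are true.
parity = [sum(row) % 2 == 1 for row in rows]
print(parity)  # the output flips with every attribute, so the tree needs O(2^n) nodes
```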

10. Restriction Bias vs Preference Bias

Restriction Bias:  the set of hypotheses the learner considers at all. E.g. a decision tree learner only considers possible decision tree structures; it does not consider quadratic equations or other non-boolean function forms.

Preference Bias:  within the restriction bias, which hypotheses the learner prefers. E.g. given two trees that are both correct, it prefers (1) good splits at the top, (2) correct trees over incorrect ones, and (3) shorter trees over taller ones — (3) results from (1), because good splits near the top save extra splits further down.

11. Runtime

  1. Constructing a balanced tree with m features and n samples:
    1. For each feature, sort the data by that feature (O(n log n)), then iterate through the sorted values to decide which splitting value maximizes the information gain, giving roughly O(mn log n) overall (see the sketch below)
  2. Query time: O(logn)
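A minimal sketch of the per-feature work: sort once, then scan candidate thresholds. The `score` callback (e.g. an information-gain variant over the two label subsets) is a hypothetical parameter, not something defined in the notes.

```python
def best_threshold(values, labels, score):
    """Sort one feature's values (O(n log n)), then scan candidate thresholds and keep
    the split that maximizes score(left_labels, right_labels).
    Note: this naive scan re-scores each split from scratch (O(n^2) for this feature);
    real implementations update class counts incrementally so the scan stays O(n)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_thr = float("-inf"), None
    for k in range(1, len(order)):
        threshold = (values[order[k - 1]] + values[order[k]]) / 2
        left = [labels[i] for i in order[:k]]
        right = [labels[i] for i in order[k:]]
        s = score(left, right)
        if s > best_score:
            best_score, best_thr = s, threshold
    return best_score, best_thr
```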