The underlined sentences below from p. 331 of An Introduction to Statistical Learning have me scratching my head: given that the splitting algorithm always finds the best next split in terms of error reduction, how could a worthless split be followed by a good split later on? And isn't setting a threshold the default way to implement a complexity cost, for example the cp parameter in rpart()?
This is the nature of a "greedy" algorithm. I would reread the whole section so you are clear what "greedy" means. – seanv507, Mar 22, 2024 at 8:51
1 Answer
It's quite possible for a bad split to be followed by a good split. Consider two predictor variables centered at (0,0), where the outcome variable $Y$ is 1 if the predictors have the same sign and 0 if they have different signs. No single split will predict usefully, but a split on the first predictor followed by splits on the second predictor within each branch can be highly accurate.
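A minimal R sketch of this scenario, assuming the rpart package; the sample size, seed, and cp values are arbitrary choices for illustration:

    library(rpart)

    set.seed(1)
    n  <- 1000
    x1 <- runif(n, -1, 1)
    x2 <- runif(n, -1, 1)
    y  <- factor(as.numeric(sign(x1) == sign(x2)))  # 1 if same sign, else 0
    d  <- data.frame(x1, x2, y)

    # With the default cp = 0.01, the best first split barely reduces the
    # impurity, so rpart typically stops at (or very near) the root.
    fit_default <- rpart(y ~ x1 + x2, data = d, method = "class")

    # Forcing growth past the "worthless" first split (cp = 0) lets the
    # good second-level splits on the other predictor be found.
    fit_deep <- rpart(y ~ x1 + x2, data = d, method = "class", cp = 0)

    mean(predict(fit_default, type = "class") == y)  # around 0.5
    mean(predict(fit_deep, type = "class") == y)     # close to 1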
That sort of scenario is why pruning can be helpful (and why backwards selection can beat forwards selection).
Even with pruning, tree building doesn't find the optimal tree, but pruning does allow it to (sometimes) find better trees than simple forward selection would.
I might be getting lost between the theory of backwards selection and rpart(): according to the docs, the cp parameter sets a minimum threshold by which R-squared must increase for a split to be attempted. "The main role of this parameter is to save computing time by pruning off splits that are obviously not worthwhile. Essentially, the user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and that hence the program need not pursue it." If I am understanding correctly, this is a forwards selection process. – Jon, Mar 22, 2024 at 16:23
Yes, and you want to set that threshold rather lower than the pruning threshold. The pruning threshold is typically set by cross-validation. – Thomas Lumley, Mar 23, 2024 at 0:43
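A hedged sketch of that workflow, reusing the simulated data frame d from the example in the answer; cp = 0.001 is just an illustrative "loose" forward threshold:

    # Grow a deliberately large tree with a forward threshold well below
    # the cp you expect cross-validation to select.
    big_tree <- rpart(y ~ x1 + x2, data = d, method = "class", cp = 0.001)

    # cptable records the nested sequence of subtrees along with their
    # cross-validated errors (xerror).
    printcp(big_tree)

    # Prune back to the subtree with the smallest cross-validated error
    # (the 1-SE rule is another common choice).
    best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
    pruned  <- prune(big_tree, cp = best_cp)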
Am I correct that the list of subtrees in the cptable element of the rpart object has the same order as the forward splits? If so, then with backwards selection you can only undo splits in the reverse order in which they were made. That at least is what I'm understanding both from experimenting with rpart() and from the claim that branches get pruned in "a nested and predictable fashion" as α is increased from 0 (ISLR, p. 333), though it's not obvious to me why that's the case. – Jon, Mar 23, 2024 at 19:58
No, you can prune any subtree. Also, even if splits were pruned in the opposite order to how they were made, it is still possible to have a good split happen after a bad split, so stopping based on cp is not equivalent to overfitting then pruning. – Thomas Lumley, Mar 24, 2024 at 20:36
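A short sketch of both points, again reusing big_tree from the comment above: pruning at each cp breakpoint in cptable yields the nested sequence of subtrees that ISLR describes, while snip.rpart() is one way to see that pruning is not limited to undoing splits in reverse order (node 2 below is just an arbitrary internal node chosen for illustration):

    # Leaf counts of the subtrees obtained at each cp value in cptable;
    # each tree is nested inside the next larger one as cp decreases.
    cps <- big_tree$cptable[, "CP"]
    sapply(cps, function(a) sum(prune(big_tree, cp = a)$frame$var == "<leaf>"))

    # snip.rpart() collapses any chosen node's subtree into a leaf,
    # regardless of when that split was made during forward growth.
    snipped <- snip.rpart(big_tree, toss = 2)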
