The underlined sentences below from p. 331 of An Introduction to Statistical Learning have me scratching my head: given that the splitting algorithm always finds the best next split in terms of error reduction, how could a worthless split be followed by a good split later on? And isn't setting a threshold the default way to implement a complexity cost, for example the cp parameter in rpart()?
This is the nature of a "greedy" algorithm. I would reread the whole section so you are clear what "greedy" means. – seanv507, Mar 22, 2024 at 8:51
1 Answer
It's quite possible for a bad split to be followed by a good split. Consider two predictor variables centered at (0,0), where the outcome variable $Y$ is 1 if the predictors have the same sign and 0 if they have different signs. No single split will predict usefully, but a split on the first predictor followed by splits on the second predictor within each branch can be highly accurate.
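A minimal R sketch of this scenario, assuming the rpart package; the sample size, seed, and cp values are arbitrary choices for illustration:

    library(rpart)

    set.seed(1)
    n  <- 1000
    x1 <- runif(n, -1, 1)
    x2 <- runif(n, -1, 1)
    y  <- factor(as.numeric(sign(x1) == sign(x2)))  # 1 if same sign, else 0
    d  <- data.frame(x1, x2, y)

    # With the default cp = 0.01, the best first split barely reduces the
    # impurity, so rpart typically stops at (or very near) the root.
    fit_default <- rpart(y ~ x1 + x2, data = d, method = "class")

    # Forcing growth past the "worthless" first split (cp = 0) lets the
    # good second-level splits on the other predictor be found.
    fit_deep <- rpart(y ~ x1 + x2, data = d, method = "class", cp = 0)

    mean(predict(fit_default, type = "class") == y)  # around 0.5
    mean(predict(fit_deep, type = "class") == y)     # close to 1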
That sort of scenario is why pruning can be helpful (and why backwards selection can beat forwards selection).
Even with pruning, tree building doesn't find the optimal tree, but pruning does allow it to (sometimes) find better trees than simple forward selection would.
I might be getting lost between the theory of backwards selection and rpart(): according to the docs, the cp parameter sets a minimum threshold by which R-squared must increase for a split to be attempted. "The main role of this parameter is to save computing time by pruning off splits that are obviously not worthwhile. Essentially, the user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and that hence the program need not pursue it." If I am understanding correctly, this is a forwards selection process. – Jon, Mar 22, 2024 at 16:23
Yes, and you want to set that threshold rather lower than the pruning threshold. The pruning threshold is typically set by cross-validation. – Thomas Lumley, Mar 23, 2024 at 0:43
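A hedged sketch of that workflow, reusing the simulated data frame d from the example in the answer; cp = 0.001 is just an illustrative "loose" forward threshold:

    # Grow a deliberately large tree with a forward threshold well below
    # the cp you expect cross-validation to select.
    big_tree <- rpart(y ~ x1 + x2, data = d, method = "class", cp = 0.001)

    # cptable records the nested sequence of subtrees along with their
    # cross-validated errors (xerror).
    printcp(big_tree)

    # Prune back to the subtree with the smallest cross-validated error
    # (the 1-SE rule is another common choice).
    best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
    pruned  <- prune(big_tree, cp = best_cp)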
Am I correct that the list of subtrees in the cptable element of the rpart object has the same order as the forward splits? If so, then with backwards selection you can only undo splits in the reverse order in which they were made. That at least is what I'm understanding both from experimenting with rpart() and from the claim that branches get pruned in "a nested and predictable fashion" as α is increased from 0 (ISLR, p. 333), though it's not obvious to me why that's the case. – Jon, Mar 23, 2024 at 19:58
No, you can prune any subtree. Also, even if splits were pruned in the opposite order to how they were made, it is still possible to have a good split happen after a bad split, so stopping based on cp is not equivalent to overfitting then pruning. – Thomas Lumley, Mar 24, 2024 at 20:36
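A short sketch of both points, again reusing big_tree from the comment above: pruning at each cp breakpoint in cptable yields the nested sequence of subtrees that ISLR describes, while snip.rpart() is one way to see that pruning is not limited to undoing splits in reverse order (node 2 below is just an arbitrary internal node chosen for illustration):

    # Leaf counts of the subtrees obtained at each cp value in cptable;
    # each tree is nested inside the next larger one as cp decreases.
    cps <- big_tree$cptable[, "CP"]
    sapply(cps, function(a) sum(prune(big_tree, cp = a)$frame$var == "<leaf>"))

    # snip.rpart() collapses any chosen node's subtree into a leaf,
    # regardless of when that split was made during forward growth.
    snipped <- snip.rpart(big_tree, toss = 2)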
