
The underlined sentences below, from p. 331 of An Introduction to Statistical Learning, have me scratching my head. Given that the splitting algorithm always finds the best next split in terms of error reduction, how could a worthless split be followed by a good split later on? And isn't setting a threshold the default way to implement a complexity cost, for example the cp parameter in rpart()?

[Image: excerpt on tree pruning, ISLR p. 331]

  • This is the nature of a "greedy" algorithm. I would reread the whole section so you are clear on what "greedy" means. – Commented Mar 22, 2024 at 8:51

1 Answer


It's quite possible for a bad split to be followed by a good split. Consider two predictor variables centered at (0, 0), where the outcome variable $Y$ is 1 if the two predictors have the same sign and 0 if they have different signs (an XOR-style pattern). No single split predicts usefully, but a split on the first predictor followed by splits on the second predictor within each branch can be highly accurate.
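Here is a minimal sketch of that scenario in R with rpart (the simulation and variable names are my own illustration, not from the book): with the default cp = 0.01, the first split looks worthless and the tree typically stays a stump, while growing with cp = 0 lets the later, good splits appear.

```r
library(rpart)

set.seed(1)
n  <- 500
x1 <- runif(n, -1, 1)
x2 <- runif(n, -1, 1)
y  <- factor(as.numeric(sign(x1) == sign(x2)))  # 1 if same sign, 0 otherwise

# Default cp = 0.01: no single split improves the fit enough,
# so the tree typically stops at the root.
fit_default <- rpart(y ~ x1 + x2, method = "class")

# cp = 0: grow the tree fully; a split on x1 followed by splits
# on x2 recovers the interaction, and pruning can keep that subtree.
fit_deep <- rpart(y ~ x1 + x2, method = "class", cp = 0)
printcp(fit_deep)
```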

That sort of scenario is why pruning can be helpful (and why backwards selection can beat forwards selection).

Even with pruning, tree building doesn't find the optimal tree, but pruning does allow it to (sometimes) find better trees than simple forward selection would.

  • I might be getting lost between the theory of backwards selection and rpart(): according to the docs, the cp parameter sets a minimum threshold by which R-squared must increase for a split to be attempted. "The main role of this parameter is to save computing time by pruning off splits that are obviously not worthwhile. Essentially, the user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and that hence the program need not pursue it." If I am understanding correctly, this is a forwards selection process. – Commented Mar 22, 2024 at 16:23
  • Yes, and you want to set that threshold rather lower than the pruning threshold. The pruning threshold is typically set by cross-validation. – Commented Mar 23, 2024 at 0:43
  • Am I correct that the list of subtrees in the cptable element of the rpart.object has the same order as the forward splits? If so, then with backwards selection you can only undo splits in the reverse order in which they were made. That at least is what I'm understanding both from experimenting with rpart() and from the claim that branches get pruned in "a nested and predictable fashion" as $\alpha$ is increased from 0 (ISLR, p. 333), though it's not obvious to me why that's the case. – Commented Mar 23, 2024 at 19:58
  • No, you can prune any subtree. Also, even if splits were pruned in the opposite order to how they were made, it would still be possible for a good split to follow a bad one, so stopping based on cp is not equivalent to overfitting and then pruning; see the sketch after these comments. – Commented Mar 24, 2024 at 20:36
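A hedged sketch of the pruning step the comments describe, continuing the hypothetical fit_deep from the earlier code. The column names come from rpart's cptable, and the rule shown (pick the cp minimizing cross-validated error) is one common convention, not the only one:

```r
cp_tab  <- fit_deep$cptable                              # nested subtrees indexed by cp
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]   # cp with lowest CV error
fit_pruned <- prune(fit_deep, cp = best_cp)              # prune back to that subtree
```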
