## Description

STAT 435

Homework # 8

Online Submission Via Canvas

Instructions: You may discuss the homework problems in small groups, but you must write up the final solutions and code yourself. Please turn in your code for the problems that involve coding. However, for those problems you must also provide written answers: you will receive no credit if you submit code without written answers. You might want to use R Markdown to prepare your assignment.

1. In this problem, you will fit some models to a data set of your choice.

(a) Find a very large data set of your choice (large n, possibly large p). Select one quantitative variable to be your response, Y ∈ R. Describe the data.

(b) Grow a very big regression tree on the data. Plot the tree, and report its residual sum of squares (RSS) on the (training) data.

(c) Now use cost-complexity pruning to prune the tree to have 6 leaves. Plot the pruned tree, and report its RSS on the (training) data. How does this compare to the RSS obtained in (b)? Explain your answer.
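For parts (b) and (c), one possible workflow uses the `tree` package. The sketch below is only a starting point: `dat` is a placeholder name for your data frame, with the response in a column called `y` (neither name comes from the assignment).

```r
library(tree)  # CRAN package for classification and regression trees

# (b) Grow a very large tree by relaxing the default stopping rules.
big <- tree(y ~ ., data = dat,
            control = tree.control(nobs = nrow(dat),
                                   mincut = 1, minsize = 2, mindev = 0))
plot(big)
text(big, pretty = 0)
deviance(big)    # for a regression tree, deviance() is the training RSS

# (c) Cost-complexity pruning down to 6 terminal nodes.
pruned <- prune.tree(big, best = 6)
plot(pruned)
text(pruned, pretty = 0)
deviance(pruned)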

(d) Perform cross-validation to estimate the test error, as the tree is pruned using cost-complexity pruning. Plot the estimated test error, as a function of tree size. The tree size should be on the x-axis and the estimated test error should be on the y-axis.

(e) Plot the “best” tree (with size chosen by cross-validation in (d)), fit to all of the data. Report its RSS on the (training) data.
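For (d) and (e), `cv.tree()` in the `tree` package runs cross-validation along the cost-complexity pruning sequence. A self-contained sketch, again with `dat` and `y` as placeholder names:

```r
library(tree)

big <- tree(y ~ ., data = dat,
            control = tree.control(nobs = nrow(dat),
                                   mincut = 1, minsize = 2, mindev = 0))

# (d) 10-fold cross-validation over the pruning sequence;
#     cv$dev is the CV estimate of (total) squared error at each size.
cv <- cv.tree(big, FUN = prune.tree, K = 10)
plot(cv$size, cv$dev, type = "b",
     xlab = "Tree size (number of leaves)", ylab = "Estimated test error")

# (e) Refit at the size minimizing the CV error, on all the data.
best.size <- cv$size[which.min(cv$dev)]
final <- prune.tree(big, best = best.size)
plot(final)
text(final, pretty = 0)
deviance(final)  # training RSS of the selected tree
```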

(f) Perform bagging, and estimate its test error.

(g) Fit a random forest, and estimate its test error.

(h) Which method (regression tree, bagging, random forest) results in the smallest estimated test error? Comment on your results.
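For (f) and (g), the `randomForest` package covers both methods: bagging is just a random forest with `mtry` equal to the number of predictors. The out-of-bag (OOB) error is one convenient test-error estimate; a held-out test set works too. As before, `dat` and `y` are placeholder names:

```r
library(randomForest)

p <- ncol(dat) - 1   # number of predictors (response excluded)

# (f) Bagging = random forest with mtry = p (all predictors tried at each split).
bag <- randomForest(y ~ ., data = dat, mtry = p, ntree = 500)
tail(bag$mse, 1)     # OOB estimate of the test MSE after 500 trees

# (g) Random forest with the default mtry (about p/3 for regression).
rf <- randomForest(y ~ ., data = dat, ntree = 500)
tail(rf$mse, 1)      # OOB estimate of the test MSE
```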

2. In this problem, we will consider fitting a regression tree to some data with p = 2.

(a) Find a data set with n large, p = 2 features, and Y ∈ R. It’s OK to reuse the data from Question 1 with just two of the features.


(b) Grow a regression tree with 8 terminal nodes. Plot the tree.

(c) Now make a plot of feature space, showing the partition corresponding to the tree in (b). The axes should be X1 and X2. Your plot should contain vertical and horizontal line segments indicating the regions corresponding to the leaves in the tree from (b). Superimpose a scatterplot of the n observations onto this plot. This should look something like Figure 8.2 in the textbook. Label each region with the prediction for that region.

Note: If you want, you can plot the horizontal and vertical line segments in (c) by hand (instead of figuring out how to plot them in R).
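If you would rather let R draw the partition in (c), `partition.tree()` in the `tree` package handles the p = 2 case and labels each region with its predicted value. A sketch, with `dat`, `y`, `X1`, and `X2` as placeholder names:

```r
library(tree)

fit <- tree(y ~ X1 + X2, data = dat)
fit8 <- prune.tree(fit, best = 8)  # 8 terminal nodes, as in (b);
                                   # requires the unpruned tree to have >= 8 leaves

# Scatterplot of the n observations, then the axis-aligned partition on top,
# with each region labelled by its prediction.
plot(dat$X1, dat$X2, pch = 20, col = "grey", xlab = "X1", ylab = "X2")
partition.tree(fit8, ordvars = c("X1", "X2"), add = TRUE)
```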

3. This problem has to do with bagging.

(a) Consider a single regression tree with just two terminal nodes (leaves). Suppose that the single internal node splits on X1 < c. If X1 < c then a prediction of 13.9 is made; if X1 ≥ c then a prediction of 3.4 is made. Write out an expression for f(·) in the regression model Y = f(X1, . . . , Xp) + ε corresponding to this tree.
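As a reminder from the textbook (not the answer itself): any regression tree with terminal-node regions R_1, . . . , R_M and per-leaf predictions c_1, . . . , c_M corresponds to a piecewise-constant f; part (a) asks for the M = 2 special case of

```latex
f(X_1, \ldots, X_p) = \sum_{m=1}^{M} c_m \, \mathbf{1}\{(X_1, \ldots, X_p) \in R_m\}.
```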

(b) Now suppose you bag some regression trees, each of which contains just two terminal nodes (leaves). Show that this results in an additive model, i.e. a model of the form

Y = f1(X1) + f2(X2) + · · · + fp(Xp) + ε.

(c) Now suppose you perform bagging with larger regression trees, each of which has at least three terminal nodes (leaves). Does this result in an additive model? Explain your answer.

4. If you’ve paid attention in class, then you know that in statistics, there is no free lunch: depending on the form of the function f(·) in the regression model Y = f(X1, . . . , Xp) + ε, a given statistical machine learning algorithm might work very well, or not well at all. You will now demonstrate this in a simulation with p = 2 and n = 1000.

(a) Generate X1, X2, and ε as

    x1 <- sample(seq(0, 10, len = 1000))
    x2 <- sample(seq(0, 10, len = 1000))
    eps <- rnorm(1000)

If you generate Y according to the model Y = f(X1, X2) + ε, then what will be the value of the irreducible error?


(b) Give an example of a function f(·) for which a least squares regression model fit to (x1, y1), . . . , (xn, yn) can be expected to outperform a regression tree fit to (x1, y1), . . . , (xn, yn), in terms of expected test error. Explain why you expect the least squares regression model to work better for this choice of f(·).

(c) Now calculate Y = f(X1, X2) + ε in R using the x1, x2, eps generated in (a), and the function f(·) specified in (b). Estimate the test error for a least squares regression model, and the test error for a regression tree (for a number of values of tree size), and display the results in a plot. The plot should show tree size on the horizontal axis and estimated test error on the vertical axis; the estimated test error for the linear model should be plotted as a horizontal line (since it isn’t a function of tree size). Your result should agree with your intuition from (b).
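One way to carry out (c) is to hold out part of the simulated data, then sweep over tree sizes with `prune.tree`. In this sketch, `f` is defined as an example linear function purely so the code runs; substitute your own choice from (b):

```r
library(tree)

f <- function(x1, x2) 3 + 2 * x1 - x2  # placeholder; use your f from (b)

x1 <- sample(seq(0, 10, len = 1000))
x2 <- sample(seq(0, 10, len = 1000))
eps <- rnorm(1000)
dat <- data.frame(y = f(x1, x2) + eps, x1, x2)

train <- sample(1000, 500)             # 50/50 train/test split
test <- dat[-train, ]

# Test MSE of the least squares fit (the horizontal reference line).
lm.fit <- lm(y ~ x1 + x2, data = dat, subset = train)
lm.err <- mean((test$y - predict(lm.fit, test))^2)

# Test MSE of pruned trees over a range of sizes.
big <- tree(y ~ x1 + x2, data = dat, subset = train,
            control = tree.control(nobs = 500, mindev = 0))
sizes <- 2:15
tree.err <- sapply(sizes, function(k) {
  pk <- prune.tree(big, best = k)
  mean((test$y - predict(pk, test))^2)
})

plot(sizes, tree.err, type = "b",
     xlab = "Tree size", ylab = "Estimated test error")
abline(h = lm.err, lty = 2)            # linear model's test error
```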

(d) Now repeat (b), but this time find a function for which the regression tree can be expected to outperform the least squares model.

(e) Now repeat (c), this time using the function from (d).
