# STAT 435 Homework # 5 solution

STAT 435
Homework # 5

Online Submission Via Canvas
Instructions: You may discuss the homework problems in small groups, but you
must write up the final solutions and code yourself. Please turn in your code for the
problems that involve coding. However, for the problems that involve coding, you
must also provide written answers: you will receive no credit if you submit code without written answers. You might want to use Rmarkdown to prepare your assignment.
1. In this exercise, you will generate simulated data, and will use this data to
perform best subset selection.

(a) Use the rnorm() function to generate a predictor $X$ of length $n = 100$,
and a noise vector $\epsilon$ of length $n = 100$.
(b) Generate a response vector $Y$ of length $n = 100$ according to the model
$$Y = 3 - 2X + X^2 + \epsilon.$$
(c) Use the regsubsets() function to perform best subset selection, considering $X, X^2, \ldots, X^7$ as candidate predictors. Make a plot like Figure 6.2
in the textbook. What is the overall best model according to $C_p$, BIC,
and adjusted $R^2$? Report the coefficients of the best model obtained.
(d) Repeat (c) using forward stepwise selection instead of best subset selection.
(e) Repeat (c) using backward stepwise selection instead of best subset selection.
Hint: You may need to use the data.frame() function to create a single data
set containing both $X$ and $Y$.
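
A minimal sketch of the setup described in Problem 1, assuming the leaps package (which provides regsubsets()) is installed. The seed and the object names `x`, `eps`, `y`, `dat`, `fit`, and `best` are illustrative choices, not part of the assignment.

```r
# Illustrative sketch only: the seed and all object names are assumptions.
set.seed(1)
n   <- 100
x   <- rnorm(n)                 # predictor X
eps <- rnorm(n)                 # noise vector
y   <- 3 - 2 * x + x^2 + eps    # response generated from the stated model

# Single data set containing both X and Y, as the hint suggests.
dat <- data.frame(x = x, y = y)

# Best subset selection over X, X^2, ..., X^7 (requires the leaps package).
if (requireNamespace("leaps", quietly = TRUE)) {
  fit <- leaps::regsubsets(y ~ poly(x, 7, raw = TRUE), data = dat, nvmax = 7)
  s   <- summary(fit)
  # Model size preferred by each criterion.
  best <- c(cp = which.min(s$cp), bic = which.min(s$bic),
            adjr2 = which.max(s$adjr2))
  print(best)
  print(coef(fit, id = best[["bic"]]))
}
```

Passing `method = "forward"` or `method = "backward"` to regsubsets() switches the same call to the stepwise procedures.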
2. In class, we discussed the fact that if you choose a model using stepwise selection
on a data set, and then fit the selected model using least squares on the same
data set, then the resulting p-values output by R are highly misleading. We’ll
now see this through simulation.
(a) Use the rnorm() function to generate vectors $X_1, X_2, \ldots, X_{100}$ and $\epsilon$, each
of length $n = 1000$. (Hint: use the matrix() function to create a $1000 \times 100$
data matrix.)
(b) Generate data according to
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_{100} X_{100} + \epsilon,$$
where $\beta_1 = \cdots = \beta_{100} = 0$.
(c) Fit a least squares regression model to predict $Y$ using $X_1, \ldots, X_p$ (here, $p = 100$). Make a
histogram of the p-values associated with the null hypotheses $H_{0j}: \beta_j = 0$
for $j = 1, \ldots, 100$.
Hint: You can easily access these p-values using the command
(summary(lm(y~X)))\$coef[,4].
(d) Recall that under $H_{0j}: \beta_j = 0$, we expect the p-values to have a Unif[0, 1]
distribution. In light of this fact, comment on your results in (c). Do any
of the features appear to be significantly associated with the response?
(e) Perform forward stepwise selection in order to identify $\mathcal{M}_2$, the best two-variable model. (For this problem, there is no need to calculate the best
model $\mathcal{M}_k$ for $k \neq 2$.) Then fit a least squares regression model to the
data, using just the features in $\mathcal{M}_2$. Comment on the p-values obtained
for the coefficients.
(f) Now generate another 1000 observations by repeating the procedure in (a)
and (b). Using the new observations, fit a least squares linear model to
predict $Y$ using just the features in $\mathcal{M}_2$ calculated in (e). (Do not perform
forward stepwise selection again using the new observations! Instead, take
the $\mathcal{M}_2$ obtained earlier in this problem.) Comment on the p-values for
the coefficients. How do they compare to the p-values in (e)?
(g) Are the features in $\mathcal{M}_2$ significantly associated with the response? Justify
your answer.
THE BOTTOM LINE: If you showed a friend the p-values obtained in (e),
without explaining that you obtained $\mathcal{M}_2$ by performing forward stepwise selection on this same data, then he or she might incorrectly conclude that the
features in $\mathcal{M}_2$ are highly associated with the response.
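
A minimal sketch of the null simulation in Problem 2 and of the p-value extraction from the hint. The seed, the object names, and the choice of an intercept of 0 are assumptions made for illustration; the assignment does not fix any of them.

```r
# Illustrative sketch only: the seed and object names are assumptions.
set.seed(1)
n <- 1000
p <- 100
X   <- matrix(rnorm(n * p), nrow = n, ncol = p)   # 1000 x 100 data matrix
eps <- rnorm(n)
y   <- 0 + eps   # intercept of 0 is an assumption; every slope is 0

# p-values for the 100 slopes; row 1 is the intercept, so it is dropped.
pvals <- summary(lm(y ~ X))$coef[-1, 4]

hist(pvals, breaks = 20, main = "p-values under the global null")
mean(pvals < 0.05)   # share of features flagged at the 5% level
```

Because the p-values are roughly uniform under the null, some of them fall below 0.05 by chance alone, which is what makes post-selection p-values misleading.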
3. Let’s consider doing least squares and ridge regression under a very simple
setting, in which $p = 1$, and $\sum_{i=1}^n y_i = \sum_{i=1}^n x_i = 0$. We consider regression
without an intercept. (It’s usually a bad idea to do regression without an
intercept, but if our feature and response each have mean zero, then it is okay
to do this!)
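
The parenthetical remark can be checked numerically. The sketch below uses simulated data (the seed, sample size, and object names are assumptions) and centers both variables so that they sum to zero, then compares the fits with and without an intercept.

```r
# Illustrative sketch only: simulated data, seed, and names are assumptions.
set.seed(1)
x <- rnorm(50)
y <- 3 * x + rnorm(50)
x <- x - mean(x)   # center so that sum(x) = 0
y <- y - mean(y)   # center so that sum(y) = 0

fit_with    <- lm(y ~ x)       # intercept included
fit_without <- lm(y ~ x - 1)   # intercept omitted

coef(fit_with)      # the intercept is zero up to rounding error
coef(fit_without)   # same slope as above
```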
(a) The least squares solution is the value of $\beta \in \mathbb{R}$ that minimizes
$$\sum_{i=1}^n (y_i - \beta x_i)^2.$$
Write out an analytical (closed-form) expression for this least squares
solution. Your answer should be a function of $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$.
Hint: Calculus!!
(b) For a given value of $\lambda$, the ridge regression solution minimizes
$$\sum_{i=1}^n (y_i - \beta x_i)^2 + \lambda \beta^2.$$
Write out an analytical (closed-form) expression for the ridge regression
solution, in terms of $x_1, \ldots, x_n$, $y_1, \ldots, y_n$, and $\lambda$.
(c) Suppose that the true data-generating model is
$$Y = 3X + \epsilon,$$
where $\epsilon$ has mean zero, and $X$ is fixed (non-random). What is the expectation of the least squares estimator from (a)? Is it biased or unbiased?
(d) Suppose again that the true data-generating model is $Y = 3X + \epsilon$, where
$\epsilon$ has mean zero, and $X$ is fixed (non-random). What is the expectation of
the ridge regression estimator from (b)? Is it biased or unbiased? Explain
how the bias changes as a function of $\lambda$.
(e) Suppose that the true data-generating model is $Y = 3X + \epsilon$, where $\epsilon$
has mean zero and variance $\sigma^2$, and $X$ is fixed (non-random), and also
$\mathrm{Cov}(\epsilon_i, \epsilon_{i'}) = 0$ for all $i \neq i'$. What is the variance of the least squares
estimator from (a)?
(f) Suppose that the true data-generating model is $Y = 3X + \epsilon$, where $\epsilon$
has mean zero and variance $\sigma^2$, and $X$ is fixed (non-random), and also
$\mathrm{Cov}(\epsilon_i, \epsilon_{i'}) = 0$ for all $i \neq i'$. What is the variance of the ridge estimator
from (b)? How does the variance change as a function of $\lambda$?
(g) In light of your answers to parts (d) and (f), argue that λ in ridge regression allows us to control model complexity by trading off bias for variance.
Hint: For this problem, you might want to brush up on some basic properties
of means and variances! For instance, if $\mathrm{Cov}(Z, W) = 0$, then $\mathrm{Var}(Z + W) = \mathrm{Var}(Z) + \mathrm{Var}(W)$. And if $a$ is a constant, then $\mathrm{Var}(aW) = a^2 \mathrm{Var}(W)$, and
$\mathrm{Var}(a + W) = \mathrm{Var}(W)$.
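
As a sketch of how the hinted properties combine, consider constants $a_1, \ldots, a_n$ and random variables $W_1, \ldots, W_n$ (both introduced here only for illustration) that are pairwise uncorrelated, each with variance $\sigma^2$. Applying the two rules repeatedly gives

$$\mathrm{Var}\Big(\sum_{i=1}^n a_i W_i\Big) = \sum_{i=1}^n \mathrm{Var}(a_i W_i) = \sum_{i=1}^n a_i^2 \, \mathrm{Var}(W_i) = \sigma^2 \sum_{i=1}^n a_i^2.$$

The first equality uses the zero-covariance rule, and the second uses the rule for a constant multiple.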
4. Suppose that you collect data to predict $Y$ (height in inches) using $X$ (weight
in pounds). You fit a least squares model to the data, and you get
$$\hat{Y} = 3.1 + 0.57X.$$
(a) Suppose you decide that you want to measure weight in ounces instead
of pounds. Write out the least squares model for predicting $Y$ using
$\tilde{X}$ (weight in ounces). (You should calculate the coefficient estimates
explicitly.) Hint: there are 16 ounces in a pound!
(b) Consider fitting a least squares model to predict $Y$ using $X$ and $\tilde{X}$. Let $\beta$
denote the coefficient for $X$ in the least squares model, and let $\tilde{\beta}$ denote
the coefficient for $\tilde{X}$. Argue that any equation of the form
$$\hat{Y} = 3.1 + \beta X + \tilde{\beta} \tilde{X},$$
where $\beta + 16\tilde{\beta} = 0.57$, is a valid least squares model.
(c) Suppose that you use ridge regression to predict $Y$ using $X$, using some
value of $\lambda$, and obtain the fitted model
$$\hat{Y} = 3.1 + 0.4X.$$
Now consider fitting a ridge regression model to predict $Y$ using $\tilde{X}$, again
using that same value of $\lambda$. Will the coefficient of $\tilde{X}$ be equal to $0.4/16$,
(d) For the same value of $\lambda$ considered in (c), suppose you perform ridge regression to predict $Y$ using $X$, and separately you perform ridge regression
to predict $Y$ using $\tilde{X}$. Which fitted model will have smaller residual sum
of squares?
(e) Finally, suppose you use ridge regression to predict $Y$ using $X$ and $\tilde{X}$,
using some value of $\lambda$ (not necessarily the same value of $\lambda$ used in (d)),
and obtain the fitted model
$$\hat{Y} = 3.17 + 0.03X + 0.03\tilde{X}.$$
Claim: Any equation of the form
$$\hat{Y} = 3.17 + \beta X + \tilde{\beta} \tilde{X},$$
where $\beta + 16\tilde{\beta} = 0.03 + 16 \times 0.03 = 0.51$, is a valid ridge regression solution
for that value of $\lambda$.
(f) Argue that your answers to the previous sub-problems support the following claim:
Claim: least squares is scale-invariant, but ridge regression is not.
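
The least squares half of this claim can be illustrated with a short simulation. This is a sketch on made-up data: the seed, the sample size, the coefficients used to simulate heights, and all object names are assumptions and do not reproduce the fitted model quoted in the problem.

```r
# Illustrative sketch only: simulated data, seed, and names are assumptions.
set.seed(1)
weight_lb <- rnorm(40, mean = 150, sd = 20)              # weight in pounds
height_in <- 30 + 0.25 * weight_lb + rnorm(40, sd = 2)   # made-up relationship
weight_oz <- 16 * weight_lb                              # 16 ounces per pound

fit_lb <- lm(height_in ~ weight_lb)
fit_oz <- lm(height_in ~ weight_oz)

coef(fit_lb)
coef(fit_oz)   # slope is the pound slope divided by 16

# The two fits give the same predictions, hence the same RSS.
all.equal(unname(fitted(fit_lb)), unname(fitted(fit_oz)))

# With both columns present the design is rank-deficient, so lm() reports NA
# for one coefficient rather than choosing among the many valid solutions.
coef(lm(height_in ~ weight_lb + weight_oz))
```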
5. Suppose we wish to fit a linear regression model using least squares. Let
$\mathcal{M}_k^{\mathrm{BSS}}$, $\mathcal{M}_k^{\mathrm{FWD}}$, $\mathcal{M}_k^{\mathrm{BWD}}$ denote the best $k$-feature models in the best subset,
forward stepwise, and backward stepwise selection procedures. (For notational
details, see Algorithms 6.1, 6.2, and 6.3 of the textbook.)
Recall that the training set residual sum of squares (or RSS for short) is defined
as $\sum_{i=1}^n (y_i - \hat{y}_i)^2$.
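
The RSS definition translates directly into code. In this sketch the helper name `rss()`, the seed, and the simulated data are assumptions chosen for illustration.

```r
# Illustrative sketch only: rss() and the simulated data are assumptions.
rss <- function(y, y_hat) sum((y - y_hat)^2)

set.seed(1)
dat <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
dat$y <- 1 + 2 * dat$x1 + rnorm(30)

fit_small <- lm(y ~ x1, data = dat)        # one feature
fit_big   <- lm(y ~ x1 + x2, data = dat)   # adds a second feature

rss(dat$y, fitted(fit_small))
rss(dat$y, fitted(fit_big))   # never larger than the nested one-feature RSS
```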
For each claim, fill in the blank with one of the following: “less than”, “less
than or equal to”, “greater than”, “greater than or equal to”, “equal to”. Say
“not enough information to tell” if it is not possible to complete the sentence.
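(a) Claim: The RSS of $\mathcal{M}_1^{\mathrm{FWD}}$ is _____ the RSS of $\mathcal{M}_1^{\mathrm{BWD}}$.
(b) Claim: The RSS of $\mathcal{M}_0^{\mathrm{FWD}}$ is _____ the RSS of $\mathcal{M}_0^{\mathrm{BWD}}$.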
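(c) Claim: The RSS of $\mathcal{M}_1^{\mathrm{FWD}}$ is _____ the RSS of $\mathcal{M}_1^{\mathrm{BSS}}$.
(d) Claim: The RSS of $\mathcal{M}_2^{\mathrm{FWD}}$ is _____ the RSS of $\mathcal{M}_1^{\mathrm{BSS}}$.
(e) Claim: The RSS of $\mathcal{M}_1^{\mathrm{BWD}}$ is _____ the RSS of $\mathcal{M}_1^{\mathrm{BSS}}$.
(f) Claim: The RSS of $\mathcal{M}_p^{\mathrm{BWD}}$ is _____ the RSS of $\mathcal{M}_p^{\mathrm{BSS}}$.
(g) Claim: The RSS of $\mathcal{M}_{p-1}^{\mathrm{BWD}}$ is _____ the RSS of $\mathcal{M}_{p-1}^{\mathrm{BSS}}$.
(h) Claim: The RSS of $\mathcal{M}_4^{\mathrm{BWD}}$ is _____ the RSS of $\mathcal{M}_4^{\mathrm{BSS}}$.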
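(i) Claim: The RSS of $\mathcal{M}_4^{\mathrm{BWD}}$ is _____ the RSS of $\mathcal{M}_4^{\mathrm{FWD}}$.
(j) Claim: The RSS of $\mathcal{M}_4^{\mathrm{BWD}}$ is _____ the RSS of $\mathcal{M}_3^{\mathrm{BWD}}$.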
6. This problem is extra credit!!!! Let y denote an n-vector of response values,
and let X denote an n × p design matrix. We can write the ridge regression
problem as
$$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \; \left\{ \|y - X\beta\|^2 + \lambda \|\beta\|^2 \right\},$$
where we are omitting the intercept for convenience. Derive an analytical