## Description

STAT 435

Homework # 7

Online Submission Via Canvas

Instructions: You may discuss the homework problems in small groups, but you

must write up the final solutions and code yourself. Please turn in your code for the

problems that involve coding. However, for the problems that involve coding, you

must also provide written answers: you will receive no credit if you submit code without written answers. You might want to use Rmarkdown to prepare your assignment.

1. For this problem, you will analyze a data set of your choice, not taken from the

ISLR package. Choose a data set that has n ? p, since you will apply methods

from Chapter 7 to this data. You will also need to have p 1. Throughout this

problem, make sure to label your axes appropriately, and to include legends

when needed.

(a) Describe the data in words. Where did you get it from, and what is

the data about? You will perform supervised learning on this data, so

you must identify a response, Y , and features, X1, . . . , Xp. What are the

values of n and p? Describe the response and the features (e.g. what are

they measuring; are they quantitative or qualitative?).

(b) Fit a generalized additive model, Y = f1(X1) + . . . + fp(Xp) + ?. Use

cross-validation to choose the level of complexity. For j = 1, . . . , p, make

a scatterplot of Xj against Y , and plot ˆfj (Xj ). Comment on your results

and on the choices you made in fitting this model.

(c) Now fit a linear model, Y = β0 + β1X1 + . . . + βpXp + ?. For j = 1, . . . , p,

display the linear fit (Xjβˆ

j ) on top of a scatterplot of Xj against Y .

(d) Estimate the test error of the generalized additive model and the test error

of the linear model. Comment on your results. Which approach gives a

better fit to the data?

2. In this problem, we’ll play around with regression splines.

(a) Generate data as follows:

1

set.seed(7)

x <- 1:1000

y <- sin((1:1000)/100)*4+rnorm(100)

Consider the model

Y = f(X) + ?.

What is the form of f(X) for this simulation setting? What is the value

of Var(?)? What is the value of E(Y − f(X))2

?

(b) Fit regression splines for various numbers of knots to this simulated data,

in order to get spline fits ranging from very wiggly to very smooth. Make

a plot of your results, showing the raw data, the true function f(X), and

the spline fits. Be sure to include a legend containing relevant information,

and to label the axes appropriately.

(c) Based on visual inspection, how many knots seem to give the “best” fit?

Explain your answer.

(d) Now perform cross-validation in order to select the optimal number of

knots. What is the “best” number of knots? Make a plot displaying the

raw data, the true function f(X), and the spline fit ˆf(X) that uses the

number of knots selected by cross-validation. Be sure to include a legend

and to label the axes appropriately. Comment on your results.

(e) Provide an estimate of the test error, E(Y − ˆf(X))2

, associated with the

spline ˆf(·) from (d). How does this relate to your answer in (a)?

(f) Now fit a linear model of the form

Y = β0 + β1X + ?

to the data instead. Plot the raw data and the fitted model and the true

function f(·). Provide an estimate of the test error associated with the

fitted model. Comment on your results.

2