## Description

STAT 435

Homework # 2

Online Submission Via Canvas

Instructions: You may discuss the homework problems in small groups, but you must write up the final solutions and code yourself. Please turn in your code for the problems that involve coding. However, for the problems that involve coding, you must also provide written answers: you will receive no credit if you submit code without written answers. You might want to use Rmarkdown to prepare your assignment.

1. Suppose we have a quantitative response $Y$ and a single feature $X \in \mathbb{R}$. Let $\mathrm{RSS}_1$ denote the residual sum of squares that results from fitting the model

$$Y = \beta_0 + \beta_1 X + \epsilon$$

using least squares. Let $\mathrm{RSS}_{12}$ denote the residual sum of squares that results from fitting the model

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$$

using least squares.

(a) Prove that $\mathrm{RSS}_{12} \le \mathrm{RSS}_1$.

(b) Prove that the $R^2$ of the model containing just the feature $X$ is no greater than the $R^2$ of the model containing both $X$ and $X^2$.
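A numerical check is not a proof, but it can help build intuition for parts (a) and (b). The sketch below fits both models with lm() on simulated data and compares their RSS and $R^2$. The seed, sample size, and true coefficients are illustrative assumptions and are not part of the assignment.

```r
# Illustrative simulation; the data-generating values below are assumptions.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)     # the true relationship is linear in x

fit1  <- lm(y ~ x)            # Y = b0 + b1 X + error
fit12 <- lm(y ~ x + I(x^2))   # Y = b0 + b1 X + b2 X^2 + error

# Residual sums of squares and R^2 for each fit
rss1  <- sum(resid(fit1)^2)
rss12 <- sum(resid(fit12)^2)
r2_1  <- summary(fit1)$r.squared
r2_12 <- summary(fit12)$r.squared

print(c(RSS1 = rss1, RSS12 = rss12))
print(c(R2_1 = r2_1, R2_12 = r2_12))
```

Even though the data here were generated from a model that is linear in x, the larger model should report an RSS that is no larger and an $R^2$ that is no smaller. The proof has to explain why this holds for every data set, not just this one.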

2. Describe the null hypotheses to which the p-values in Table 3.4 of the textbook correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.

3. Consider a linear model with just one feature,

$$Y = \beta_0 + \beta_1 X + \epsilon.$$

Suppose we have $n$ observations from this model, $(x_1, y_1), \ldots, (x_n, y_n)$. The least squares estimator is given in (3.4) of the textbook. Furthermore, we saw in class that if we construct an $n \times 2$ matrix $\tilde{X}$ whose first column is a vector of 1's and whose second column is a vector with elements $x_1, \ldots, x_n$, and if we let $y$ denote the vector with elements $y_1, \ldots, y_n$, then the least squares estimator takes the form

$$\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} = \left( \tilde{X}^T \tilde{X} \right)^{-1} \tilde{X}^T y. \tag{1}$$

Prove that (1) agrees with equation (3.4) of the textbook, i.e. $\hat{\beta}_0$ and $\hat{\beta}_1$ in (1) equal $\hat{\beta}_0$ and $\hat{\beta}_1$ in (3.4).
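Before attempting the proof, it may help to confirm numerically that the two formulas give the same answer. The sketch below computes the closed-form estimates (slope as the ratio of the centered cross-product sum to the centered sum of squares, intercept from the sample means) and the matrix form in (1), on simulated data. The simulated values and the variable names (X_tilde, b_mat) are assumptions made for illustration.

```r
# Illustrative data; the seed and true coefficients are assumptions.
set.seed(2)
n <- 50
x <- runif(n, 0, 10)
y <- 3 - 0.5 * x + rnorm(n)

# Closed-form estimates, in the form of (3.4)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Matrix form (1): solve (X^T X) b = X^T y instead of inverting explicitly
X_tilde <- cbind(1, x)
b_mat <- solve(crossprod(X_tilde), crossprod(X_tilde, y))

print(c(b0 = b0, b1 = b1))
print(as.vector(b_mat))
print(all.equal(c(b0, b1), as.vector(b_mat)))
```

Using solve() with two arguments avoids forming the inverse explicitly. That is a numerical convenience only; algebraically it gives the same result as the inverse written in (1).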

4. This question involves the use of multiple linear regression on the Auto data set, which is available as part of the ISLR library.

(a) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

i. Is there a relationship between the predictors and the response?

ii. Which predictors appear to have a statistically significant relationship to the response?

iii. Provide an interpretation for the coefficient associated with the variable year.

Make sure that you treat the qualitative variable origin appropriately.

(b) Try out some models to predict mpg using functions of the variable horsepower. Comment on the best model you obtain. Make a plot with horsepower on the x-axis and mpg on the y-axis that displays both the observations and the fitted function (i.e. $\hat{f}(\text{horsepower})$).

(c) Now fit a model to predict mpg using horsepower, origin, and an interaction between horsepower and origin. Make sure to treat the qualitative variable origin appropriately. Comment on your results. Provide a careful interpretation of each regression coefficient.

5. Consider fitting a model to predict credit card balance using income and student, where student is a qualitative variable that takes on one of three values: student ∈ {graduate, undergraduate, not student}.

(a) Encode the student variable using two dummy variables, one of which equals 1 if student=graduate (and 0 otherwise), and one of which equals 1 if student=undergraduate (and 0 otherwise). Write out an expression for a linear model to predict balance using income and student, using this coding of the dummy variables. Interpret the coefficients in this linear model.

(b) Now encode the student variable using two dummy variables, one of which equals 1 if student=not student (and 0 otherwise), and one of which equals 1 if student=graduate (and 0 otherwise). Write out an expression for a linear model to predict balance using income and student, using this coding of the dummy variables. Interpret the coefficients in this linear model.

(c) Using the coding in (a), write out an expression for a linear model to predict balance using income, student, and an interaction between income and student. Interpret the coefficients in this model.

(d) Using the coding in (b), write out an expression for a linear model to predict balance using income, student, and an interaction between income and student. Interpret the coefficients in this model.

(e) Using simulated data for balance, income, and student, show that the fitted values (predictions) from the models in (a)–(d) do not depend on the coding of the dummy variables (i.e. the models in (a) and (b) yield the same fitted values, as do the models in (c) and (d)).
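One possible way to set up the simulation in part (e) is sketched below for the models in (a) and (b) only. Everything about the simulated data (the seed, the sample size, the income range, and the coefficients used to generate balance) is a hypothetical choice made for illustration, as are the variable names grad, undergrad, and notstud.

```r
# Hypothetical simulated data; none of these values come from the assignment.
set.seed(3)
n <- 200
student <- sample(c("graduate", "undergraduate", "not student"),
                  n, replace = TRUE)
income  <- runif(n, 20, 100)
balance <- 500 + 5 * income +
  100 * (student == "graduate") +
  300 * (student == "undergraduate") +
  rnorm(n, sd = 50)

# Coding (a): dummies for graduate and undergraduate
grad      <- as.numeric(student == "graduate")
undergrad <- as.numeric(student == "undergraduate")
fit_a <- lm(balance ~ income + grad + undergrad)

# Coding (b): dummies for not student and graduate
notstud <- as.numeric(student == "not student")
fit_b <- lm(balance ~ income + notstud + grad)

# Largest absolute difference between the two sets of fitted values
max_diff <- max(abs(fitted(fit_a) - fitted(fit_b)))
print(max_diff)
```

The difference should be zero up to floating-point error, while the individual coefficients differ between the two fits. The interaction models in (c) and (d) can be compared in the same way by adding income-by-dummy product terms to each formula.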

6. Extra Credit. Consider a linear model with just one feature,

$$Y = \beta_0 + \beta_1 X + \epsilon,$$

with $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$. Suppose we have $n$ observations from this model, $(x_1, y_1), \ldots, (x_n, y_n)$. We assume that $x_1, \ldots, x_n$ are fixed, so the only randomness in the model comes from $\epsilon_1, \ldots, \epsilon_n$. Use (3.4) in the textbook (or, if you prefer, the matrix algebra formulation in (1) of this homework assignment) in order to derive the expressions for $\mathrm{Var}(\hat{\beta}_0)$ and $\mathrm{Var}(\hat{\beta}_1)$ given in (3.8) of the textbook.
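A Monte Carlo experiment can serve as a sanity check on the target of the derivation, though it does not replace it. The sketch below holds a design x fixed, redraws only the errors, and compares the empirical variances of the estimates with the variance expressions $\sigma^2 \left[ 1/n + \bar{x}^2 / \sum_i (x_i - \bar{x})^2 \right]$ and $\sigma^2 / \sum_i (x_i - \bar{x})^2$. The design, the true coefficients, $\sigma$, and the number of replications are all illustrative assumptions, and the errors are drawn as normal purely for convenience.

```r
# Illustrative settings; all values below are assumptions.
set.seed(4)
n <- 30
sigma <- 2
x <- runif(n, 0, 10)            # fixed design, reused in every replication
sxx <- sum((x - mean(x))^2)

# Theoretical variances of the intercept and slope estimates
var_b0 <- sigma^2 * (1 / n + mean(x)^2 / sxx)
var_b1 <- sigma^2 / sxx

# Monte Carlo: redraw only the errors, recompute the estimates each time
est <- replicate(5000, {
  y <- 1 + 0.5 * x + rnorm(n, sd = sigma)
  b1 <- sum((x - mean(x)) * (y - mean(y))) / sxx
  c(b0 = mean(y) - b1 * mean(x), b1 = b1)
})
sim_var <- apply(est, 1, var)   # empirical variance of each estimate

print(rbind(theory = c(b0 = var_b0, b1 = var_b1), simulated = sim_var))
```

The simulated variances should be close to the theoretical ones, with the gap shrinking as the number of replications grows.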
