## Description

Machine Learning

Homework #1

Reminder: While you are encouraged to think about problems in small groups, all written solutions

must be independently generated. Please type or hand-write solutions legibly. While these questions require thought, please be as concise and clear in your answer as possible. Please address all questions to

http://piazza.com/class#winter2020/eecs545 with a reference to the specific question in the subject

line (E.g. RE: Homework 1, Q2(c)). For any solutions that require programming, please use Numpy to

implement the machine learning algorithms from scratch. Please submit code as well as the written solution

with any figures you are required to plot.

1 [21 points] Logistic regression

(a) [8 points] Consider the log-likelihood function for logistic regression:

`(w) = X

N

i=1

y

(i)

log h(x

(i)

) + (1 − y

(i)

) log(1 − h(x

(i)

)),

where h(x) = sigmoid(wT x) = 1

1+exp(−wT x)

.

Find the Hessian H of this function, and show that is negative semi-definite and thus ` is concave and

has no local maxima other than the global one. That is, show that

z

T Hz ≤ 0

for any vector z. [Hint: You might want to start by showing the fact that P

i

P

j

zixixj zj = (x

T z)

2 ≥ 0.]

(b) [8 points] Using the H you calculated in part (a), write down the update rule implied by Newton’s

method for optimizing `(w). Now use this rule (and not a library function) to implement Newton’s

method and apply it to binary classification problem specified in files q1x.npy and q1y.npy. The two

columns of q1x.npy represent the inputs (x

(i)

) and q1y.npy represents the outputs (y

(i) ∈ {0, 1}), with

one training example per row. Initialize Newtons method with w = 0 (the vector of all zeros). What

are the coefficients w, including the intercept term, resulting from your fit? [Hint: You might refer to

the code in q1 hint.py to start programming.]

(c) [5 points] Plot the training data (your axes should be x1 and x2, corresponding to the two coordinates

of the inputs, and you should use a different symbol for each point plotted to indicate whether that

example had label 1 or 0). Also plot on the same figure the decision boundary fit by logistic regression.

(I.e., this should be a straight line showing the boundary separating the region where h(x) 0.5 from

where h(x) ≤ 0.5.)

1

2 [32 points] Linear regression on a polynomial

The files q2xTrain.npy, q2xTest.npy, q2yTrain.npy and q2yTest.npy specify a linear regression problem

for a polynomial. q2xTrain.npy represent the inputs (x

(i) ∈ R) and q2yTrain.npy represents the outputs

(y

(i) ∈ R) of the training set, with one training example per row.

(a) [12 points] For each of the following optimization methods, find the coefficients (slope and intercept)

that minimize the error for a first degree polynomial.

• Batch gradient descent

• Stochastic gradient descent

• Newton’s method

i. [9 points] Give the coefficients generated by each of the three optimization methods.

ii. [3 points] How did the methods compare in terms of rate of convergence? [Hint: The training

process can be viewed as convergent when the mean squared error on the training dataset is low

enough (e.g ≤ 0.2) and stable.]

(b) [10 points] Next, you will investigate the problem of over-fitting. Recall the figure from lecture that

explored over-fitting as a function of the degree of the polynomial M, where the Root-Mean-Square

(RMS) Error is define as

ERMS =

p

2E(w∗)/N,

where

E(w) = 1

2

X

N

i=1

(

X

M

j=0

wjφj (x

(i)

) − y

(i)

)

2 =

1

2

X

N

i=1

(wT φ(x

(i)

) − y

(i)

)

2

:

i. [7 points] Using Newton’s method, find the coefficients that minimize the error for a M-degree

polynomial (for M = 0 . . . 9) for the training data specified in q2xTrain.npy and q2yTrain.npy.

Now use these parameters to regenerate the above chart, using q2xTest.npy and q2yTest.npy as

the test data.

2

ii. [3 points] Which degree polynomial would you say best fits the data? Was there evidence of

under/over-fitting the data? Use your generated charts to defend your answer.

(c) [10 points] Finally, you will explore the role of regularization. Recall the image from lecture that

explored the affect of the regularization factor λ:

i. [7 points] Using Newton’s method, find the coefficients that minimize the error for a ninth degree

polynomial (M = 9) given regularization factor λ (for λ = {0, 10−6

, 10−5

, . . . , 10−1

, 100

(1)}) for

the training data specified in q2xTrain.npy and q2yTrain.npy. Now use these parameters to plot

the ERMS of both the training data and test data as a function of λ (regenerate the above chart,

using q2xTest.npy and q2yTest.npy as the test data). Specifically, use the following regularized

objective function:

1

2

X

N

i=1

(wT φ(x

(i)

) − y

(i)

)

2 +

λ

2

||w||2

2

.

for optimizing the parameters w, but please use the original (unregularized) ERMS for plotting.

ii. [3 points] Which λ value seemed to work the best?

3 [24 points] Locally weighted linear regression

Consider a linear regression problem in which we want to weight different training examples differently.

Specifically, suppose we want to minimize

ED(w) = 1

2

X

N

i=1

r

(i)

(wT x

(i) − y

(i)

)

2

.

In class, we worked out what happens for the case where all the weights (the r

(i)

’s) are the same. In this

problem, we will generalize some of those ideas to the weighted setting, and also implement the locally

weighted linear regression algorithm. (Note that the weight r

(i)

can be different for each of the training

example.)

3

(a) [2 points] Show that ED(w) can also be written

ED(w) = (Xw − y)

T R(Xw − y)

for an appropriate diagonal matrix R, and where X is a matrix whose i-th row is x

(i) and y is the vector

whose i-th entry is y

(i)

. State clearly what R is.

(b) [6 points] If all the r

(i)

’s equal 1, then we saw in class that the normal equation is

XT Xw = XT y,

and that the value of w∗

that minimizes ED(w) is given by (XT X)

−1XT y. By finding the derivative

∇wED(w) and setting that to zero, generalize the normal equation to this weighted setting, and give

the new value of w∗

that minimizes ED(w) in closed form as a function of X, R and y.

(c) [5 points] Suppose we have a training set {(x

(i)

, y(i)

);i = 1 . . . , N} of N independent examples, but in

which the y

(i)

’s were observed with differing variances. Specifically, suppose that

p(y

(i)

|x

(i)

; w) = 1

√

2πσ(i)

exp(−

(y

(i) − wT x

(i)

)

2

2(σ

(i))

2

)

I.e., y

(i) has mean wT x

(i) and variance (σ

(i)

)

2

(where the σ

(i)

’s are fixed, known, constants). Show that

finding the maximum likelihood estimate of w reduces to solving a weighted linear regression problem.

State clearly what the r

(i)

’s are in terms of the σ

(i)

’s.

(d) [11 points] The following items will use the files q3x.npy which contains the inputs (x

(i)

) and q3y.npy

which contains the outputs (y

(i)

) for a linear regression problem, with one training example per row.

i. [2 points] Implement (unweighted) linear regression (y = wT x) on this dataset (using the normal

equations), and plot on the same figure the data and the straight line resulting from your fit.

(Remember to include the intercept term.)

ii. [6 points] Implement locally weighted linear regression on this dataset (using the weighted normal

equations you derived in part (b)), and plot on the same figure the data and the curve resulting

from your fit. When evaluating h(·) at a query point x (which is real-valued in this problem), use

weights

r

(i) = exp(−

(x − x

(i)

)

2

2τ

2

)

with a bandwidth parameter τ = 0.8. (Again, remember to include the intercept term.)

iii. [3 points] Repeat (ii) four times with τ = 0.1, 0.3, 2 and 10. Comment briefly on what happens

to the fit when τ is too small or too large.

4 Derivation and Proof

(a) [8 points] Consider the linear regression problem for 1D data, where we would like to learn a function

h(x) = w1x + w0 with parameters w0 and w1 to minimize the sum squared error L =

1

2

PN

i=1(y

(i) −

h(x

(i)

))2

for N pairs of data samples (x

(i)

, y(i)

). Derive the solution for w0 and w1 for this 1D case of

linear regression. Show the derivation to get the solution w0 = Y¯ − w1X¯ and w1 =

1

N

PN

i=1 x

(i)y

(i)−Y¯ X¯

1

N

PN

i=1 x(i)2−X¯ 2

where X¯ is the mean of {x

(1), x(2)

, · · · , x(N)} and Y¯ is the mean of {y

(1), y(2)

, · · · , y(N)}.

4

(b) [15 points] Consider the definition and property of positive (semi-)definite matrix. Let A be a real,

symmetric d × d matrix. A is positive semi-definite (PSD) if, for all z ∈ Rd

, z

T Az ≥ 0. A is positive

definite (PD) if, for all z 6= 0, z

T Az 0. We write A ? 0 when A is PSD, and A ? 0 when

A is PD. The spectral theorem says that every real symmetric matrix A can be expressed via the

spectral decomposition A = UΛUT where U is a d × d matrix such that UUT = UT U = I and

Λ = diag(λ1, λ2, · · · , λd). Multiplying on the right by U we see that AU = UΛ. If we let ui denote the

i-th column of U, we have Aui = λiui for each i. Then λi are eigenvalues of A, and the corresponding

columns are eigenvectors associated to λi

. The eigenvalues constitute the ”spectrum” of A, and the

spectral decomposition is also called the eigenvalue decomposition of A.

i. [6 points] Prove A is PD iff λi 0 for each i.

ii. [9 points] Consider the linear regression problem where Φ and y are as defined in class and the closed

form solution is (ΦT Φ)−1Φ

T y. We can get the eigenvalues of symmetric matrix ΦT Φ using spectral

decomposition. We apply ridge regression, and the symmetric matrix in solution is ΦT Φ+βI. Prove

that the ridge regression has an effect of shifting all singular values by a constant β. For any β 0,

ridge regression makes the matrix ΦT Φ + βI PD.

Credits

Some questions adopted/adapted from http://www.stanford.edu/class/cs229/ps/ps1.pdf and from Bishop

PRML.

5