## Description

CPSC 340 Assignment 3

Instructions

Rubric: {mechanics:3}

The above points are allocated for following the general homework instructions on the course homepage.

Contents

1 Vectors, Matrices, and Quadratic Functions 1

1.1 Basic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Converting to Matrix/Vector/Norm Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Minimizing Quadratic Functions as Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Robust Regression and Gradient Descent 3

2.1 Weighted Least Squares in One Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2 Smooth Approximation to the L1-Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.3 Robust Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

3 Linear Regression and Nonlinear Bases 5

3.1 Adding a Bias Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3.2 Polynomial Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

4 Non-Parametric Bases and Cross-Validation 7

4.1 Proper Training and Validation Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

4.2 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

4.3 Cost of Non-Parametric Bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

5 Very-Short Answer Questions 8

5.1 Essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

5.2 These ones are optional and not for marks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1 Vectors, Matrices, and Quadratic Functions

The first part of this question makes you review basic operations on vectors and matrices. If you are

rusty on basic vector and matrix operations, see the notes on linear algebra on the course webpage. If you

have time and want a deeper refresher on linear algebra, I have heard good things about the video series

Essence of Linear Algebra at https://youtu.be/kjBOesZCoqc and the e-book Immersive Linear Algebra at

http://immersivemath.com/ila/index.html. We will continue to use linear algebra heavily throughout

the rest of the course.

1.1 Basic Operations

1

Rubric: {reasoning:3}

Using the definitions below,

α = 5, x =

2

−3

, y =

1

4

, z =

2

0

1

, A =

1 2

2 3

3 −2

,

evaluate the following expressions (show your work, but you may use answers from previous parts to simplify

calculations):

1. x

T x.

2. kxk

2

.

3. x

T

(x + αy).

4. Ax

5. z

T Ax

6. AT A.

If {α, β} are scalars, {x, y, z} are real-valued column-vectors of length d, and {A, B, C} are real-valued d × d

matrices, state whether each of the below statements is true or false in general and give a short explanation.

7. yyT y = kyk

2y.

8. x

T AT

(Ay + Az) = x

T AT Ay + z

T AT Ax.

9. x

T

(B + C) = Bx + Cx.

10. (A + BC)

T = AT + C

T BT

.

11. (x − y)

T

(x − y) = kxk

2 − x

T y + kyk

2

.

12. (x − y)

T

(x + y) = kxk

2 − kyk

2

.

Hint: check the dimensions of the result, and remember that matrix multiplication is generally not commutative.

1.2 Converting to Matrix/Vector/Norm Notation

Rubric: {reasoning:2}

Using our standard supervised learning notation (X, y, w) express the following functions in terms of vectors,

matrices, and norms (there should be no summations or maximums).

1. Pn

i=1 |w

T xi − yi

|.

2. maxi∈{1,2,…,n} |w

T xi − yi

| +

λ

2

Pd

j=1 w

2

j

.

3. Pn

i=1 zi(w

T xi − yi)

2 + λ

Pd

j=1 |wj |.

You can use Z to denote a diagonal matrix that has the values zi along the diagonal.

2

1.3 Minimizing Quadratic Functions as Linear Systems

Rubric: {reasoning:3}

Write finding a minimizer w of the functions below as a system of linear equations (using vector/matrix

notation and simplifying as much as possible). Note that all the functions below are convex so finding a w

with ∇f(w) = 0 is sufficient to minimize the functions (but show your work in getting to this point).

1. f(w) = 1

2

kw − vk

2

.

2. f(w) = 1

2

kwk

2 + w

T XT y .

3. f(w) = 1

2

Pn

i=1 zi(w

T xi − yi)

2

.

Above we assume that v is a d × 1 vector.

Hint: Once you convert to vector/matrix notation, you can use the results from class to quickly compute

these quantities term-wise. As a sanity check for your derivation, make sure that your results have the right

dimensions.

2 Robust Regression and Gradient Descent

If you run python main.py -q 2, it will load a one-dimensional regression dataset that has a non-trivial

number of ‘outlier’ data points. These points do not fit the general trend of the rest of the data, and pull

the least squares model away from the main downward trend that most data points exhibit:

0.0 0.2 0.4 0.6 0.8 1.0

5

0

5

10

15

20

25

Least Squares

Note: we are fitting the regression without an intercept here, just for simplicity of the homework question.

In reality one would rarely do this. But here it’s OK because the “true” line passes through the origin (by

design). In Q3.1 we’ll address this explicitly.

3

2.1 Weighted Least Squares in One Dimension

Rubric: {code:3}

One of the most common variations on least squares is weighted least squares. In this formulation, we have

a weight zi for every training example. To fit the model, we minimize the weighted squared error,

f(w) = 1

2

Xn

i=1

zi(w

T xi − yi)

2

.

In this formulation, the model focuses on making the error small for examples i where zi

is high. Similarly,

if zi

is low then the model allows a larger error.

Complete the model class, WeightedLeastSquares, that implements this model (note that Q1.3.3 asks you

to show how this formulation can be solved as a linear system). Apply this model to the data containing

outliers, setting z = 1 for the first 400 data points and z = 0.1 for the last 100 data points (which are the

outliers). Hand in your code and the updated plot.

2.2 Smooth Approximation to the L1-Norm

Rubric: {reasoning:3}

Unfortunately, we typically do not know the identities of the outliers. In situations where we suspect that

there are outliers, but we do not know which examples are outliers, it makes sense to use a loss function that

is more robust to outliers. In class, we discussed using the sum of absolute values objective,

f(w) = Xn

i=1

|w

T xi − yi

|.

This is less sensitive to outliers than least squares, but it is non-differentiable and harder to optimize.

Nevertheless, there are various smooth approximations to the absolute value function that are easy to

optimize. One possible approximation is to use the log-sum-exp approximation of the max function1

:

|r| = max{r, −r} ≈ log(exp(r) + exp(−r)).

Using this approximation, we obtain an objective of the form

f(w)=Xn

i=1

log

exp(w

T xi − yi) + exp(yi − w

T xi)

.

which is smooth but less sensitive to outliers than the squared error. Derive the gradient ∇f of this function

with respect to w. You should show your work but you do not have to express the final result in matrix

notation.

2.3 Robust Regression

Rubric: {code:2,reasoning:1}

The class LinearModelGradient is the same as LeastSquares, except that it fits the least squares model using

a gradient descent method. If you run python main.py -q 2.3 you’ll see it produces the same fit as we

obtained using the normal equations.

1Other possibilities are the Huber loss, or |r| ≈ √

r

2 + for some small .

The typical input to a gradient method is a function that, given w, returns f(w) and ∇f(w). See funObj in

LinearModelGradient for an example. Note that the fit function of LinearModelGradient also has a numerical

check that the gradient code is approximately correct, since implementing gradients is often error-prone.2

An advantage of gradient-based strategies is that they are able to solve problems that do not have closedform solutions, such as the formulation from the previous section. The class LinearModelGradient has most

of the implementation of a gradient-based strategy for fitting the robust regression model under the log-sumexp approximation. The only part missing is the function and gradient calculation inside the funObj code.

Modify funObj to implement the objective function and gradient based on the smooth approximation to the

absolute value function (from the previous section). Hand in your code, as well as the plot obtained using

this robust regression approach.

3 Linear Regression and Nonlinear Bases

In class we discussed fitting a linear regression model by minimizing the squared error. In this question,

you will start with a data set where least squares performs poorly. You will then explore how adding a

bias variable and using nonlinear (polynomial) bases can drastically improve the performance. You will also

explore how the complexity of a basis affects both the training error and the test error. In the final part of

the question, it will be up to you to design a basis with better performance than polynomial bases.

3.1 Adding a Bias Variable

Rubric: {code:3,reasoning:1}

If you run python main.py -q 3, it will:

1. Load a one-dimensional regression dataset.

2. Fit a least-squares linear regression model.

3. Report the training error.

4. Report the test error (on a dataset not used for training).

5. Draw a figure showing the training data and what the linear model looks like.

Unfortunately, this is an awful model of the data. The average squared training error on the data set is over

28000 (as is the test error), and the figure produced by the demo confirms that the predictions are usually

nowhere near the training data:

2Sometimes the numerical gradient checker itself can be wrong. See CPSC 303 for a lot more on numerical differentiation.

5

10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5 10.0

200

100

0

100

200

300

Least Squares, no bias

The y-intercept of this data is clearly not zero (it looks like it’s closer to 200), so we should expect to improve

performance by adding a bias (a.k.a. intercept) variable, so that our model is

yi = w

T xi + w0.

instead of

yi = w

T xi

.

In file linear model.py, complete the class, LeastSquaresBias, that has the same input/model/predict format

as the LeastSquares class, but that adds a bias variable (also called an intercept) w0 (also called β in lecture).

Hand in your new class, the updated plot, and the updated training/test error.

Hint: recall that adding a bias w0 is equivalent to adding a column of ones to the matrix X. Don’t forget

that you need to do the same transformation in the predict function.

3.2 Polynomial Basis

Rubric: {code:4,reasoning:1}

Adding a bias variable improves the prediction substantially, but the model is still problematic because the

target seems to be a non-linear function of the input. Complete LeastSquarePoly class, that takes a data

vector x (i.e., assuming we only have one feature) and the polynomial order p. The function should perform

a least squares fit based on a matrix Z where each of its rows contains the values (xi)

j

for j = 0 up to p.

E.g., LeastSquaresPoly.fit(x,y) with p = 3 should form the matrix

Z =

1 x1 (x1)

2

(x1)

3

1 x2 (x2)

2

(x2)

3

.

.

.

1 xn (xn)

2

(xN )

3

,

and fit a least squares model based on it. Hand in the new class, and report the training and test error for

p = 0 through p = 10. Explain the effect of p on the training error and on the test error.

Note: you should write the code yourself; don’t use a library like sklearn’s PolynomialFeatures.

6

4 Non-Parametric Bases and Cross-Validation

Unfortunately, in practice we often don’t know what basis to use. However, if we have enough data then

we can make up for this by using a basis that is flexible enough to model any reasonable function. These

may perform poorly if we don’t have much data, but can perform almost as well as the optimal basis as

the size of the dataset grows. Using the same data set as in the previous question, in this question you

will explore using Gaussian radial basis functions (RBFs), which have this property. These RBFs depend

on a hyperparameter σ, which (like p in the polynomial basis) can be chosen using a validation set. In

this question, you will also see how cross-validation allows you to tune parameters of the model on a larger

dataset than a strict training/validation split would allow.

4.1 Proper Training and Validation Sets

Rubric: {reasoning:3}

If you run python main.py -q 4, it will load the dataset and split the training examples into “train” and

“validation” sets. It will then search for the best value of σ for the RBF basis.3 Once it has the “best” value

of σ, it re-trains on the entire dataset and reports the training error on the full training set as well as the

error on the test set.

Unfortunately, there is a problem with the way this is done: the data isn’t shuffled before being split. As a

result, the IID assumption is badly broken and we end up with poor test error. Here is the plot:

10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5 10.0

200

100

0

100

200

300

Least Squares with RBF kernel and = 64

Fix the problem by either randomizing the split yourself or using sklearn.model selection.train test split

with train size=0.5. Compare the train/test errors and plot before vs. after fixing this problem.

4.2 Cross-Validation

3

if you look at the code you’ll see that it also uses a tiny bit of regularization since ZT Z tends to be very close to singular;

more on this later in the course.

7

Rubric: {code:3,reasoning:1}

Now that we’ve dealt with the randomization, something’s still a bit disturbing: if you run the script more

than once it might choose different values of σ. This variability would be reduced if we had a larger “train”

and “validation” set, and one way to simulate this is with cross-validation.

1. What are two different “best” values of σ you’ve observed after re-running the code a few times? (Not

all students will have the same answer here; that’s OK.)

2. Implement 10-fold cross-validation to select σ, and hand in your code. What value of σ does this

procedure typically select?

4.3 Cost of Non-Parametric Bases

Rubric: {reasoning:3}

When dealing with larger datasets, an important issue is the dependence of the computational cost on the

number of training examples n and the number of features d. What is the cost in big-O notation of training

the model on n training examples with d features under (a) the linear basis and (b) Gaussian RBFs (for a

fixed σ)? What is the cost of classifying t new examples under each of these two bases? When are RBFs

cheaper to train? When are RBFs cheaper to test?

5 Very-Short Answer Questions

5.1 Essentials

Rubric: {reasoning:10}

1. In regression, why do we compute the squared error (yi−yˆi)

2

rather than testing the equality (yi = ˆyi)?

2. Describe a situation in which the least squares estimate would not be unique when d = 2 and n = 4.

3. What is the computational complexity of computing the closed-form (exact) solution to a linear least

squares problem where we have one feature (d = 1) and use polynomial basis of degree p?

4. In what circumstance would a regression tree with linear regressions at the leaves be a better choice

than a linear least squares regression model?

5. When fitting a model, why do we care if our loss function is convex?

6. When should we consider using gradient descent to approximate the solution to the least squares

problem instead of exactly solving it with the closed form solution?

7. Why is optimization non-trivial? Can’t we just set the gradient to zero and be done immediately?

8. Why do we need gradient descent for the robust regression problem, as opposed to just using the

normal equations? Hint: it is NOT because of the non-differentiability. Recall that we used gradient

descent even after smoothing away the non-differentiable part of the loss.

9. What is the problem with having too small of a learning rate in gradient descent?

10. What is the problem with having too large of a learning rate in gradient descent?

8

5.2 These ones are optional and not for marks

1. In LinearModelGradient there’s code that checks your gradient using scipy.optimize.approx fprime.

But, wait a minute: if we can check the gradient, that means we already have it. So, why do we even

bother taking the gradient by hand?

2. What would go wrong if we tried to apply gradient descent to the un-smoothed absolute value loss?

9