Sale!

# CPSC 340 Assignment 3 solution

\$30.00

CPSC 340 Assignment 3
Instructions
Rubric: {mechanics:3}
The above points are allocated for following the general homework instructions on the course homepage.
Contents
1 Vectors, Matrices, and Quadratic Functions 1
1.1 Basic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Converting to Matrix/Vector/Norm Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Minimizing Quadratic Functions as Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Robust Regression and Gradient Descent 3
2.1 Weighted Least Squares in One Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Smooth Approximation to the L1-Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Robust Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

Category:

## Description

5/5 - (1 vote)

CPSC 340 Assignment 3
Instructions
Rubric: {mechanics:3}
The above points are allocated for following the general homework instructions on the course homepage.
Contents
1 Vectors, Matrices, and Quadratic Functions 1
1.1 Basic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Converting to Matrix/Vector/Norm Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Minimizing Quadratic Functions as Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Robust Regression and Gradient Descent 3
2.1 Weighted Least Squares in One Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Smooth Approximation to the L1-Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Robust Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3 Linear Regression and Nonlinear Bases 5
3.1 Adding a Bias Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.2 Polynomial Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4 Non-Parametric Bases and Cross-Validation 7
4.1 Proper Training and Validation Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.2 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.3 Cost of Non-Parametric Bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
5.1 Essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
5.2 These ones are optional and not for marks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1 Vectors, Matrices, and Quadratic Functions
The first part of this question makes you review basic operations on vectors and matrices. If you are
rusty on basic vector and matrix operations, see the notes on linear algebra on the course webpage. If you
have time and want a deeper refresher on linear algebra, I have heard good things about the video series
Essence of Linear Algebra at https://youtu.be/kjBOesZCoqc and the e-book Immersive Linear Algebra at
http://immersivemath.com/ila/index.html. We will continue to use linear algebra heavily throughout
the rest of the course.
1.1 Basic Operations
1
Rubric: {reasoning:3}
Using the definitions below,
α = 5, x =

2
−3

, y =

1
4

, z =

2
0
1

 , A =

1 2
2 3
3 −2

 ,
evaluate the following expressions (show your work, but you may use answers from previous parts to simplify
calculations):
1. x
T x.
2. kxk
2
.
3. x
T
(x + αy).
4. Ax
5. z
T Ax
6. AT A.
If {α, β} are scalars, {x, y, z} are real-valued column-vectors of length d, and {A, B, C} are real-valued d × d
matrices, state whether each of the below statements is true or false in general and give a short explanation.
7. yyT y = kyk
2y.
8. x
T AT
(Ay + Az) = x
T AT Ay + z
T AT Ax.
9. x
T
(B + C) = Bx + Cx.
10. (A + BC)
T = AT + C
T BT
.
11. (x − y)
T
(x − y) = kxk
2 − x
T y + kyk
2
.
12. (x − y)
T
(x + y) = kxk
2 − kyk
2
.
Hint: check the dimensions of the result, and remember that matrix multiplication is generally not commutative.
1.2 Converting to Matrix/Vector/Norm Notation
Rubric: {reasoning:2}
Using our standard supervised learning notation (X, y, w) express the following functions in terms of vectors,
matrices, and norms (there should be no summations or maximums).
1. Pn
i=1 |w
T xi − yi
|.
2. maxi∈{1,2,…,n} |w
T xi − yi
| +
λ
2
Pd
j=1 w
2
j
.
3. Pn
i=1 zi(w
T xi − yi)
2 + λ
Pd
j=1 |wj |.
You can use Z to denote a diagonal matrix that has the values zi along the diagonal.
2
1.3 Minimizing Quadratic Functions as Linear Systems
Rubric: {reasoning:3}
Write finding a minimizer w of the functions below as a system of linear equations (using vector/matrix
notation and simplifying as much as possible). Note that all the functions below are convex so finding a w
with ∇f(w) = 0 is sufficient to minimize the functions (but show your work in getting to this point).
1. f(w) = 1
2
kw − vk
2
.
2. f(w) = 1
2
kwk
2 + w
T XT y .
3. f(w) = 1
2
Pn
i=1 zi(w
T xi − yi)
2
.
Above we assume that v is a d × 1 vector.
Hint: Once you convert to vector/matrix notation, you can use the results from class to quickly compute
these quantities term-wise. As a sanity check for your derivation, make sure that your results have the right
dimensions.
2 Robust Regression and Gradient Descent
If you run python main.py -q 2, it will load a one-dimensional regression dataset that has a non-trivial
number of ‘outlier’ data points. These points do not fit the general trend of the rest of the data, and pull
the least squares model away from the main downward trend that most data points exhibit:
0.0 0.2 0.4 0.6 0.8 1.0
5
0
5
10
15
20
25
Least Squares
Note: we are fitting the regression without an intercept here, just for simplicity of the homework question.
In reality one would rarely do this. But here it’s OK because the “true” line passes through the origin (by
design). In Q3.1 we’ll address this explicitly.
3
2.1 Weighted Least Squares in One Dimension
Rubric: {code:3}
One of the most common variations on least squares is weighted least squares. In this formulation, we have
a weight zi for every training example. To fit the model, we minimize the weighted squared error,
f(w) = 1
2
Xn
i=1
zi(w
T xi − yi)
2
.
In this formulation, the model focuses on making the error small for examples i where zi
is high. Similarly,
if zi
is low then the model allows a larger error.
Complete the model class, WeightedLeastSquares, that implements this model (note that Q1.3.3 asks you
to show how this formulation can be solved as a linear system). Apply this model to the data containing
outliers, setting z = 1 for the first 400 data points and z = 0.1 for the last 100 data points (which are the
outliers). Hand in your code and the updated plot.
2.2 Smooth Approximation to the L1-Norm
Rubric: {reasoning:3}
Unfortunately, we typically do not know the identities of the outliers. In situations where we suspect that
there are outliers, but we do not know which examples are outliers, it makes sense to use a loss function that
is more robust to outliers. In class, we discussed using the sum of absolute values objective,
f(w) = Xn
i=1
|w
T xi − yi
|.
This is less sensitive to outliers than least squares, but it is non-differentiable and harder to optimize.
Nevertheless, there are various smooth approximations to the absolute value function that are easy to
optimize. One possible approximation is to use the log-sum-exp approximation of the max function1
:
|r| = max{r, −r} ≈ log(exp(r) + exp(−r)).
Using this approximation, we obtain an objective of the form
f(w)=Xn
i=1
log
exp(w
T xi − yi) + exp(yi − w
T xi)

.
which is smooth but less sensitive to outliers than the squared error. Derive the gradient ∇f of this function
with respect to w. You should show your work but you do not have to express the final result in matrix
notation.
2.3 Robust Regression
Rubric: {code:2,reasoning:1}
The class LinearModelGradient is the same as LeastSquares, except that it fits the least squares model using
a gradient descent method. If you run python main.py -q 2.3 you’ll see it produces the same fit as we
obtained using the normal equations.
1Other possibilities are the Huber loss, or |r| ≈ √
r
2 +  for some small .

The typical input to a gradient method is a function that, given w, returns f(w) and ∇f(w). See funObj in
LinearModelGradient for an example. Note that the fit function of LinearModelGradient also has a numerical
check that the gradient code is approximately correct, since implementing gradients is often error-prone.2
An advantage of gradient-based strategies is that they are able to solve problems that do not have closedform solutions, such as the formulation from the previous section. The class LinearModelGradient has most
of the implementation of a gradient-based strategy for fitting the robust regression model under the log-sumexp approximation. The only part missing is the function and gradient calculation inside the funObj code.
Modify funObj to implement the objective function and gradient based on the smooth approximation to the
absolute value function (from the previous section). Hand in your code, as well as the plot obtained using
this robust regression approach.
3 Linear Regression and Nonlinear Bases
In class we discussed fitting a linear regression model by minimizing the squared error. In this question,
you will start with a data set where least squares performs poorly. You will then explore how adding a
bias variable and using nonlinear (polynomial) bases can drastically improve the performance. You will also
explore how the complexity of a basis affects both the training error and the test error. In the final part of
the question, it will be up to you to design a basis with better performance than polynomial bases.
Rubric: {code:3,reasoning:1}
If you run python main.py -q 3, it will:
1. Load a one-dimensional regression dataset.
2. Fit a least-squares linear regression model.
3. Report the training error.
4. Report the test error (on a dataset not used for training).
5. Draw a figure showing the training data and what the linear model looks like.
Unfortunately, this is an awful model of the data. The average squared training error on the data set is over
28000 (as is the test error), and the figure produced by the demo confirms that the predictions are usually
nowhere near the training data:
2Sometimes the numerical gradient checker itself can be wrong. See CPSC 303 for a lot more on numerical differentiation.
5
10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5 10.0
200
100
0
100
200
300
Least Squares, no bias
The y-intercept of this data is clearly not zero (it looks like it’s closer to 200), so we should expect to improve
performance by adding a bias (a.k.a. intercept) variable, so that our model is
yi = w
T xi + w0.
yi = w
T xi
.
In file linear model.py, complete the class, LeastSquaresBias, that has the same input/model/predict format
as the LeastSquares class, but that adds a bias variable (also called an intercept) w0 (also called β in lecture).
Hand in your new class, the updated plot, and the updated training/test error.
Hint: recall that adding a bias w0 is equivalent to adding a column of ones to the matrix X. Don’t forget
that you need to do the same transformation in the predict function.
3.2 Polynomial Basis
Rubric: {code:4,reasoning:1}
Adding a bias variable improves the prediction substantially, but the model is still problematic because the
target seems to be a non-linear function of the input. Complete LeastSquarePoly class, that takes a data
vector x (i.e., assuming we only have one feature) and the polynomial order p. The function should perform
a least squares fit based on a matrix Z where each of its rows contains the values (xi)
j
for j = 0 up to p.
E.g., LeastSquaresPoly.fit(x,y) with p = 3 should form the matrix
Z =

1 x1 (x1)
2
(x1)
3
1 x2 (x2)
2
(x2)
3
.
.
.
1 xn (xn)
2
(xN )
3

,
and fit a least squares model based on it. Hand in the new class, and report the training and test error for
p = 0 through p = 10. Explain the effect of p on the training error and on the test error.
Note: you should write the code yourself; don’t use a library like sklearn’s PolynomialFeatures.
6
4 Non-Parametric Bases and Cross-Validation
Unfortunately, in practice we often don’t know what basis to use. However, if we have enough data then
we can make up for this by using a basis that is flexible enough to model any reasonable function. These
may perform poorly if we don’t have much data, but can perform almost as well as the optimal basis as
the size of the dataset grows. Using the same data set as in the previous question, in this question you
will explore using Gaussian radial basis functions (RBFs), which have this property. These RBFs depend
on a hyperparameter σ, which (like p in the polynomial basis) can be chosen using a validation set. In
this question, you will also see how cross-validation allows you to tune parameters of the model on a larger
dataset than a strict training/validation split would allow.
4.1 Proper Training and Validation Sets
Rubric: {reasoning:3}
If you run python main.py -q 4, it will load the dataset and split the training examples into “train” and
“validation” sets. It will then search for the best value of σ for the RBF basis.3 Once it has the “best” value
of σ, it re-trains on the entire dataset and reports the training error on the full training set as well as the
error on the test set.
Unfortunately, there is a problem with the way this is done: the data isn’t shuffled before being split. As a
result, the IID assumption is badly broken and we end up with poor test error. Here is the plot:
10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5 10.0
200
100
0
100
200
300
Least Squares with RBF kernel and = 64
Fix the problem by either randomizing the split yourself or using sklearn.model selection.train test split
with train size=0.5. Compare the train/test errors and plot before vs. after fixing this problem.
4.2 Cross-Validation
3
if you look at the code you’ll see that it also uses a tiny bit of regularization since ZT Z tends to be very close to singular;
more on this later in the course.
7
Rubric: {code:3,reasoning:1}
Now that we’ve dealt with the randomization, something’s still a bit disturbing: if you run the script more
than once it might choose different values of σ. This variability would be reduced if we had a larger “train”
and “validation” set, and one way to simulate this is with cross-validation.
1. What are two different “best” values of σ you’ve observed after re-running the code a few times? (Not
all students will have the same answer here; that’s OK.)
2. Implement 10-fold cross-validation to select σ, and hand in your code. What value of σ does this
procedure typically select?
4.3 Cost of Non-Parametric Bases
Rubric: {reasoning:3}
When dealing with larger datasets, an important issue is the dependence of the computational cost on the
number of training examples n and the number of features d. What is the cost in big-O notation of training
the model on n training examples with d features under (a) the linear basis and (b) Gaussian RBFs (for a
fixed σ)? What is the cost of classifying t new examples under each of these two bases? When are RBFs
cheaper to train? When are RBFs cheaper to test?
5.1 Essentials
Rubric: {reasoning:10}
1. In regression, why do we compute the squared error (yi−yˆi)
2
rather than testing the equality (yi = ˆyi)?
2. Describe a situation in which the least squares estimate would not be unique when d = 2 and n = 4.
3. What is the computational complexity of computing the closed-form (exact) solution to a linear least
squares problem where we have one feature (d = 1) and use polynomial basis of degree p?
4. In what circumstance would a regression tree with linear regressions at the leaves be a better choice
than a linear least squares regression model?
5. When fitting a model, why do we care if our loss function is convex?
6. When should we consider using gradient descent to approximate the solution to the least squares
problem instead of exactly solving it with the closed form solution?
7. Why is optimization non-trivial? Can’t we just set the gradient to zero and be done immediately?
8. Why do we need gradient descent for the robust regression problem, as opposed to just using the
normal equations? Hint: it is NOT because of the non-differentiability. Recall that we used gradient
descent even after smoothing away the non-differentiable part of the loss.
9. What is the problem with having too small of a learning rate in gradient descent?
10. What is the problem with having too large of a learning rate in gradient descent?
8
5.2 These ones are optional and not for marks