Sale!

CPSC 340 Assignment 4 solution

$30.00

CPSC 340 Assignment 4
Instructions
Rubric: {mechanics:3}
The above points are allocated for following the general homework instructions. In addition to the usual
instructions, we have a NEW REQUIREMENT for this assignment (and future assignments, unless
it’s a disaster): if you’re embedding your answers in a document that also contains the questions, your
answers should be in blue text. This should hopefully make it much easier for the grader to find your
answers. To make something blue, you can use the LaTeX macro \blu{my text}.
1 Convex Functions

Category:

Description

5/5 - (2 votes)

CPSC 340 Assignment 4
Instructions
Rubric: {mechanics:3}
The above points are allocated for following the general homework instructions. In addition to the usual
instructions, we have a NEW REQUIREMENT for this assignment (and future assignments, unless
it’s a disaster): if you’re embedding your answers in a document that also contains the questions, your
answers should be in blue text. This should hopefully make it much easier for the grader to find your
answers. To make something blue, you can use the LaTeX macro \blu{my text}.
1 Convex Functions
Rubric: {reasoning:5}
Recall that convex loss functions are typically easier to minimize than non-convex functions, so it’s important
to be able to identify whether a function is convex.
Show that the following functions are convex:
1. f(w) = αw2 − βw + γ with w ∈ R, α ≥ 0, β ∈ R, γ ∈ R (1D quadratic).
2. f(w) = w log(w) with w > 0 (“neg-entropy”)
3. f(w) = kXw − yk
2 + λkwk1 with w ∈ R
d
, λ ≥ 0 (L1-regularized least squares).
4. f(w) = Pn
i=1 log(1 + exp(−yiw
T xi)) with w ∈ R
d
(logistic regression).
5. f(w, w0) = PN
i=1[max{0, w0 − w
T xi} − w0] + λ
2
kwk
2
2 with w ∈ Rd
, w0 ∈ R, λ ≥ 0 (“1-class” SVM).
General hint: for the first two you can check that the second derivative is non-negative since they are onedimensional. For the last 3 you’ll have to use some of the results regarding how combining convex functions
can yield convex functions; these “notes on convexity” are posted on the course homepage as readings for
Lecture 10.
Hint for part 4 (logistic regression): this function may seem non-convex since it contains log(z) and log is
concave, but there is a flaw in that reasoning: for example log(exp(z)) = z is convex despite containing a
log. To show convexity, you can reduce the problem to showing that log(1 + exp(z)) is convex, which can
be done by computing the second derivative. It may simplify matters to note that exp(z)
1+exp(z) =
1
1+exp(−z)
.
2 Logistic Regression with Sparse Regularization
If you run python main.py -q 2, it will:
1. Load a binary classification dataset containing a training and a validation set.
2. ‘Standardize’ the columns of X and add a bias variable (in utils.load dataset).
3. Apply the same transformation to Xvalidate (in utils.load dataset).
1
4. Fit a logistic regression model.
5. Report the number of features selected by the model (number of non-zero regression weights).
6. Report the error on the validation set.
Logistic regression does ok on this dataset, but it uses all the features (even though only the prime-numbered
features are relevant) and the validation error is above the minimum achievable for this model (which is 1
percent, if you have enough data and know which features are relevant). In this question, you will modify
this demo to use different forms of regularization to improve on these aspects.
Note: your results may vary a bit depending on versions of Python and its libraries.
2.1 L2-Regularization
Rubric: {code:2}
Make a new class, logRegL2, that takes an input parameter λ and fits a logistic regression model with
L2-regularization. Specifically, while logReg computes w by minimizing
f(w) = Xn
i=1
log(1 + exp(−yiw
T xi)),
your new function logRegL2 should compute w by minimizing
f(w) = Xn
i=1

log(1 + exp(−yiw
T xi))
+
λ
2
kwk
2
.
Hand in your updated code. Using this new code with λ = 1, report how the following quantities change:
the training error, the validation error, the number of features used, and the number of gradient descent
iterations.
Note: as you may have noticed, lambda is a special keyword in Python and therefore we can’t use it as a
variable name. As an alternative I humbly suggest lammy, which is what my neice calls her stuffed animal
toy lamb. However, you are free to deviate from this suggestion. In fact, as of Python 3 one can now use
actual greek letters as variable names, like the λ symbol. But, depending on your text editor, it may be
annoying to input this symbol.
2.2 L1-Regularization
Rubric: {code:3}
Make a new class, logRegL1, that takes an input parameter λ and fits a logistic regression model with
L1-regularization,
f(w) = Xn
i=1

log(1 + exp(−yiw
T xi))
+ λkwk1.
Hand in your updated code. Using this new code with λ = 1, report how the following quantities change:
the training error, the validation error, the number of features used, and the number of gradient descent
iterations.
You should use the function minimizers.findMinL1, which implements a proximal-gradient method to minimize the sum of a differentiable function g and λkwk1,
f(w) = g(w) + λkwk1.
2
This function has a similar interface to findMin, EXCEPT that (a) you only pass in the the function/gradient
of the differentiable part, g, rather than the whole function f; and (b) you need to provide the value λ. Thus,
your funObj shouldn’t actually contain the L1 regularization, since it’s implied in the way you express your
objective to the optimizer.
2.3 L0-Regularization
Rubric: {code:4}
The class logRegL0 contains part of the code needed to implement the forward selection algorithm, which
approximates the solution with L0-regularization,
f(w) = Xn
i=1

log(1 + exp(−yiw
T xi))
+ λkwk0.
The for loop in this function is missing the part where we fit the model using the subset selected new, then
compute the score and updates the minLoss/bestFeature. Modify the for loop in this code so that it fits
the model using only the features selected new, computes the score above using these features, and updates
the minLoss/bestFeature variables. Hand in your updated code. Using this new code with λ = 1, report the
training error, validation error, and number of features selected.
Note that the code differs a bit from what we discussed in class, since we assume that the first feature is the
bias variable and assume that the bias variable is always included. Also, note that for this particular case
using the L0-norm with λ = 1 is equivalent to what is known as the Akaike Information Criterion (AIC) for
variable selection.
2.4 Discussion
Rubric: {reasoning:2}
In a short paragraph, briefly discuss your results from the above. How do the different forms of regularization
compare with each other? Can you provide some intuition for your results? No need to write a long essay,
please!
2.5 Comparison with scikit-learn
Rubric: {reasoning:1}
Compare your results (training error, validation error, number of nonzero weights) for L2 and L1 regularization with scikit-learn’s LogisticRegression. Use the penalty parameter to specify the type of regularization.
The parameter C corresponds to 1
λ
, so if you had λ = 1 then use C=1 (which happens to be the default
anyway). You should set fit_intercept to False since we’ve already added the column of ones to X and
thus there’s no need to explicitly fit an intercept parameter. After you’ve trained the model, you can access
the weights with model.coef_.
3 Multi-Class Logistic
If you run python main.py -q 3 the code loads a multi-class classification dataset with yi ∈ {0, 1, 2, 3, 4}
and fits a ‘one-vs-all’ classification model using least squares, then reports the validation error and shows
3
a plot of the data/classifier. The performance on the validation set is ok, but could be much better. For
example, this classifier never even predicts that examples will be in classes 0 or 4.
3.1 Softmax Classification, toy example
Rubric: {reasoning:2}
Linear classifiers make their decisions by finding the class label c maximizing the quantity w
T
c xi
, so we want
to train the model to make w
T
yi
xi
larger than w
T
c
0xi for all the classes c
0
that are not yi
. Here c
0
is a possible
label and wc
0 is row c
0 of W. Similarly, yi
is the training label, wyi
is row yi of W, and in this setting we are
assuming a discrete label yi ∈ {1, 2, . . . , k}. Before we move on to implementing the softmax classifier to fix
the issues raised in the introduction, let’s work through a toy example:
Consider the dataset below, which has n = 10 training examples, d = 2 features, and k = 3 classes:
X =
















0 1
1 0
1 0
1 1
1 1
0 0
1 0
1 0
1 1
1 0
















, y =
















1
1
1
2
2
2
3
3
3
3
















.
Suppose that you want to classify the following test example:
xˆ =

1 1
.
Suppose we fit a multi-class linear classifier using the softmax loss, and we obtain the following weight
matrix:
W =


+2 −1
+2 +2
+3 −1


Under this model, what class label would we assign to the test example? (Show your work.)
3.2 One-vs-all Logistic Regression
Rubric: {code:2}
Using the squared error on this problem hurts performance because it has ‘bad errors’ (the model gets
penalized if it classifies examples ‘too correctly’). Write a new class, logLinearClassifier, that replaces the
squared loss in the one-vs-all model with the logistic loss. Hand in the code and report the validation error.
3.3 Softmax Classifier Implementation
Rubric: {code:5}
Using a one-vs-all classifier hurts performance because the classifiers are fit independently, so there is no
attempt to calibrate the columns of the matrix W. An alternative to this independent model is to use the
4
softmax loss, which is given by
f(W) = Xn
i=1 ”
−w
T
yi
xi + log X
k
c
0=1
exp(w
T
c
0xi)
!# ,
The partial derivatives of this function, which make up its gradient, are given by
∂f
∂Wcj
=
Xn
i=1
xij [p(yi = c|W, xi) − I(yi = c)] ,
where…
• I(yi = c) is the indicator function (it is 1 when yi = c and 0 otherwise)
• p(yi = c|W, xi) is the predicted probability of example i being class c, defined as
p(yi
|W, xi) =
exp(w
T
yi
xi)
Pk
c
0=1 exp(wT
c
0xi)
(Good news: in previous offerings of CPSC 340, you had to derive this! I think you’ve probably taken enough
derivatives by now though.)
Make a new class, softmaxClassifier, which fits W using the softmax loss from the previous section instead
of fitting k independent classifiers. Hand in the code and report the validation error.
Hint: you may want to use utils.check_gradient to check that your implementation of the gradient is
correct.
3.4 Comparison with scikit-learn, again
Rubric: {reasoning:1}
Compare your results (training error and validation error for both one-vs-all and softmax) with scikit-learn’s
LogisticRegression, which can also handle multi-class problems. One-vs-all is the default; for softmax,
set multi_class=’multinomial’. For the softmax case, you’ll also need to change the solver. You can use
solver=’lbfgs’. Since your comparison code above isn’t using regularization, set C very large to effectively
disable regularization. Again, set fit_intercept to False for the same reason as above.
3.5 Cost of Multinomial Logistic Regression
Rubric: {reasoning:2}
Assuming that we have
• n training examples.
• d features.
• k classes.
• t testing examples.
• T iterations of gradient descent for training.
1. In O() notation, what is the cost of training the softmax classifier?
2. What is the cost of classifying the test examples?
5
4 Very-Short Answer Questions
Rubric: {reasoning:9}
1. Why would you use a score BIC instead of a validation error for feature selection?
2. Why do we use forward selection instead of exhaustively search all subsets in search and score methods?
3. In L2-regularization, how does λ relate to the two parts of the fundamental trade-off?
4. Give one reason why one might chose to use L1 regularization over L2 and give one reason for the
reverse case.
5. What is the main problem with using least squares to fit a linear model for binary classification?
6. For a linearly separable binary classification problem, how does a linear SVM differ from a classifier
found using the perceptron algorithm?
7. Which of the following methods produce linear classifiers? (a) binary least squares as in Question 3,
(b) the perceptron algorithm, (c) SVMs, and (d) logistic regression.
8. What is the difference between multi-label and multi-class classification?
9. Fill in the question marks: for one-vs-all multi-class logistic regression, we are solving ?? optimization
problem(s) of dimension ??. On the other hand, for softmax logistic regression, we are solving ??
optimization problem(s) of dimension ??.
Hints: we’re looking for short and concise 1-sentence answers, not long and complicated answers. Also, there
is roughly 1 question per lecture.
6