## Description

CPSC 340 Assignment 4

Instructions

Rubric: {mechanics:3}

The above points are allocated for following the general homework instructions. In addition to the usual

instructions, we have a NEW REQUIREMENT for this assignment (and future assignments, unless

it’s a disaster): if you’re embedding your answers in a document that also contains the questions, your

answers should be in blue text. This should hopefully make it much easier for the grader to find your

answers. To make something blue, you can use the LaTeX macro \blu{my text}.

1 Convex Functions

Rubric: {reasoning:5}

Recall that convex loss functions are typically easier to minimize than non-convex functions, so it’s important

to be able to identify whether a function is convex.

Show that the following functions are convex:

1. $f(w) = \alpha w^2 - \beta w + \gamma$ with $w \in \mathbb{R}$, $\alpha \ge 0$, $\beta \in \mathbb{R}$, $\gamma \in \mathbb{R}$ (1D quadratic).

2. $f(w) = w\log(w)$ with $w > 0$ ("neg-entropy").

3. $f(w) = \|Xw - y\|^2 + \lambda\|w\|_1$ with $w \in \mathbb{R}^d$, $\lambda \ge 0$ (L1-regularized least squares).

4. $f(w) = \sum_{i=1}^n \log(1 + \exp(-y_i w^T x_i))$ with $w \in \mathbb{R}^d$ (logistic regression).

5. $f(w, w_0) = \sum_{i=1}^N \left[\max\{0, w_0 - w^T x_i\} - w_0\right] + \frac{\lambda}{2}\|w\|_2^2$ with $w \in \mathbb{R}^d$, $w_0 \in \mathbb{R}$, $\lambda \ge 0$ ("1-class" SVM).

General hint: for the first two you can check that the second derivative is non-negative, since they are one-dimensional. For the last three you'll have to use some of the results regarding how combining convex functions can yield convex functions; these "notes on convexity" are posted on the course homepage as readings for Lecture 10.

Hint for part 4 (logistic regression): this function may seem non-convex since it contains $\log(z)$ and $\log$ is concave, but there is a flaw in that reasoning: for example, $\log(\exp(z)) = z$ is convex despite containing a log. To show convexity, you can reduce the problem to showing that $\log(1 + \exp(z))$ is convex, which can be done by computing the second derivative. It may simplify matters to note that $\frac{\exp(z)}{1+\exp(z)} = \frac{1}{1+\exp(-z)}$.
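Although the question asks for a derivative argument, the hint lends itself to a quick numerical sanity check: a finite-difference estimate of the second derivative of $\log(1+\exp(z))$ should be non-negative on a grid of points. A minimal sketch (not a proof, and not required for the question):

```python
import numpy as np

# Finite-difference check (a sanity check, not a proof) that
# h(z) = log(1 + exp(z)) has a non-negative second derivative.
def h(z):
    return np.log1p(np.exp(z))

z = np.linspace(-10, 10, 2001)
eps = 1e-4
# Central second difference approximates h''(z) at each grid point.
second_deriv = (h(z + eps) - 2 * h(z) + h(z - eps)) / eps**2

print(second_deriv.min() >= -1e-8)  # True: non-negative along the grid
```

The true second derivative here is $\sigma(z)(1-\sigma(z))$, which peaks at $0.25$ at $z=0$, consistent with what the grid shows.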

2 Logistic Regression with Sparse Regularization

If you run python main.py -q 2, it will:

1. Load a binary classification dataset containing a training and a validation set.

2. ‘Standardize’ the columns of X and add a bias variable (in utils.load_dataset).

3. Apply the same transformation to Xvalidate (in utils.load_dataset).

4. Fit a logistic regression model.

5. Report the number of features selected by the model (number of non-zero regression weights).

6. Report the error on the validation set.

Logistic regression does ok on this dataset, but it uses all the features (even though only the prime-numbered

features are relevant) and the validation error is above the minimum achievable for this model (which is 1

percent, if you have enough data and know which features are relevant). In this question, you will modify

this demo to use different forms of regularization to improve on these aspects.

Note: your results may vary a bit depending on versions of Python and its libraries.

2.1 L2-Regularization

Rubric: {code:2}

Make a new class, logRegL2, that takes an input parameter λ and fits a logistic regression model with

L2-regularization. Specifically, while logReg computes w by minimizing

$$f(w) = \sum_{i=1}^n \log(1 + \exp(-y_i w^T x_i)),$$

your new function logRegL2 should compute $w$ by minimizing

$$f(w) = \sum_{i=1}^n \log(1 + \exp(-y_i w^T x_i)) + \frac{\lambda}{2}\|w\|^2.$$

Hand in your updated code. Using this new code with λ = 1, report how the following quantities change:

the training error, the validation error, the number of features used, and the number of gradient descent

iterations.

Note: as you may have noticed, lambda is a special keyword in Python and therefore we can’t use it as a

variable name. As an alternative I humbly suggest lammy, which is what my niece calls her stuffed animal

toy lamb. However, you are free to deviate from this suggestion. In fact, as of Python 3 one can now use

actual greek letters as variable names, like the λ symbol. But, depending on your text editor, it may be

annoying to input this symbol.
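The regularized objective can be sketched as a funObj in the style of the provided logReg class (the funObj(w, X, y) interface is an assumption about the starter code; lammy is the stand-in name for λ suggested in the note above):

```python
import numpy as np

# Sketch of an L2-regularized logistic loss and gradient, assuming a
# funObj-style interface like the course's logReg class.
def funObj_L2(w, X, y, lammy=1.0):
    yXw = y * (X @ w)
    # f(w) = sum_i log(1 + exp(-y_i w^T x_i)) + (lammy/2) ||w||^2
    f = np.sum(np.log1p(np.exp(-yXw))) + 0.5 * lammy * np.sum(w ** 2)
    # Gradient: logistic part plus lammy * w from the regularizer.
    g = -X.T @ (y / (1 + np.exp(yXw))) + lammy * w
    return f, g
```

A finite-difference comparison against the returned gradient is a quick way to catch sign errors before handing this to the optimizer.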

2.2 L1-Regularization

Rubric: {code:3}

Make a new class, logRegL1, that takes an input parameter λ and fits a logistic regression model with

L1-regularization,

$$f(w) = \sum_{i=1}^n \log(1 + \exp(-y_i w^T x_i)) + \lambda\|w\|_1.$$

Hand in your updated code. Using this new code with λ = 1, report how the following quantities change:

the training error, the validation error, the number of features used, and the number of gradient descent

iterations.

You should use the function minimizers.findMinL1, which implements a proximal-gradient method to minimize the sum of a differentiable function $g$ and $\lambda\|w\|_1$,

$$f(w) = g(w) + \lambda\|w\|_1.$$

This function has a similar interface to findMin, EXCEPT that (a) you only pass in the function/gradient

of the differentiable part, g, rather than the whole function f; and (b) you need to provide the value λ. Thus,

your funObj shouldn’t actually contain the L1 regularization, since it’s implied in the way you express your

objective to the optimizer.
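For intuition about why this produces exact zeros: after each gradient step on $g$, a proximal-gradient method applies the proximal operator of $\lambda\|w\|_1$, which is soft-thresholding. A minimal sketch of that operator (illustrative only, not a replacement for the provided findMinL1):

```python
import numpy as np

# Soft-thresholding: the proximal operator of lammy*||w||_1. With step
# size alpha, each weight is shrunk toward zero by alpha*lammy and
# clipped at zero, which is what produces exactly-zero weights.
def soft_threshold(w, alpha, lammy):
    return np.sign(w) * np.maximum(np.abs(w) - alpha * lammy, 0.0)

out = soft_threshold(np.array([0.3, -2.0, 0.05]), alpha=1.0, lammy=0.1)
print(out)  # small weights like 0.05 are zeroed out entirely
```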

2.3 L0-Regularization

Rubric: {code:4}

The class logRegL0 contains part of the code needed to implement the forward selection algorithm, which

approximates the solution with L0-regularization,

$$f(w) = \sum_{i=1}^n \log(1 + \exp(-y_i w^T x_i)) + \lambda\|w\|_0.$$

The for loop in this function is missing the part where we fit the model using the subset selected_new, then compute the score and update minLoss/bestFeature. Modify the for loop in this code so that it fits the model using only the features in selected_new, computes the score above using these features, and updates the minLoss/bestFeature variables. Hand in your updated code. Using this new code with λ = 1, report the

training error, validation error, and number of features selected.

Note that the code differs a bit from what we discussed in class, since we assume that the first feature is the

bias variable and assume that the bias variable is always included. Also, note that for this particular case

using the L0-norm with λ = 1 is equivalent to what is known as the Akaike Information Criterion (AIC) for

variable selection.
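For reference, the overall search-and-score structure might be sketched as below. The fit and score helpers here are hypothetical placeholders (not the provided logRegL0 code), and feature 0 plays the role of the always-included bias:

```python
import numpy as np

# Generic forward-selection sketch with HYPOTHETICAL fit/score helpers:
# greedily add the feature that most reduces
#   score(model) + lammy * (number of features used).
def forward_select(X, y, fit, score, lammy=1.0):
    n, d = X.shape
    selected = {0}              # feature 0 is the bias, always kept
    old_loss = np.inf
    while True:
        best_feature, min_loss = None, old_loss
        for j in set(range(d)) - selected:
            selected_new = sorted(selected | {j})
            model = fit(X[:, selected_new], y)   # fit on the candidate subset
            loss = score(model, X[:, selected_new], y) + lammy * len(selected_new)
            if loss < min_loss:
                min_loss, best_feature = loss, j
        if best_feature is None:  # no remaining feature improves the score
            return sorted(selected)
        selected.add(best_feature)
        old_loss = min_loss
```

The key point mirrored from the description above: the model is refit for every candidate subset, and minLoss/bestFeature track the best candidate in each pass.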

2.4 Discussion

Rubric: {reasoning:2}

In a short paragraph, briefly discuss your results from the above. How do the different forms of regularization

compare with each other? Can you provide some intuition for your results? No need to write a long essay,

please!

2.5 Comparison with scikit-learn

Rubric: {reasoning:1}

Compare your results (training error, validation error, number of nonzero weights) for L2 and L1 regularization with scikit-learn's LogisticRegression. Use the penalty parameter to specify the type of regularization. The parameter C corresponds to $\frac{1}{\lambda}$, so if you had λ = 1 then use C=1 (which happens to be the default

anyway). You should set fit_intercept to False since we’ve already added the column of ones to X and

thus there’s no need to explicitly fit an intercept parameter. After you’ve trained the model, you can access

the weights with model.coef_.
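A sketch of the set-up (the fit calls are commented out because X and y come from the assignment's loader; note that scikit-learn's L1 penalty needs a solver that supports it, such as liblinear):

```python
from sklearn.linear_model import LogisticRegression

# Sketch of the scikit-learn comparison for lambda = 1 (so C = 1).
# fit_intercept=False because the bias column is already in X.
model_l2 = LogisticRegression(penalty="l2", C=1.0, fit_intercept=False)
model_l1 = LogisticRegression(penalty="l1", C=1.0, fit_intercept=False,
                              solver="liblinear")  # a solver supporting L1

# model_l2.fit(X, y); model_l1.fit(X, y)
# nonzeros = (model_l1.coef_ != 0).sum()   # number of features used
```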

3 Multi-Class Logistic

If you run python main.py -q 3 the code loads a multi-class classification dataset with $y_i \in \{0, 1, 2, 3, 4\}$

and fits a ‘one-vs-all’ classification model using least squares, then reports the validation error and shows


a plot of the data/classifier. The performance on the validation set is ok, but could be much better. For

example, this classifier never even predicts that examples will be in classes 0 or 4.

3.1 Softmax Classification, toy example

Rubric: {reasoning:2}

Linear classifiers make their decisions by finding the class label $c$ maximizing the quantity $w_c^T x_i$, so we want to train the model to make $w_{y_i}^T x_i$ larger than $w_{c'}^T x_i$ for all the classes $c'$ that are not $y_i$. Here $c'$ is a possible label and $w_{c'}$ is row $c'$ of $W$. Similarly, $y_i$ is the training label, $w_{y_i}$ is row $y_i$ of $W$, and in this setting we are assuming a discrete label $y_i \in \{1, 2, \dots, k\}$. Before we move on to implementing the softmax classifier to fix the issues raised in the introduction, let's work through a toy example:

Consider the dataset below, which has n = 10 training examples, d = 2 features, and k = 3 classes:

$$X = \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 0 \end{bmatrix}, \quad y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 2 \\ 2 \\ 2 \\ 3 \\ 3 \\ 3 \\ 3 \end{bmatrix}.$$

Suppose that you want to classify the following test example:

$$\hat{x} = \begin{bmatrix} 1 & 1 \end{bmatrix}.$$

Suppose we fit a multi-class linear classifier using the softmax loss, and we obtain the following weight

matrix:

$$W = \begin{bmatrix} +2 & -1 \\ +2 & +2 \\ +3 & -1 \end{bmatrix}.$$

Under this model, what class label would we assign to the test example? (Show your work.)
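As a reminder of the decision rule (using a made-up weight matrix so as not to give away the answer), the score of class $c$ is the dot product of row $c$ of $W$ with the test example, and we predict the argmax:

```python
import numpy as np

# Decision-rule illustration with a MADE-UP weight matrix (deliberately
# not the W from the question): score of class c is w_c^T xhat, and we
# predict the class with the largest score.
W_example = np.array([[2.0, -1.0],
                      [1.0,  3.0],
                      [0.0,  1.0]])
xhat = np.array([1.0, 1.0])
scores = W_example @ xhat      # one score per class: [1., 4., 1.]
print(scores.argmax())         # class index 1 (0-indexed) has the largest score
```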

3.2 One-vs-all Logistic Regression

Rubric: {code:2}

Using the squared error on this problem hurts performance because it has ‘bad errors’ (the model gets

penalized if it classifies examples ‘too correctly’). Write a new class, logLinearClassifier, that replaces the

squared loss in the one-vs-all model with the logistic loss. Hand in the code and report the validation error.

3.3 Softmax Classifier Implementation

Rubric: {code:5}

Using a one-vs-all classifier hurts performance because the classifiers are fit independently, so there is no attempt to calibrate the columns of the matrix W. An alternative to this independent model is to use the softmax loss, which is given by

$$f(W) = \sum_{i=1}^n \left[ -w_{y_i}^T x_i + \log\left( \sum_{c'=1}^k \exp(w_{c'}^T x_i) \right) \right].$$

The partial derivatives of this function, which make up its gradient, are given by

$$\frac{\partial f}{\partial W_{cj}} = \sum_{i=1}^n x_{ij}\left[ p(y_i = c \mid W, x_i) - I(y_i = c) \right],$$

where…

• $I(y_i = c)$ is the indicator function (it is 1 when $y_i = c$ and 0 otherwise)

• $p(y_i = c \mid W, x_i)$ is the predicted probability of example $i$ being class $c$, defined as

$$p(y_i \mid W, x_i) = \frac{\exp(w_{y_i}^T x_i)}{\sum_{c'=1}^k \exp(w_{c'}^T x_i)}$$

(Good news: in previous offerings of CPSC 340, you had to derive this! I think you’ve probably taken enough

derivatives by now though.)

Make a new class, softmaxClassifier, which fits W using the softmax loss from the previous section instead

of fitting k independent classifiers. Hand in the code and report the validation error.

Hint: you may want to use utils.check_gradient to check that your implementation of the gradient is

correct.
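A minimal sketch of the loss and gradient above, assuming labels in {0, …, k−1} and a flattened parameter vector (the assignment's exact funObj interface may differ):

```python
import numpy as np

# Sketch of the softmax loss and gradient for a k x d weight matrix W,
# assuming labels y in {0, ..., k-1}. Optimizers usually pass a flat
# parameter vector, so we reshape on the way in and flatten on the way out.
def softmax_funObj(W, X, y, k):
    n, d = X.shape
    W = W.reshape(k, d)
    XW = X @ W.T                          # n x k scores w_c^T x_i
    XW -= XW.max(axis=1, keepdims=True)   # shift rows for numerical stability
    logZ = np.log(np.exp(XW).sum(axis=1))
    # f(W) = sum_i [ -w_{y_i}^T x_i + log sum_c exp(w_c^T x_i) ]
    f = np.sum(logZ - XW[np.arange(n), y])
    P = np.exp(XW - logZ[:, None])        # p(y_i = c | W, x_i)
    I = np.zeros((n, k))
    I[np.arange(n), y] = 1
    g = (P - I).T @ X                     # k x d gradient, matching dF/dW_cj above
    return f, g.flatten()
```

As the hint suggests, checking this gradient numerically (e.g. with utils.check_gradient, or a finite-difference comparison) is well worth the few lines of code.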

3.4 Comparison with scikit-learn, again

Rubric: {reasoning:1}

Compare your results (training error and validation error for both one-vs-all and softmax) with scikit-learn’s

LogisticRegression, which can also handle multi-class problems. One-vs-all is the default; for softmax,

set multi_class='multinomial'. For the softmax case, you'll also need to change the solver. You can use

solver='lbfgs'. Since your comparison code above isn't using regularization, set C very large to effectively

disable regularization. Again, set fit_intercept to False for the same reason as above.

3.5 Cost of Multinomial Logistic Regression

Rubric: {reasoning:2}

Assuming that we have

• n training examples.

• d features.

• k classes.

• t testing examples.

• T iterations of gradient descent for training.

1. In O() notation, what is the cost of training the softmax classifier?

2. What is the cost of classifying the test examples?


4 Very-Short Answer Questions

Rubric: {reasoning:9}

1. Why would you use a score like BIC instead of a validation error for feature selection?

2. Why do we use forward selection instead of exhaustively searching all subsets in search and score methods?

3. In L2-regularization, how does λ relate to the two parts of the fundamental trade-off?

4. Give one reason why one might choose to use L1 regularization over L2 and give one reason for the reverse case.

5. What is the main problem with using least squares to fit a linear model for binary classification?

6. For a linearly separable binary classification problem, how does a linear SVM differ from a classifier

found using the perceptron algorithm?

7. Which of the following methods produce linear classifiers? (a) binary least squares as in Question 3,

(b) the perceptron algorithm, (c) SVMs, and (d) logistic regression.

8. What is the difference between multi-label and multi-class classification?

9. Fill in the question marks: for one-vs-all multi-class logistic regression, we are solving ?? optimization

problem(s) of dimension ??. On the other hand, for softmax logistic regression, we are solving ??

optimization problem(s) of dimension ??.

Hints: we’re looking for short and concise 1-sentence answers, not long and complicated answers. Also, there

is roughly 1 question per lecture.
