## Description

CPSC 340 Assignment 5

Instructions

Rubric: {mechanics:5}

The above points are allocated for following the general homework instructions. In addition to the usual

instructions: if you’re embedding your answers in a document that also contains the questions, your answers

should be in a colour that clearly stands out, such as green or red. This should hopefully make it much easier

for the grader to find your answers. To make something green, you can use the LaTeX macro \gre{my text}.

Also, READ THIS: Like in a2, you’ll need to grab the data from the course website. FYI: this happens

because I’m using the GitHub API in a fairly silly way, which limits individual files to 1 MB each.

1 MAP Estimation

Rubric: {reasoning:10}

In class, we considered MAP estimation in a regression model where we assumed that:

• The likelihood p(yi

|xi

, w) is a normal distribution with a mean of w

T xi and a variance of 1.

• The prior for each variable j, p(wj ), is a normal distribution with a mean of zero and a variance of

λ

−1

.

Under these assumptions, we showed that this leads to the standard L2-regularized least squares objective

function:

f(w) = 1

2

kXw − yk

2 +

λ

2

kwk

2

.

For each of the alternate assumptions below, show how the loss function would change (simplifying as much

as possible):

1. We use a zero-mean Laplace prior for each variable with a scale parameter of λ

−1

, so that

p(wj ) = λ

2

exp(−λ|wj |).

2. We use a Laplace likelihood with a mean of w

T xi and a scale of 1, so that

p(yi

|xi

, w) = 1

2

exp(−|w

T xi − yi

|).

3. We use a Gaussian likelihood where each datapoint has variance σ

2

instead of 1,

p(yi

|xi

, w) = 1

√

2σ

2π

exp

−

(w

T xi − yi)

2

2σ

2

.

4. We use a Gaussian likelihood where each datapoint has its own variance σ

2

i

,

p(yi

|xi

, w) = 1

p

2σ

2

i π

exp

−

(w

T xi − yi)

2

2σ

2

i

.

1

2 Principal Component Analysis

2.1 PCA by Hand

Rubric: {reasoning:3}

Consider the following dataset, containing 5 examples with 2 features each:

x1 x2

-2 -1

-1 0

0 1

1 2

2 3

Recall that with PCA we usually assume that the PCs are normalized (kwk = 1), we need to center the

data before we apply PCA, and that the direction of the first PC is the one that minimizes the orthogonal

distance to all data points.

1. What is the first principal component?

2. What is the (L2-norm) reconstruction error of the point (3,3)? (Show your work.)

3. What is the (L2-norm) reconstruction error of the point (3,4)? (Show your work.)

2.2 Data Visualization

Rubric: {reasoning:2}

If you run python main.py -q 2, it will load the animals dataset and create a scatterplot based on two

randomly selected features. We label some random points, but because of the binary features the scatterplot

shows us almost nothing about the data.

The class pca.PCA applies the classic PCA method (orthogonal bases via SVD) for a given k. Use this

class so that the scatterplot uses the latent features zi from the PCA model. Make a scatterplot of the two

columns in Z, and label a bunch of the points in the scatterplot. Hand in your code and the scatterplot.

2.3 Data Compression

Rubric: {reasoning:2}

1. How much of the variance is explained by our 2-dimensional representation from the previous question?

2. How many PCs are required to explain 50% of the variance in the data?

3 PCA Generalizations

3.1 Robust PCA

Rubric: {code:10}

If you run python main -q 3.1 the code will load a dataset X where each row contains the pixels from a

single frame of a video of a highway. The demo applies PCA to this dataset and then uses this to reconstruct

2

the original image. It then shows the following 3 images for each frame (pausing and waiting for input

between each frame):

1. The original frame.

2. The reconstruction based on PCA.

3. A binary image showing locations where the reconstruction error is non-trivial.

Recently, latent-factor models have been proposed as a strategy for “background subtraction”: trying to

separate objects from their background. In this case, the background is the highway and the objects are the

cars on the highway. In this demo, we see that PCA does an ok job of identifying the cars on the highway

in that it does tend to identify the locations of cars. However, the results aren’t great as it identifies quite

a few irrelevant parts of the image as objects.

Robust PCA is a variation on PCA where we replace the L2-norm with the L1-norm,

f(Z, W) = Xn

i=1

X

d

j=1

|w

T

j

zi − xij |,

and it has recently been proposed as a more effective model for background subtraction. Complete the

class pca.RobustPCA, that uses a smooth approximation to the absolute value to implement robust PCA.

Comment on the quality of the results.

Hint: most of the work has been done for you in the class pca.AlternativePCA. This work implements an

alternating minimization approach to minimizing the (L2) PCA objective (without enforcing orthogonality).

This gradient-based approach to PCA can be modified to use a smooth approximation of the L1-norm. Note

that the log-sum-exp approximation to the absolute value may be hard to get working due to numerical

issues, and a numerically-nicer approach is to use the “multi-quadric” approximation:

|α| ≈ p

α2 + ,

where controls the accuracy of the approximation (a typical value of is 0.0001).

4 Multi-Dimensional Scaling

If you run python main.py -q 4, the code will load the animals dataset and then apply gradient descent

to minimize the following multi-dimensional scaling (MDS) objective (starting from the PCA solution):

f(Z) = 1

2

Xn

i=1

Xn

j=i+1

(kzi − zjk − kxi − xjk)

2

. (1)

The result of applying MDS is shown below.

3

Although this visualization isn’t perfect (with “gorilla” being placed close to the dogs and “otter” being

placed close to two types of bears), this visualization does organize the animals in a mostly-logical way.

4.1 ISOMAP

Rubric: {code:10}

Euclidean distances between very different animals are unlikely to be particularly meaningful. However,

since related animals tend to share similar traits we might expect the animals to live on a low-dimensional

manifold. This suggests that ISOMAP may give a better visualization. Fill in the class ISOMAP so that

it computes the approximate geodesic distance (shortest path through a graph where the edges are only

between nodes that are k-nearest neighbours) between each pair of points, and then fits a standard MDS

model (??) using gradient descent. Plot the results using 2 and using 3-nearest neighbours.

Note: when we say 2 nearest neighbours, we mean the two closest neigbours excluding the point itself. This

is the opposite convention from what we used in KNN at the start of the course.

The function utils.dijskstra can be used to compute the shortest (weighted) distance between two points in a

weighted graph. This function requires an n × n matrix giving the weights on each edge (use 0 as the weight

for absent edges). Note that ISOMAP uses an undirected graph, while the k-nearest neighbour graph might

be asymmetric. One of the usual heuristics to turn this into a undirected graph is to include an edge i to j if

i is a KNN of j or if j is a KNN of i. (Another possibility is to include an edge only if i and j are mutually

KNNs.)

4.2 Reflection

Rubric: {reasoning:2}

Briefly comment on PCA vs. MDS vs. ISOMAP for dimensionality reduction on this particular data set. In

your opinion, which method did the best job and why?

5 Very-Short Answer Questions

Rubric: {reasoning:10}

4

1. Why is the kernel trick often better than explicitly transforming your features into a new space?

2. Why is the kernel trick more popular for SVMs than with logistic regression?

3. What is the key advantage of stochastic gradient methods over gradient descent methods?

4. Does stochastic gradient descent with a fixed α converge to the minimum of a convex function in

general?

5. What is the difference between multi-label and multi-class classification?

6. What is the difference between MLE and MAP?

7. Linear regression with one feature and PCA with 2 features (and k = 1) both find a line in a twodimensional space. Do they find the same line? Briefly justify your answer.

8. Are the vectors minimizing the PCA objective unique? Briefly justify your answer

9. Name two methods for promoting sparse solutions in a linear regression model that result in convex

problems.

10. Can we use the normal equations to solve non-negative least squares problems?

5