## Description

CPSC 340 Assignment 2

Instructions

Rubric: {mechanics:5}

The above points are allocated for compliance with the CPSC 340 homework submission instructions:

https://github.ugrad.cs.ubc.ca/CPSC340-2017W-T2/home/blob/master/homework_instructions.md

NOTE: for this assignment you’ll need to separately download the data from the home repo. It can’t be

delivered in the normal way due to a limitation of the way we’re using GitHub.

1 Naive Bayes

In this section we’ll implement naive Bayes, a very fast classification method that is often surprisingly

accurate for text data with simple representations like bag of words.

1.1 Naive Bayes by Hand

Rubric: {reasoning:3}

Consider the dataset below, which has 10 training examples and 2 features:

X =

2

6

6

6

6

6

6

6

6

6

6

6

6

6

6

4

0 1

1 1

0 0

1 1

1 1

0 0

1 0

1 0

1 1

1 0

3

7

7

7

7

7

7

7

7

7

7

7

7

7

7

5

, y =

2

6

6

6

6

6

6

6

6

6

6

6

6

6

6

4

1

1

1

1

1

1

0

0

0

0

3

7

7

7

7

7

7

7

7

7

7

7

7

7

7

5

.

Suppose you believe that a naive Bayes model would be appropriate for this dataset, and you want to classify

the following test example:

xˆ = ⇥

1 0⇤

.

(a) Compute the estimates of the class prior probabilities (you don’t need to show any work):

• p(y = 1).

• p(y = 0).

(b) Compute the estimates of the 4 conditional probabilities required by naive Bayes for this example (you

don’t need to show any work):

1

• p(x1 = 1|y = 1).

• p(x2 = 0|y = 1).

• p(x1 = 1|y = 0).

• p(x2 = 0|y = 0).

(c) Under the naive Bayes model and your estimates of the above probabilities, what is the most likely label

for the test example? (Show your work.)

1.2 Bag of Words

Rubric: {reasoning:3}

If you run python main.py -q 1.2, it will load the following dataset:

1. X: A sparse binary matrix. Each row corresponds to a newsgroup post, and each column corresponds

to whether a particular word was used in the post. A value of 1 means that the word occured in the

post.

2. wordlist: The set of words that correspond to each column.

3. y: A vector with values 1 through 4, with the value corresponding to the newsgroup that the post

came from.

4. groupnames: The names of the four newsgroups.

5. Xvalidate and yvalidate: the word lists and newsgroup labels for additional newsgroup posts.

Answer the following:

1. Which word corresponds to column 50 of X?

2. Which words are present in training example 500?

3. Which newsgroup name does training example 500 come from?

1.3 Naive Bayes Implementation

Rubric: {code:5}

If you run python main.py -q 1.3 it will load the newsgroups dataset and report the test error for a

random forest, and also fit the basic naive Bayes model and report the test error.

The predict() function of the naive Bayes classifier is already implemented. However, in fit() the calculation of the variable p xy is incorrect (right now, it just sets all values to 1/2). Modify this function so that

p xy correctly computes the conditional probabilities of these values based on the frequencies in the data

set. Hand in your code and the validation error that you obtain. Also, briefly comment on the accuracy as

compared to the random forest and scikit-learn’s naive Bayes implementation.

1.4 Laplace smoothing

Rubric: {code:1}

Do the following:

1. Modify your code so that it uses Laplace smoothing, with as a parameter taken in by the constructor.

2

2. Did you need to modify your fit function, predict function, or both?

3. Take a look at the documentation for the scikit-learn version of naive Bayes the code is using. How

much Laplace smoothing does it use by default? Using the same amount of smoothing with your code,

do you get the same results?

1.5 Runtime of Naive Bayes for Discrete Data

Rubric: {reasoning:3}

Assume you have the following setup:

• The training set has n objects each with d features.

• The test set has t objects with d features.

• Each feature can have up to c discrete values (you can assume c n).

• There are k class labels (you can assume k n)

You can implement the training phase of a naive Bayes classifier in this setup in O(nd), since you only need

to do a constant amount of work for each X(i, j) value. (You do not have to actually implement it in this

way for the previous question, but you should think about how this could be done). What is the cost of

classifying t test examples with the model?

2 Random Forests

2.1 Implementation

Rubric: {reasoning:7}

The file vowels.pkl contains a supervised learning dataset where we are trying to predict which of the 11

“steady-state” English vowels that a speaker is trying to pronounce.

You are provided with a RandomStump class that di↵ers from DecisionStump in two ways: it uses the

information gain splitting criterion (instead of classification accuracy), and it only considers b

p

dc randomlychosen features.1 You are also provided with a RandomTree class that is exactly the same as DecisionTree

except that it uses RandomStump instead of DecisionStump and it takes a bootstrap sample of the data

before fitting. In other words, RandomTree is the entity we discussed in class, which makes up a random

forest.

If you run python main.py -q 2 it will fit a deep DecisionTree using the information gain splitting criterion. You will notice that the model overfits badly.

1. Why doesn’t the random tree model have a training error of 0?

2. Create a class RandomForest in a file called random_forest.py that takes in hyperparameters num_trees

and max_depth and fits num_trees random trees each with maximum depth max_depth. For prediction,

have all trees predict and then take the mode.

3. Using 50 trees, and a max depth of 1, report the training and testing error. Compare this to what we

got with a single DecisionTree and with a single RandomTree. Are the results what you expected?

Discuss.

1The notation bxc means the “floor” of x, or “x rounded down”. You can compute this with np.floor(x) or math.floor(x).

3

4. Compare your implementation with scikit-learn’s RandomForestClassifier for both speed and accuracy, and briefly discuss. You can use all default hyperparameters if you wish, or you can try changing

them.

2.2 Very-Short Answer Questions

Rubric: {reasoning:3}

1. What is a a disadvantage of using a very-large number of trees in a random forest classifier?

2. Your random forest classifier has a training error of 0 and a very high test error. Which ones of the

following could help performance?

(a) Increase the maximum depth of the trees in your forest.

(b) Decrease the maximum depth of the trees in your forest.

(c) Increase the amout of data you consider for each tree (Collect more data and use 2n objects

instead of n).

(d) Decrease the amount of data you consider for each tree (Use 0.8n objects instead of n).

(e) Increase the number of features you consider for each tree.

(f) Decrease the number of features you consider for each tree.

3. Suppose that you were training on raw audio segments and trying to recognize vowel sounds. What

could you do to encourage the final classifier to be invariant to translation?

3 Clustering

If you run python main.py -q 3, it will load a dataset with two features and a very obvious clustering

structure. It will then apply the k-means algorithm with a random initialization. The result of applying the

algorithm will thus depend on the randomization, but a typical run might look like this:

(Note that the colours are arbitrary – this is the label switching issue.) But the ‘correct’ clustering (that

was used to make the data) is this:

4

3.1 Selecting among k-means Initializations

Rubric: {reasoning:5}

If you run the demo several times, it will find di↵erent clusterings. To select among clusterings for a fixed

value of k, one strategy is to minimize the sum of squared distances between examples xi and their means

wyi ,

f(w1, w2,…,wk, y1, y2,…,yn) = Xn

i=1

kxi wyi k2

2 = Xn

i=1

X

d

j=1

(xij wyij )

2.

where yi is the index of the closest mean to xi. This is a natural criterion because the steps of k-means

alternately optimize this objective function in terms of the wc and the yi values.

1. In the kmeans.py file, add a new function called error that takes the same input as the predict function

but that returns the value of this above objective function. Hand in your code.

2. What trend do you observe if you print the value of error after each iteration of the k-means algorithm?

3. Using plot 2dclustering, output the clustering obtained by running k-means 50 times (with k = 4)

and taking the one with the lowest error.

4. Looking at the hyperparameters of scikit-learn’s KMeans, explain the first four (n clusters, init,

n init, max iter) very briefly.

3.2 Selecting k in k-means

Rubric: {reasoning:5}

We now turn to the task of choosing the number of clusters k.

1. Explain why the error function should not be used to choose k.

2. Explain why even evaluating the error function on test data still wouldn’t be a suitable approach to

choosing k.

3. Hand in a plot of the minimum error found across 50 random initializations, as a function of k, taking

k from 1 to 10.

5

4. The elbow method for choosing k consists of looking at the above plot and visually trying to choose

the k that makes the sharpest “elbow” (the biggest change in slope). What values of k might be

reasonable according to this method? Note: there is not a single correct answer here; it is somewhat

open to interpretation and there is a range of reasonable answers.

3.3 k-medians

Rubric: {reasoning:5}

The data in clusterData2.pkl is the exact same as the above data, except it has 4 outliers that are very far

away from the data.

1. Using the plot 2dclustering function, output the clustering obtained by running k-means 50 times

(with k = 4) and taking the one with the lowest error. Are you satisfied with the result?

2. What values of k might be chosen by the elbow method for this dataset?

3. Implement the k-medians algorithm, which assigns examples to the nearest wc in the L1-norm and to

updates the wc by setting them to the “median” of the points assigned to the cluster (we define the

d-dimensional median as the concatenation of the median of the points along each dimension). Hand

in your code and plot obtained with 50 random initializations for k = 4.

4. Using the L1-norm version of the error (where yi now represents the closest median in the L1-norm),

f(w1, w2,…,wk, y1, y2,…,yn) = Xn

i=1

kxi wyi k1 = Xn

i=1

X

d

j=1

|xij wyij |,

what value of k would be chosen by the elbow method under this strategy? Are you satisfied with this

result?

3.4 Density-Based Clustering

Rubric: {reasoning:2}

If you run python main.py -q 3.4, it will apply the basic density-based clustering algorithm to the dataset

from the previous part. The final output should look somewhat like this:

(The right plot is zoomed in to show the non-outlier part of the data.) Even though we know that each

object was generated from one of four clusters (and we have 4 outliers), the algorithm finds 6 clusters and

6

does not assign some of the original non-outlier objects to any cluster. However, the clusters will change if

we change the parameters of the algorithm. Find and report values for the two parameters, eps (which we

called the “radius” in class) and minPts, such that the density-based clustering method finds:

1. The 4 “true” clusters.

2. 3 clusters (merging the top two, which also seems like a reasonable interpretaition).

3. 2 clusters.

4. 1 cluster (consisting of the non-outlier points).

3.5 Very-Short Answer Questions

Rubric: {reasoning:3}

1. Does the standard k-means clustering algorithm always yield the optimal clustering solution for a given

k?

2. If your set out to minimize the distance between each point and its mean in a k-means clustering, what

value of k minimizes this cost? Is this value useful?

3. Describe a dataset with k clusters where k-means would not be able to find the true clusters.

4. Suppose that you had only two features and that they have very-di↵erent scales (like kilograms vs.

milligrams). How would this a↵ect the result of density-based clustering?

5. Name a key advantage and drawback of using a supervised outlier detection method rather than an

unsupervised method?

4 Vector Quantization

Rubric: {reasoning:6}

Discovering object groups is one motivation for clustering. Another motivation is vector quantization, where

we find a prototype point for each cluster and replace points in the cluster by their prototype. If our inputs

are images, vector quantization gives us a rudimentary image compression algorithm.

Your task is to implement image quantization in quantize image.py with quantize and dequantize functions.

The quantize function should take in an image and, using the pixels as examples and the 3 colour channels

as features, run k-means clustering on the data with 2b clusters for some hyperparameter b. The code should

store the cluster means and return the cluster assignments. The dequantize function should return a version

of the image (the same size as the original) where each pixel’s orignal colour is replaced with the nearest

prototype colour.

To understand why this is compression, consider the original image space. Say the image can take on the

values 0, 1,…, 254, 255 in each colour channel. Since 28 = 256 this means we need 8 bits to represent each

colour channel, for a total of 24 bits per pixel. Using our method, we are restricting each pixel to only take

on one of 2b colour values. In other words, we are compressing each pixel from a 24-bit colour representation

to a b-bit colour representation by picking the 2b prototype colours that are “most representative” given the

content of the image. So, for example, if b = 6 then we have 4x compression.

The loaded image contains a 3D-array representing the RGB values of a picture. Implement the quantize

and dequantize functions and show the image obtained if you encode the colours using 1, 2, 4, and 6 bits

with the provided image. You are welcome to use the provided implementation of k-means or the scikit-learn

version.

7

1. Hand in your quantizeImage and deQuantizeImage functions.

2. Show the image obtained if you encode the colours using 1, 2, 4, and 6 bits per pixel (instead of the

original 24-bits).

3. Briefly comment on the prototype colours learned in case each, which are saved by the code.

8