## Description

COMS 4771-2 Homework 1

Instructions

Submit your write-up on Gradescope as a neatly typeset (not scanned nor handwritten) PDF document by

11:59 PM of the due date.

On Gradescope, be sure to select the pages containing your answer for each problem. More details can be

found on the Gradescope Student Workflow help page:

• https://gradescope.com/help#help-center-section-student-workflow

(If you don’t select the pages containing your answer to a problem, you’ll receive a zero for that problem.)

Make sure your name and your UNI appears prominently on the first page of your write-up.

Source code

Please combine all requested source code files into a single ZIP file1

, along with a plain text file called README

that contains your name and briefly describes all of the other files in the ZIP file. Do not include the data

files. Submit this ZIP file on Courseworks.

Clarity and precision

One of the goals in this class is for you to learn to reason about machine learning problems and algorithms.

To reason about these things, you must be able to make clear and precise claims and arguments about them.

A clear and precise argument is not the same as a long, excessively detailed argument. Unnecessary details

and irrelevant side-remarks often make an argument less clear. And non-factual statements also detract from

the clarity of an argument.

Points may be deducted for answers and arguments that lack sufficient clarity or precision. Moreover, a

time-economical attempt will be made to interpret such answers/arguments, and the grade you receive will

be based on this interpretation.

1See https://en.wikipedia.org/wiki/Zip_(file_format).

1

Problem 1 (30 points)

In this problem, you will implement and evaluate the nearest neighbor rule.

MNIST data set

Download the MNIST data set of handwritten digit images ocr.mat from the course website, and load it

into Python:

from scipy.io import loadmat

ocr = loadmat(‘ocr.mat’)

The 60000 unlabeled training data (i.e., feature vectors) are contained in a matrix called data (one data point

per row), and the corresponding labels are in a vector called labels. The 10000 test feature vectors and

labels are in, respectively, testdata and testlabels.

In Python, you can view an image (say, the first one) with the following commands:

import matplotlib.pyplot as plt

from matplotlib import cm

plt.imshow(ocr[‘data’][0].reshape((28,28)), cmap=cm.gray_r)

plt.show()

Nearest neighbor classifier

Let the training examples be denoted by (x1, y1), . . . ,(xn, yn) (for n = 60000), where each xi ∈ R

d and each

yi ∈ {0, . . . , 9}.

The nearest neighbor classifier ˆf : R

d → {0, . . . , 9} is a function defined in terms of the 60000 training

examples as follows. On input x ∈ R

d

, let i(x) := arg mini∈{1,…,n} kx − xik2, breaking ties arbitrarily (i.e.,

i(x) is any i ∈ {1, . . . , n} for which kx − xi(x)k2 ≤ mini∈{1,…,n} kx − xik2); and return yi(x)

.

2

Write a program in Python that implements the nearest neighbor classifier. Your program should take as

input a matrix of training feature vectors X and a vector of corresponding labels Y, as well as a matrix of test

feature vectors test. The output of the program should be a vector of the predicted labels preds for all test

points, as provided by the nearest neighbor classifier ˆf based on the examples in X and Y.

You should not use (or look at the source code for) any library functions for computing all pairwise distances

between entire sets of vectors, or for computing nearest neighbor queries—that would defeat the purpose of

this assignment. However, you can use functions for computing inner products between vectors and norms of

vectors, as well as other basic matrix and vector operations. If in doubt what is okay to use, just ask.

In order to make your code efficient and not take a long time to run, you should use matrix/vector operations

(rather than, say, a bunch of explicit for-loops). Think of the n training feature vectors as being stacked as

the rows of a matrix X ∈ R

n×d

, and the m test feature vectors as the rows of another matrix T ∈ R

m×d

. If

the i-th row of T is t

i

and the j-th row of X is x

j

, then the squared Euclidean distance between ti and xj is

kti − xjk

2

2 = (ti − xj )

(ti − xj ) = t

i

ti − 2t

i xj + x

j xj .

Can you leverage this identity to speed-up computation of the nearest neighbor classifier?3

2The expression “arg mina∈A f(a)” denotes the set of minimizers of the function f among all elements in the set A. On

occasion, as we do here, we’ll be a little sloppy and write expressions of the form “aˆ := arg mina∈A f(a)”. This will always means

that aˆ is some minimizer of f within A, rather than the set of minimizers. For this to really make sense, though, it must be

made clear (either explicitly or implicitly) what we mean in the case that there are multiple minimizers (or if there are none!).

3

In order to take advantage of matrix/vector operations, your Python installation needs to be using optimized linear algebra

libraries, such as ATLAS, MKL, or the Accelerate/vecLib framework on OS X. If you installed Python using Anaconda, then

you should already have MKL.

2

Evaluating the nearest neighbor classifier

Instead of using your nearest neighbor program with data and labels as the training data, do the following.

For each value n ∈ {1000, 2000, 4000, 8000}:

1. Draw n points from data, together with their corresponding labels, uniformly at random. Use sel =

random.sample(range(60000,n) (after import random), ocr[‘data’][sel].astype(‘float’), and

ocr[‘labels’][sel] to select the examples.4

2. Use these n points as the training data and testdata as the test points; compute the fraction of test

examples on which the nearest neighbor classifier predicts the label incorrectly (i.e., the test error rate).

A plot of the test error rate (on the y-axis) as a function of n (on the x-axis) is called a learning curve.

Carry out the (random) process described above ten times, independently. Produce a learning curve plot

using the average of these test error rates (that is, averaging over ten repetitions). Add error bars to your plot

that extend to one standard deviation above and below the means. Ensure the plot axes are properly labeled.

What to submit in your write-up:

(a) A brief description of how you implemented the nearest neighbor classifier (especially the “arg min”).

Note that it is fine to use numpy.argmin, but you should explain precisely how you use it in your code

to efficiently implement the nearest neighbor classification of new (test) points.

(b) Learning curve plot. (Please include the plot directly in the PDF write-up.)

(c) What would the learning curve look like if, instead of test error rate, you computed training error rate?

Please submit your source code on Courseworks.

4The data are stored as unsigned integers, so you need to convert them to floating point or else you may get integer overflows.

3

Problem 2 (35 points)

In this problem, you will practice deriving formulae for maximum likelihood estimators.

Consider a statistical model for iid random variables X1, . . . , Xn, parameterized by θ ∈ Θ. The distribution

Pθ of X1 is specified in each part below.

In each of the first two cases, derive a simple formula for the MLE for θ (given data (X1, . . . , Xn) =

(x1, . . . , xn)), and briefly justify each step of the derivation.

(a) X1 ∼ Pθ has probability density function fθ satisfying

fθ(x) ∝ x

2

e

−x/θ for all x 0,

and fθ(x) = 0 for all x ≤ 0.

5 The parameter space is the positive reals Θ = {θ ∈ R : θ 0}.

(b) X1 ∼ Pθ has probability density function fθ satisfying

fθ(x) ∝ 1 for all 0 ≤ x ≤ θ,

and fθ(x) = 0 for all x < 0 and all x θ. The parameter space is the positive reals Θ = {θ ∈ R : θ 0}.

In each of the next two cases, specific values of the data are given as follows:

(X1, . . . , Xn) = (0, 1, 1, 0, 2, 2, 1, 2, 1, 0, 2, 0, 2, 2, 0, 3, 0, 0, 0, 2).

Derive the actual MLE value of θ given this data (as opposed to an abstract formula), and briefly justify

each step of the derivation.

(c) X1 ∼ Pθ has probability density function fθ satisfying

fθ(x) ∝ e

−(x−θ)

2/(2θ

2

)

for all x ∈ R.

The parameter space is Θ = {θ ∈ R : θ 6= 0} (i.e., all real numbers θ such that θ 6= 0).

(d) X1 ∼ Pθ has probability mass function pθ satisfying

pθ(0) = 1

2

, pθ(1) = e

θ

, pθ(2) = 2e

θ

, pθ(3) = 1

2

− 3e

θ

.

The parameter space is Θ = {θ ∈ R : θ < − ln(6)}.

5The symbol “∝” means “proportional to”. Remember, probability density functions should integrate to one.

4

Problem 3 (35 points)

In this problem, you will implement a learning algorithm based on generative models for classification, apply

the learning algorithm to a simple data set, and quantitatively and qualitatively study the learned classifiers.

Download the “20 Newsgroups data set” news.mat from Courseworks. The training feature vectors/labels

and test feature vectors/labels are stored as data/labels and testdata/testlabels. Each data point

corresponds to a message posted to one of 20 different newsgroups (i.e., message boards). The representation

of a message is a (sparse) binary vector in X := {0, 1}

d

(for d := 61188) that indicates the words that are

present in the message. If the j-th entry in the vector is 1, it means the message contains the word that is

given on the j-th line of the text file news.vocab. The class labels are Y := {1, 2, . . . , 20}, where the mapping

from classes to newsgroups is in the file news.groups (which we won’t actually need).

In this problem, you’ll develop a classifier based on a Naive Bayes generative model. Here, we use class

conditional distributions with probability mass functions of the form pµ(x) = Qd

j=1 µ

xj

j

(1 − µj )

1−xj

for

x = (x1, x2, . . . , xd) ∈ X . The parameter vector is µ = (µ1, µ2, . . . , µd) must come from the parameter

space [0, 1]d

. Since there are 20 classes, the generative model is actually parameterized by 20 such vectors,

µy = (µy,1, µy,2, . . . , µy,d) for each y ∈ Y, as well as the class prior parameters, πy for each y ∈ Y. The class

prior parameters, of course, must satisfy πy ∈ [0, 1] for each y ∈ Y and P

y∈Y πy = 1.

(a) Give the formula for the MLE of the parameter µy,j based on training data {(xi

, yi)}

n

i=1. (Remember,

each unlabeled point is a vector: xi = (xi,1, xi,2, . . . , xi,d) ∈ {0, 1}

d

.)

(b) MLE is not a good estimator for the class conditional parameters if the estimate turns out to be zero or

one. An alternative is the following estimator based on a technique called Laplace smoothing:

µˆy,j := ?

1 +Xn

i=1

1{yi = y}xi,j?

/

?

2 +Xn

i=1

1{yi = y}

?

∈ (0, 1).

Write code for training and testing a classifier based on the Naive Bayes generative model described

above. Use Laplace smoothing to estimate class conditional distribution parameters, and MLE for class

prior parameters. You should not use or look at any existing implementation (e.g., those that may be

provided as library functions). Using your code, train and test a classifier with the data from news.mat.

What to submit: training and test error rates.

(c) Consider the binary classification problem, where newsgroups {1, 16, 20} comprise the “negative class”

(class 0), and newsgroups {17, 18, 19} comprise the “positive class” (class 1). Newsgroups {1, 16, 20}

are “religious” topics, and newsgroups {17, 18, 19} are “political” topics. Modify the data in news.mat

to create the training and test data sets for this problem. Using these data and your codes from part

(b), train and test a Naive Bayes classifier. Report the training and test error rates. Save the learned

classifier for part (d).

What to submit: training and test error rates.

(d) The classifier you learn is ultimately an affine classifier, which means it has the following form:

x 7→

(

0 if α0 +

Pd

j=1 αjxj ≤ 0

1 if α0 +

Pd

j=1 αjxj 0

for some real numbers α0, α1, . . . , αd. Determine the values of these αj ’s for your learned classifier from

part (c). Then, report the vocabulary words whose indices j ∈ {1, 2, . . . , d} correspond to the 20 largest

(i.e., most positive) αj value, and also the vocabulary words whose indices j ∈ {1, 2, . . . , d} correspond

to the 20 smallest (i.e., most negative) αj value. Don’t report the indices j’s, but rather the actual

vocabulary words (from news.vocab).

What to submit: two ordered lists (clearly labeled) of 20 words each.

Please also submit your source code on Courseworks.

5