Sale!

Introduction to Machine Learning HW3

$30.00

CS 189 Introduction to Machine Learning
HW3
This homework consists of coding assignments and math problems.
Begin early; you can submit models to Kaggle only twice a day!
DELIVERABLES:
1. Submit your predictions for the test sets to Kaggle as early as possible. Include your Kaggle
scores in your write-up. The Kaggle competition for this assignment can be found at
• MNIST: https://www.kaggle.com/c/spring21-cs189-hw3-mnist
• SPAM: https://www.kaggle.com/c/spring21-cs189-hw3-spam
2. Write-up: Submit your solution in PDF format to “Homework 3 Write-Up” in Gradescope.

Category:

Description

5/5 - (3 votes)

CS 189 Introduction to Machine Learning
HW3
This homework consists of coding assignments and math problems.
Begin early; you can submit models to Kaggle only twice a day!
DELIVERABLES:
1. Submit your predictions for the test sets to Kaggle as early as possible. Include your Kaggle
scores in your write-up. The Kaggle competition for this assignment can be found at
• MNIST: https://www.kaggle.com/c/spring21-cs189-hw3-mnist
• SPAM: https://www.kaggle.com/c/spring21-cs189-hw3-spam
2. Write-up: Submit your solution in PDF format to “Homework 3 Write-Up” in Gradescope.
• On the first page of your write-up, please list students with whom you collaborated
• Start each question on a new page. If there are graphs, include those graphs on the
same pages as the question write-up. DO NOT put them in an appendix. We need each
solution to be self-contained on pages of its own.
• Only PDF uploads to Gradescope will be accepted. You are encouraged use LATEX
or Word to typeset your solution. You may also scan a neatly handwritten solution to
produce the PDF.
• Replicate all your code in an appendix. Begin code for each coding question in a fresh
page. Do not put code from multiple questions in the same page. When you upload this
PDF on Gradescope, make sure that you assign the relevant pages of your code from
appendix to correct questions.
• While collaboration is encouraged, everything in your solution must be your (and only
your) creation. Copying the answers or code of another student is strictly forbidden.
Furthermore, all external material (i.e., anything outside lectures and assigned readings,
including figures and pictures) should be cited properly. We wish to remind you that
consequences of academic misconduct are particularly severe!
3. Code: Submit your code as a .zip file to “Homework 3 Code”.
• Set a seed for all pseudo-random numbers generated in your code. This ensures
your results are replicated when readers run your code.
• Include a README with your name, student ID, the values of random seed (above) you
used, and any instructions for compilation.
• Do NOT provide any data files. Supply instructions on how to add data to your code.
• Code requiring exorbitant memory or execution time might not be considered.
HW3, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 1
• Code submitted here must match that in the PDF Write-up. The Kaggle score will not be
accepted if the code provided a) does not compile or b) compiles but does not produce
the file submitted to Kaggle.
4. The assignment covers concepts on Gaussian distributions and classifiers. Some of the material may not have been covered in lecture; you are responsible for finding resources to
understand it.
1 Honor Code
Declare and sign the following statement (Mac Preview and FoxIt PDF Reader, among others, have
tools to let you sign a PDF file):
“I certify that all solutions are entirely my own and that I have not looked at anyone else’s solution.
I have given credit to all external sources I consulted.”
Signature:
2 Gaussian Classification
Let f(x | Ci) ∼ N(µi
, σ2
) for a two-class, one-dimensional classification problem with classes C1
and C2, P(C1) = P(C2) = 1/2, and µ2 > µ1.
1. Find the Bayes optimal decision boundary and the corresponding Bayes decision rule by
finding the point(s) at which the posterior probabilities are equal.
2. Suppose the decision boundary for your classifier is x = b. The Bayes error is the probability
of misclassification, namely,
Pe = P((C1 misclassified as C2) ∪ (C2 misclassified as C1)).
Show that the Bayes error associated with this decision rule, in terms of b, is
Pe(b) =
1
2

2πσ


Z b
−∞
exp

−
(x − µ2)
2
2σ2


dx +
Z ∞
b
exp

−
(x − µ1)
2
2σ2


dx


.
3. Using the expression above for the Bayes error, calculate the optimal decision boundary b

that minimizes Pe(b). How does this value compare to that found in part 1? Hint: Pe(b) is
convex for µ1 < b < µ2.
3 Isocontours of Normal Distributions
Let f(µ, Σ) be the probability density function of a normally distributed random variable in R
2
.
Write code to plot the isocontours of the following functions, each on its own separate figure.
Make sure it is clear which figure belongs to which part. You’re free to use any plotting libraries
or stats utilities available in your programming language; for instance, in Python you can use
Matplotlib and SciPy.
HW3, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 2
1. f(µ, Σ), where µ =


1
1


and Σ =


1 0
0 2

.
2. f(µ, Σ), where µ =


−1
2


and Σ =


2 1
1 4

.
3. f(µ1, Σ1) − f(µ2, Σ2), where µ1 =


0
2

, µ2 =


2
0


and Σ1 = Σ2 =


2 1
1 1

.
4. f(µ1, Σ1) − f(µ2, Σ2), where µ1 =


0
2

, µ2 =


2
0

, Σ1 =


2 1
1 1


and Σ2 =


2 1
1 4

.
5. f(µ1, Σ1) − f(µ2, Σ2), where µ1 =


1
1

, µ2 =


−1
−1

, Σ1 =


2 0
0 1


and Σ2 =


2 1
1 2

.
4 Eigenvectors of the Gaussian Covariance Matrix
Consider two one-dimensional random variables X1 ∼ N(3, 9) and X2 ∼
1
2
X1 + N(4, 4), where
N(µ, σ2
) is a Gaussian distribution with mean µ and variance σ
2
. Write a program that draws
n = 100 random two-dimensional sample points from (X1, X2) such that the ith value sampled from
X2 is calculated based on the ith value sampled from X1. In your code, make sure to choose and set
a fixed random number seed for whatever random number generator you use, so your simulation is
reproducible, and document your choice of random number seed and random number generator in
your write-up. For each of the following parts, include the corresponding output of your program.
(a) Compute the mean (in R
2
) of the sample.
(b) Compute the 2 × 2 covariance matrix of the sample.
(c) Compute the eigenvectors and eigenvalues of this covariance matrix.
(d) On a two-dimensional grid with a horizonal axis for X1 with range [−15, 15] and a vertical
axis for X2 with range [−15, 15], plot
(i) all n = 100 data points, and
(ii) arrows representing both covariance eigenvectors. The eigenvector arrows should originate at the mean and have magnitudes equal to their corresponding eigenvalues.
(e) Let U = [v1 v2] be a 2 × 2 matrix whose columns are the eigenvectors of the covariance
matrix, where v1 is the eigenvector with the larger eigenvalue. We use U
>
as a rotation
matrix to rotate each sample point from the (X1, X2) coordinate system to a coordinate system
aligned with the eigenvectors. (As U
> = U
−1
, the matrix U reverses this rotation, moving
back from the eigenvector coordinate system to the original coordinate system). Center your
sample points by subtracting the mean µ from each point; then rotate each point by U
>
,
giving xrotated = U
>
(x − µ). Plot these rotated points on a new two dimensional-grid, again
with both axes having range [−15, 15].
HW3, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 3
In your plots, clearly label the axes and include a title. Moreover, make sure the horizontal
and vertical axis have the same scale! The aspect ratio should be one.
HW3, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 4
5 Classification and Risk
Suppose we have a classification problem with classes labeled 1, . . . , c and an additional “doubt”
category labeled c + 1. Let r : R
d → {1, . . . , c + 1} be a decision rule. Define the loss function
L(r(x) = i, y = j) =



0 if i = j i, j ∈ {1, . . . , c},
λr
if i = c + 1,
λs otherwise,
where λr ≥ 0 is the loss incurred for choosing doubt and λs ≥ 0 is the loss incurred for making a
misclassification. Hence the risk of classifying a new data point x as class i ∈ {1, 2, . . . , c + 1} is
R(r(x) = i|x) =
Xc
j=1
L(r(x) = i, y = j) P(Y = j|x).
1. Show that the following policy obtains the minimum risk when λr ≤ λs
.
(a) Choose class i if P(Y = i|x) ≥ P(Y = j|x) for all j and P(Y = i|x) ≥ 1 − λr/λs
;
(b) Choose doubt otherwise.
2. What happens if λr = 0? What happens if λr > λs? Explain why this is consistent with what
one would expect intuitively.
6 Maximum Likelihood Estimation and Bias
Let X1, . . . , Xn ∈ R be n sample points drawn independently from normal distributions such that
Xi ∼ N(µ, σ2
i
), where σi = σ/ √
i for some parameter σ. (Every sample point comes from a
distribution with a different variance.)
(a) Derive the maximum likelihood estimates, denoted ˆµ and ˆσ, for the mean µ and the parameter
σ. You may write an expression for ˆσ
2
rather than ˆσ if you wish—it’s probably simpler that
way. Show all your work.
(b) Given the true value of a statistic θ and an estimator θˆ of that statistic, we define the bias of
the estimator to be the the expected difference from the true value. That is,
bias(θˆ) = E[θˆ] − θ.
We say that an estimator is unbiased if its bias is 0.
Either prove or disprove the following statement: The MLE sample estimator µˆ is unbiased.
Hint: Neither the true µ nor true σ
2 are known when estimating sample statistics, thus we
need to plug in appropriate estimators.
1. Either prove or disprove the following statement: The MLE sample estimator σˆ2
is unbiased.
Hint: Neither the true µ nor true σ
2 are known when estimating sample statistics, thus we
need to plug in appropriate estimators.
HW3, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 5
7 Covariance Matrices and Decompositions
As described in lecture, the covariance matrix Var(R) ∈ R
d×d
for a random variable R ∈ R
d with
mean µ is
Var(R) = Cov(R, R) = E[(R − µ) (R − µ)
>
] =


Var(R1) Cov(R1, R2) . . . Cov(R1, Rd)
Cov(R2, R1) Var(R2) Cov(R2, Rd)
.
.
.
.
.
.
.
.
.
Cov(Rd, R1) Cov(Rd, R2) . . . Var(Rd)


,
where Cov(Ri
, Rj) = E[(Ri − µi) (Rj − µj)] and Var(Ri) = Cov(Ri
, Ri).
If the random variable R is sampled from the multivariate normal distribution N(µ, Σ) with the
PDF
f(x) =
1
p
(2π)
d
|Σ|
e
−((x−µ)

−1
(x−µ))/2
,
then Var(R) = Σ.
Given n points X1, X2, . . . , Xn sampled from N(µ, Σ), we can estimate Σ with the maximum likelihood estimator
Σ =ˆ
1
n
Xn
i=1
(Xi − µ) (Xi − µ)
>
,
which is also known as the covariance matrix of the sample.
(a) The estimate Σˆ makes sense as an approximation of Σ only if Σˆ is invertible. Under what
circumstances is Σˆ not invertible? Make sure your answer is complete; i.e., it includes all
cases in which the covariance matrix of the sample is singular. Express your answer in terms
of the geometric arrangement of the sample points Xi
.
(b) Suggest a way to fix a singular covariance matrix estimator Σˆ by replacing it with a similar but
invertible matrix. Your suggestion may be a kludge, but it should not change the covariance
matrix too much. Note that infinitesimal numbers do not exist; if your solution uses a very
small number, explain how to calculate a number that is sufficiently small for your purposes.
(c) Consider the normal distribution N(0, Σ) with mean µ = 0. Consider all vectors of length
1; i.e., any vector x for which kxk = 1. Which vector(s) x of length 1 maximizes the PDF
f(x)? Which vector(s) x of length 1 minimizes f(x)? Your answers should depend on the
properties of Σ. Explain your answer.
HW3, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 6
8 Gaussian Classifiers for Digits and Spam
In this problem, you will build classifiers based on Gaussian discriminant analysis. Unlike Homework 1, you are NOT allowed to use any libraries for out-of-the-box classification (e.g. sklearn).
You may use anything in numpy and scipy.
The training and test data can be found with this homework. Don’t use the training/test data from
Homework 1, as they have changed for this homework. You can verify that your data files and
python environment are setup properly by running sanity.py. Submit your predicted class labels
for the test data on the Kaggle competition website and be sure to include your Kaggle display
name and scores in your writeup. Also be sure to include an appendix of your code at the end of
your writeup.
1. Taking pixel values as features (no new features yet, please), fit a Gaussian distribution to
each digit class using maximum likelihood estimation. This involves computing a mean and
a covariance matrix for each digit class, as discussed in lecture.
Hint: You may, and probably should, contrast-normalize the images before using their pixel
values. One way to normalize is to divide the pixel values of an image by the l2-norm of its
pixel values.
2. (Written answer + graph) Visualize the covariance matrix for a particular class (digit). How
do the diagonal terms compare with the off-diagonal terms? What do you conclude from
this?
3. Classify the digits in the test set on the basis of posterior probabilities with two different
approaches. Feel free to either use the starter code provided in starter.py or write your
own implementation from scratch
(a) (Graphs) Linear discriminant analysis (LDA). Model the class conditional probabilities
as Gaussians N(µC, Σ) with different means µC (for class C) and the same covariance
matrix Σ, which you compute by averaging the 10 covariance matrices from the 10
classes.
To implement LDA, you will sometimes need to compute a matrix-vector product of
the form Σ
−1
x for some vector x. You should not compute the inverse of Σ (nor the
determinant of Σ) as it is not guaranteed to be invertable. Instead, you should find a
way to solve the implied linear system without computing the inverse. Hint: How do
we solve OLS when the data matrix is singular?
Hold out 10,000 randomly chosen training points for a validation set (You may reuse your homework 1 solution or an out-of-the-box library for dataset splitting only).
Classify each image in the validation set into one of the 10 classes. Compute the error
rate (1−
# points correctly classified
# total points ) on the validation set and plot it over the following numbers
of randomly chosen training points: 100, 200, 500, 1,000, 2,000, 5,000, 10,000, 30,000,
50,000. (Expect some variation in your error rate when few training points are used.)
(b) (Graphs) Quadratic discriminant analysis (QDA). Model the class conditional probabilities as Gaussians N(µC, ΣC), where ΣC is the estimated covariance matrix for class C.
HW3, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 7
(If any of these covariance matrices turn out singular, implement the trick you described
in Q7(b). You are welcome to use k-fold cross validation to choose the right constant(s)
for that trick.) Repeat the same tests and error rate calculations you did for LDA.
(c) (Written answer) Which of LDA and QDA performed better? Why?
(d) (Written answer + graph) Include a plot of validation error versus the number of training
points for each digit. Plot all the 10 curves on the same graph as shown in Figure 1.
Which digit is easiest to classify? Include written answers where indicated.
Figure 1: Sample graph with 10 plots
4. Using the mnist data.mat, train your best classifier for the training data and classify
the images in the test data. Submit your labels to the online Kaggle competition. Record
your optimum prediction rate in your submission. You are welcome to compute extra features
for the Kaggle competition, as long as they do not use an exterior learned model for their
computation (no transer learning!). If you do so, please describe your implementation in
your assignment. Please use extra features only for the Kaggle portion of the assignment.
5. Next, apply LDA or QDA (your choice) to spam. Submit your test results to the online
Kaggle competition. Record your optimum prediction rate in your submission. If you use
additional features (or omit features), please describe them.
Optional: If you use the defaults, expect relatively low classification rates. We suggest using
a Bag-Of-Words model. You are encouraged to explore alternative hand-crafted features, and
are welcome to use any third-party library to implement them, as long as they do not use a
separate model for their computation (no word-2-vec!).
HW3, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 8