## Description

CS 189 Introduction to Machine Learning

HW1

This homework is comprised of a set of coding exercises and a few math problems. While we have

you train models across three datasets, the code for this entire assignment can be written in under

250 lines.

Start this homework early! You can submit to Kaggle only twice a day.

Deliverables:

1. Submit your predictions for the test sets to Kaggle as early as possible. Include your Kaggle

scores in your write-up (see below).

2. Submit a PDF of your homework, with an appendix listing all your code, to the Gradescope

assignment entitled “HW1 Write-Up”. You may typeset your homework in LaTeX or Word

or submit neatly handwritten and scanned solutions. Please start each question on a new

page. If there are graphs, include those graphs in the correct sections. Do not put them in an

appendix. We need each solution to be self-contained on pages of its own.

• On the first page of your write-up, please list students who helped you or whom you

helped on the homework. (Note that sending each other code is not allowed.)

• On the first page of your write-up, please copy the following statement and sign your

signature next to it. (Mac Preview and FoxIt PDF Reader, among others, have tools to

let you sign a PDF file.) We want to make extra clear the consequences of cheating.

“I certify that all solutions are entirely in my own words and that I have not looked at

another student’s solutions. I have given credit to all external sources I consulted.”

3. Submit all the code needed to reproduce your results to the Gradescope assignment entitled

“HW1 Code”. You must submit your code twice: once in your PDF write-up (above) so

the readers can easily read it, and again in compilable/interpretable form so the readers can

easily run it. Do NOT include any data files we provided. Please include a short file named

README listing your name, student ID, and instructions on how to reproduce your results.

Please take care that your code doesn’t take up inordinate amounts of time or memory.

The Kaggle score will not be accepted if the code provided a) does not compile or b) compiles

but does not produce the file submitted to Kaggle.

HW1, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 1

Python Configuration and Data Loading

This section is only setup and has no requires no submitted solution. Please follow the instructions below to ensure your Python environment is configured properly, and you are able to

successfully load the data provided with this homework. For all coding questions, we recommend

using Anaconda for Python 3.

(a) Either install Anaconda for Python 3, or ensure you’re using Python 3. To ensure you’re running Python 3, open a terminal in your operating system and execute the following command:

python –version

Do not proceed until you’re running Python 3.

(b) Install the following dependencies required for this homework by executing the following command in your operating system’s terminal:

pip install scikit-learn scipy numpy matplotlib

Please use Python 3 with the modules specified above to complete this homework.

(c) You will be running out-of-the-box implementations of Support Vector Machines to classify

three datasets. You will find a set of .mat files in the data folder for this homework. Each

.mat file will load as a Python dictionary. Each dictionary contains three fields:

• training data, the training set features. Rows are sample points and columns are features.

• training labels, the training set labels. Rows are sample points. There is one column:

The labels corresponding to rows of training data above.

• test data, the test set features. Rows are sample points and columns are features. You will

fit a model to predict the labels for this test set, and submit those predictions to Kaggle.

The three datasets for the coding portion of this assignment are described below.

• mnist data.mat contains data from the MNIST dataset. There are 60,000 labeled digit

images for training, and 10,000 digit images for testing. The images are flattened grayscale,

28 ⇥ 28 pixels. There are 10 possible labels for each image, namely, the digits 0–9.

Figure 1: Examples from the MNIST dataset.

HW1, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 2

• spam data.mat contains featurized spam data. The labels are 1 for spam and 0 for ham.

The data folder includes the script featurize.py and the folders spam, ham (not spam),

and test (unlabeled test data); you may modify featurize.py to generate new features for

the spam data.

• cifar10 data.mat contains data from the CIFAR10 dataset. There are 50,000 labeled object images for training, and 10,000 object images for testing. The images are flattened to

3⇥32⇥32 (3 color channels). The labels 0–9 correspond alphabetically to the categories.

For example, 0 means airplane, 1 means automobile, 2 means bird, and so on.

Figure 2: Examples from the CIFAR-10 dataset.

To check whether your Python environment is configured properly for this homework, ensure

the following Python script executes without error. Pay attention to errors raised when attempting to import any dependencies. Resolve such errors by manually installing the required

dependency (e.g. execute pip install numpy for import errors relating to the numpy package).

import sys

if sys.version_info[0] < 3:

raise Exception(“Python 3 not detected.”)

import numpy as np

import matplotlib.pyplot as plt

from sklearn import svm

from scipy import io

for data_name in [“mnist”, “spam”, “cifar10”]:

data = io.loadmat(“data/%s_data.mat” % data_name)

print(“\nloaded %s data!” % data_name)

fields = “test_data”, “training_data”, “training_labels”

for field in fields:

print(field, data[field].shape)

HW1, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 3

For the entire assignment, you may only use sklearn for the SVM model and the accuracy

function. Everything else (generating plots) must be done without the use of sklearn.

1 Data Partitioning

Rarely will you receive “training” data and “validation” data; usually you will have to partition

available labeled data yourself. In this question, you will shu✏e and partition each of the datasets

in the assignment. Shu✏ing prior to splitting crucially ensures that all classes are represented in

your partitions.

(a) For the MNIST dataset, write code that sets aside 10,000 training images as a validation set.

(b) For the spam dataset, write code that sets aside 20% of the training data as a validation set.

(c) For the CIFAR-10 dataset, write code that sets aside 5,000 training images as a validation set.

Note: Make sure that you shu✏e the labels with the training images. It’s a very common error to

mislabel the training images by forgetting to permute the labels with the images!

You deliverable is to attach a copy of your data partitioning code to your homework report under

question 1.

2 Support Vector Machines: Coding

We will use linear Support Vector Machines to classify our datasets. For images, we will use the

simplest of features for classification: raw pixel brightness values. In other words, our feature

vector for an image will be a row vector with all the pixel values concatenated in a row major (or

column major) order.

There are several ways to evaluate models. We will use classification accuracy, or the percent

of examples classified correctly, as a measure of the classifier performance (see here: https://

scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).

Error rate, or one minus the accuracy is another common metric.

Train a linear Support Vector Machine (SVM) on all three datasets. For each dataset, plot the

accuracy on the training and validation sets versus the number of training examples that you used

to train your classifier. The number of training examples to use are listed for each dataset in the

following parts.

You may only use sklearn for the SVM model and the sklearn.metrics.accuracy_score

function. Everything else (generating plots) must be done without the use of sklearn.

(a) For the MNIST dataset, use raw pixels as features. Train your model with the following

numbers of training examples: 100, 200, 500, 1,000, 2,000, 5,000, 10,000. At this stage, you

should expect accuracies between 70% and 90%.

HW1, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 4

Hint: Be consistent with any preprocessing you do. Use either integer values between 0 and

255 or floating-point values between 0 and 1. Training on floats and then testing with integers

is bound to cause trouble.

(b) For the spam dataset, use the provided word frequencies as features. In other words, each

document is represented by a vector, where the i

th entry denotes the number of times word i (as

specified in featurize.py) is found in that document. Train your model with the following

numbers of training examples: 100, 200, 500, 1,000, 2,000, ALL.

Note that this dataset does not have 10,000 examples; use all of your examples instead of

10,000. At this stage, you should expect accuracies between 70% and 90%.

(c) For the CIFAR-10 dataset, use raw pixels as features. At this stage, you should expect accuracies between 25% and 35%. Be forewarned that training SVMs for CIFAR-10 takes a couple

minutes to run for a large training set locally. Train your model with the following numbers of

training examples: 100, 200, 500, 1,000, 2,000, 5,000.

Note: You can use either SVC(kernel=’linear’) or LinearSVC as your SVM model, though

they each solve slightly di↵erent optimization problems using di↵erent libraries. On MNIST,

LinearSVC was faster on one member of Course Sta↵’s laptop, though the exact results will likely

depend on your computer, the parameters of the algorithm, and the data (num data pts vs num

features).

Your deliverable for this part should be three plots showing number of examples versus training

and validation accuracy for each of the datasets. Additionally, be sure to include your code in the

“Code Appendix” portion of your write-up.

3 Hyperparameter Tuning

In the previous problem, you learned parameters for a model that classifies the data. Many classifiers also have hyperparameters that you can tune to influence the parameters. In this problem,

we’ll determine good values for the regularization parameter C in the soft-margin SVM algorithm.

When we are trying to choose a hyperparameter value, we train the model repeatedly with di↵erent

hyperparameters. We select the hyperparameter that gives the model with the highest accuracy on

the validation dataset. Before generating predictions for the test set, the model should be retrained

using all the labeled data (including the validation data) and the previously-determined hyperparameter.

The use of automatic hyperparameter optimization libraries is prohibited for this part of the

homework.

Your deliverable is as follows. For the MNIST dataset, find the best C value. In your report, list at

least 8 C values you tried, the corresponding accuracies, and the best C value. As in the previous

problem, for performance reasons, you are required to train with up to 10,000 training examples

but not required to train with more than that. Again, reference any code you used to perform a

hyper-parameter sweep in the code appendix.

HW1, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 5

4 K-Fold Cross-Validation

For smaller datasets (e.g., the spam dataset), the validation set contains fewer examples, and our

estimate of our accuracy might not be accurate—the estimate has high variance. A way to combat

this is to use k-fold cross-validation.

In k-fold cross-validation, the training data is shu✏ed and partitioned into k disjoint sets. Then the

model is trained on k 1 sets and validated on the kth set. This process is repeated k times with

each set chosen as the validation set once. The cross-validation accuracy we report is the accuracy

averaged over the k iterations.

Use of automatic cross-validation libraries is prohibited for this part of the homework.

Your deliverable is as follows. For the spam dataset, use 5-fold cross-validation to find and report

the best C value. In your report, list at least 8 C values you tried, the corresponding accuracies,

and the best C value. Again, please include your code for cross validation or include a reference

to its location in your code appendix.

Hint: E↵ective cross-validation requires choosing from random partitions. This is best implemented by randomly shu✏ing your training examples and labels, then partitioning them by their

indices.

5 Kaggle

• MNIST Competition: https://www.kaggle.com/c/spring21-cs189-hw1-mnist

• SPAM Competition: https://www.kaggle.com/c/spring21-cs189-hw1-spam

• CIFAR-10 Competition: https://www.kaggle.com/c/spring21-cs189-hw1-cifar10

With the best model you trained for each dataset, generate predictions for the test sets we provide

and save those predictions to .csv files. Be sure to use integer labels (not floating-point!) and no

spaces (not even after the commas). Upload your predictions to the Kaggle leaderboards (submission instructions are provided within each Kaggle competition). In your write-up, include your

Kaggle name as it displays on the leaderboard and your Kaggle score for each of the three datasets.

General comments about the Kaggle sections of homeworks. Most or all of the coding homeworks will include a Kaggle section. Whereas other parts of a coding assignment might impose

strict limits about what methods you’re permitted to use, the Kaggle portions permit you to apply

your creativity and find clever ways to improve your leaderboard performance. The main restriction is that you cannot use an entirely di↵erent learning technique. For example, this is an SVM

homework, so you must use an SVM; you are not permitted to use a neural network or a decision

tree instead of (or in addition to) a support vector machine. (You are also not allowed to search for

the labeled test data and submit that to Kaggle; that’s outright cheating.)

For example, to achieve higher positions on the Kaggle leaderboards, you may optionally add

more features or use a nonlinear SVM kernel. Spam is a particularly good dataset for playing

with feature engineering; one easy way to perform better in spam/ham is to add extra features with

HW1, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 6

featurize.py. Other examples of things you might investigate include SIFT and HOG features

for images, and a bag-of-words model for spam/ham. For reasons we’ll learn later this semester,

dropping features that have little or no predictive power will often improve your test performance

as much as adding the right new features. Although extensive creativity isn’t generally necessary

to get full points on an assignment, topping the Kaggle leaderboard gives your professor good

material for letters of recommendation.

Whatever creative ideas you apply, please explain what you did in your write-up. Cite any external

sources where you got ideas. If you have any questions about whether something is allowed or not,

ask on Piazza.

Remember to start early! Kaggle only permits two submissions per leaderboard per day. To help

you format the submission, please use check.py to run a basic sanity check on your submission

and save csv.py to help save your results.

To check your submission csv,

python check.py <competition name, eg. mnist> <submission csv file>

Your deliverable for this question has three parts. First, submit to all the Kaggle competitions

listed above. Second, include an explanation of what you tried, what worked, and what didn’t to

improve your accuracy. Finally, make sure to include all the code you used in the code appendix

and provide a reference to it.

6 Theory of Hard-Margin Support Vector Machines

A decision rule (or classifier) is a function r : Rd ! ±1 that maps a feature vector (test point) to

+1 (“in class”) or 1 (“not in class”). The decision rule for linear SVMs is of the form

r(x) =

8

>><

>>:

+1 if w · x + ↵ 0,

1 otherwise, (1)

where w 2 Rd and ↵ 2 R are the parameters of the SVM. The primal hard-margin SVM optimization problem (which chooses the parameters) is

min

w,↵ |w|

2 subject to yi(Xi · w + ↵) 1, 8i 2 {1,…, n}, (2)

where |w| = kwk2 = p

w · w.

We can rewrite this optimization problem by using Lagrange multipliers to eliminate the constraints. (If you’re curious to know what Lagrange multipliers are, the Wikipedia page is recommended, but you don’t need to understand them to do this problem.) We thereby obtain the

equivalent optimization problem

max

i0

min

w,↵ |w|

2

Xn

i=1

i(yi(Xi · w + ↵) 1). (3)

Note: i must be greater than or equal to 0.

HW1, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 7

(a) Show that Equation (3) can be rewritten as the dual optimization problem

max

i0

Xn

i=1

i 1

4

Xn

i=1

Xn

j=1

ijyiyjXi · Xj subject to Xn

i=1

iyi = 0. (4)

Hint: Use calculus to determine and prove what values of w and ↵ optimize Equation (3).

Explain where the new constraint comes from.

SVM software usually solves this dual optimization program, not the primal optimization program.

(b) Suppose we know the values ⇤

i and ↵⇤ that optimize Equation (3). Show that the decision rule

specified by Equation (1) can be written

r(x) =

8

>><

>>:

+1 if ↵⇤ + 1

2

Pn

i=1 ⇤

i yiXi · x 0,

1 otherwise. (5)

(c) Applying Karush–Kuhn–Tucker (KKT) conditions (See Wikipedia for more information.), any

pair of optimal primal and dual solutions w⇤

, ↵⇤

, ⇤ for linear, hard-margin SVM must satisfy

the following condition:

⇤

i (yi(Xi · w⇤ + ↵⇤

) 1) = 0 8i 2 {1,…, n}

This condition is called complementary slackness. Explain what this implies for points corresponding to ⇤

i > 0. What relationship do they have with the margin?

(d) The training points Xi for which ⇤

i > 0 are called the support vectors. In practice, we frequently encounter training data sets for which the support vectors are a small minority of the

training points, especially when the number of training points is much larger than the number

of features. Explain why the support vectors are the only training points needed to evaluate the

decision rule.

(e) Assume the training points Xi and labels yi are linearly separable. Using the original SVM

formulation (not the dual) prove that there is at least one support vector for each class, +1 and

1.

Hint: Use contradiction. Construct a new weight vector w0 = w/(1 + ✏/2) and corresponding

bias ↵0 where ✏ > 0. It is up to you to determine what ✏ should be based on the contradiction.

If you provide a symmetric argument, you need only provide a proof for one of the two classes.

HW1, ©UCB CS 189, Spring 2021. All Rights Reserved. This may not be publicly shared without explicit permission. 8