## Description

Stat 437 HW5

Your Name (Your student ID)

## General rule

Due by 11:59pm Pacific Standard Time, May 2, 2021. Please show your work and submit your computer code in order to get points. Correct answers provided without supporting details will not receive full credit. This HW covers

- support vector machines
- neural networks
- principal component analysis

You DO NOT have to submit your HW answers using typesetting software. However, your answers must be legible for grading. Please upload your answers to the course space.

For exercises from the Text, solutions may have been posted online. Please do not plagiarize those solutions.

## Conceptual exercises: I (support vector machines)

1.1) State the mathematical definition of a hyperplane. Describe the classification rule that is induced by a hyperplane. How does the classification rule involve the normal vector of the hyperplane? (Hint: you can use information on page 12 of Lecture Notes 6 to find the normal vector of a hyperplane and then use information on pages 4 and 5 of Lecture Notes 6.)

1.2) Consider a two-class classification problem where observations $\{(x_i, y_i)\}_{i=1}^n$ can be completely separated by a hyperplane. Consider a hyperplane $S = \{x \in \mathbb{R}^p : \langle x, \alpha \rangle + \beta_0 = 0\}$ with direction $\alpha$ and intercept $\beta_0$. Explain why the distance from $x_i$ to $S$ is

$$\mathrm{dist}(x_i, S) = y_i \left( \langle x_i, \alpha \rangle + \beta_0 \right)$$

when $\|\alpha\| = 1$. (Hint: you can read through pages 11 and 12 of Lecture Notes 6 and watch the corresponding lecture video clips.)
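
As a reading aid for getting started (a sketch of the standard projection argument, not a full answer; $x_0$ denotes an arbitrary point on $S$):

```latex
% For any x_0 \in S we have \langle x_0, \alpha \rangle = -\beta_0.
% Projecting x_i - x_0 onto the unit normal \alpha (with \|\alpha\| = 1)
% gives the signed distance from x_i to S:
\langle x_i - x_0, \alpha \rangle
  = \langle x_i, \alpha \rangle - \langle x_0, \alpha \rangle
  = \langle x_i, \alpha \rangle + \beta_0 .
% Multiplying by y_i \in \{-1, +1\} makes this quantity positive
% whenever x_i lies on the side of S that matches its label.
```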

1.3) Consider a two-class classification problem where observations $\{(x_i, y_i)\}_{i=1}^n$ can be completely separated by a hyperplane. Consider a hyperplane $S = \{x \in \mathbb{R}^p : \langle x, \alpha \rangle + \beta_0 = 0\}$. Why are there infinitely many separating hyperplanes for these observations? What is the optimization problem that the maximal margin classifier tries to solve? State the optimization problem mathematically and explain the meaning of each term in the mathematical formulation. (Hint: you need to first set up notations and then you can use information on page 13 of Lecture Notes 6.) Why does the optimization problem have the constraint $\|\alpha\| = 1$? (Hint: you can use a partial answer to 1.2) above.) Explain why the optimal hyperplane of the maximal margin classifier is equidistant from the two classes of observations. (Hint: you can use information on pages 9 and 10 of Lecture Notes 6.)

1.4) Consider a two-class classification problem where observations can be completely separated by a hyperplane. What are the support vectors of the maximal margin classifier? Explain how support vectors can be moved in ways that change the maximal margin classifier, and in ways that leave it unchanged.


1.5) Consider a two-class classification problem where observations $\{(x_i, y_i)\}_{i=1}^n$ cannot be completely separated by a hyperplane. What optimization problem does a support vector classifier (SVC) try to solve? State it mathematically and explain the meaning of each term in the mathematical formulation. Explain how the value of a slack variable reveals how its associated observation is classified by the resulting SVC, and explain how the value of the tolerance affects the classification of the $x_i$'s, the number of support vectors, and the margin of the resulting SVC. (Note: please do NOT just copy contents from the lecture notes and paste them as your answers.)

1.6) Consider a two-class classification problem where observations $\{(x_i, y_i)\}_{i=1}^n$ cannot be completely separated by a hyperplane. When constructing an SVC by solving the optimization problem via Lagrange multipliers, there is a "cost" parameter $C$. Explain how the value of the cost $C$ affects the classification of the $x_i$'s, the number of support vectors, and the margin of the resulting SVC. Is this $C$ the same as the tolerance mentioned in 1.5)?

1.7) Consider a two-class classification problem where training observations $\{(x_i, y_i)\}_{i=1}^n$ cannot be completely separated by a hyperplane. When the decision boundary between the two classes is nonlinear, what can you do to an SVC in order to deal with this situation, and what are some disadvantages of what you propose to do? Is it true that an SVM is able to deal with this situation and that it does so by implicitly enlarging the feature space using a kernel that can be different from the Euclidean inner product? Provide a linear representation of an SVM, and comment on how this representation is different from and similar to that for an SVC, respectively.

1.8) Describe how to conduct multi-class classification using SVMs.

## Conceptual exercises: II (neural networks)

2.1) Describe how derived features are obtained by a vanilla, feedforward neural network that has 3 layers in total and 1 hidden layer.
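
For orientation, in ESL-style notation (which may differ from that of the Lecture Notes), the derived features of a single hidden layer with $M$ units take the form:

```latex
Z_m = \sigma\left( \alpha_{0m} + \alpha_m^{T} X \right),
\qquad m = 1, \dots, M,
% where \sigma is the activation function (e.g. the sigmoid),
% \alpha_m \in \mathbb{R}^p holds the weights feeding hidden unit m,
% and \alpha_{0m} is its bias term.
```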

2.2) Provide a criterion that is used to train a neural network for classification and for regression, respectively.

2.3) What are some issues with training a neural network by optimizing a criterion you presented in 2.2), and how can they be dealt with?

## Conceptual exercises: III (principal component analysis)

Assume there are $p$ feature variables $X_1, \ldots, X_p$ that are stored in the vector $X = (X_1, \ldots, X_p)^T$. Let $\mathbf{X}$ be an $n \times p$ data matrix whose $i$th row is the $i$th observation on $X$. Assume the covariance matrix of $X$ is $\Sigma$.

3.1) Describe in detail the population version of principal component analysis (PCA).

3.2) Provide the sample covariance matrix of $X$ that is obtained from $\mathbf{X}$. Describe in detail the data version of PCA.

3.3) In the population version of PCA, the first principal component is a scalar random variable, whereas in the data version of PCA, we have $n$ scores for the first principal component. How are the first principal component and its $n$ scores related?


3.4) What does a biplot display? How can you discover patterns in data using a biplot?

3.5) When implementing the data version of PCA based on $\mathbf{X}$, is it recommended to center and scale the observations in $\mathbf{X}$? If so, how and why?

3.6) What is a criterion to use to choose the number of principal components?

3.7) Consider the scalar random variable $w = a^T X$ for $a \in \mathbb{R}^p$. We want to find the $a \in \mathbb{R}^p$ for which the variance of $w$ is maximized. Explain why $a$ should be an eigenvector associated with the largest eigenvalue $\lambda_1$ of $\Sigma$.
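
A sketch of the standard Lagrange-multiplier argument (a starting point, not a full answer):

```latex
% Maximize \operatorname{Var}(w) = a^T \Sigma a subject to a^T a = 1.
L(a, \lambda) = a^{T} \Sigma a - \lambda \left( a^{T} a - 1 \right),
\qquad
\frac{\partial L}{\partial a} = 2 \Sigma a - 2 \lambda a = 0
\;\Longrightarrow\;
\Sigma a = \lambda a .
% Hence a is an eigenvector of \Sigma with eigenvalue \lambda, and
% \operatorname{Var}(w) = a^T \Sigma a = \lambda, which is maximized
% by taking \lambda = \lambda_1.
```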

3.8) State the model and optimization problem PCA tries to solve when it is interpreted as the best linear approximation to $\mathbf{X}$ under the Frobenius norm among all subspaces of dimension $q < p$. How is this optimization problem related to regression modeling based on the least squares method?

## Applied exercises

Consider the data set iris from the R library ggplot2. Here is the instructor's ggplot2 version:

```r
packageVersion("ggplot2")
## [1] '3.1.0'
```

You can use `help(iris)` to obtain some help information on this data set, or you can do the following:

```r
library(ggplot2)
data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
unique(iris$Species)
## [1] setosa     versicolor virginica
## Levels: setosa versicolor virginica
```

From the iris data set, pick all observations for the subspecies setosa or versicolor. This gives a subset of 100 observations. From this subset, use set.seed(123) to randomly select 40 observations for each of the 2 subspecies, and put the 80 observations thus obtained into a training set. The remaining 20 observations in the subset then form a test set.
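
One way to sketch this split in R (a sketch, not a required approach; the exact rows drawn depend on your R version's sampling defaults):

```r
data(iris)

# Keep only setosa and versicolor: 100 observations in total
sub <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))

# Randomly pick 40 observations per subspecies for the training set
set.seed(123)
train_idx <- c(sample(which(sub$Species == "setosa"), 40),
               sample(which(sub$Species == "versicolor"), 40))
train <- sub[train_idx, ]
test  <- sub[-train_idx, ]
```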

(4.1) Build an SVM using the training set with cost $C = 0.1$ and apply the obtained model to the test set. Report classification results on the test set and provide needed visualizations. (Note: `plot` for `svm` is not designed for more than 2 features, i.e., when an svm is built using more than 2 features and you apply `plot` to an svm object, you will get an error.)
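
A minimal sketch for (4.1), assuming the e1071 package (the split is rebuilt here so the chunk runs on its own):

```r
library(e1071)   # provides svm(); an assumption about which package you use
data(iris)

# Same split as described above
sub <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))
set.seed(123)
train_idx <- c(sample(which(sub$Species == "setosa"), 40),
               sample(which(sub$Species == "versicolor"), 40))
train <- sub[train_idx, ]
test  <- sub[-train_idx, ]

# Linear SVM with cost C = 0.1
fit <- svm(Species ~ ., data = train, kernel = "linear", cost = 0.1)

# Test-set predictions and confusion matrix
pred <- predict(fit, newdata = test)
print(table(predicted = pred, truth = test$Species))

# plot.svm() shows only 2 features; fix the others via `slice`
plot(fit, train, Petal.Length ~ Petal.Width,
     slice = list(Sepal.Length = median(train$Sepal.Length),
                  Sepal.Width  = median(train$Sepal.Width)))
```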


(4.2) Build an SVM using the training set by 10-fold cross-validation and by setting set.seed(123), in order to find the optimal value for the cost $C$ from the range:

```r
ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100))
```

Apply the model to the test set, and report classification results on the test set. Do you think an SVM with a nonlinear decision boundary should be used for this classification task? If so, please use an SVM with a radial kernel whose parameters are determined by 10-fold cross-validation on the training set and by setting set.seed(123).
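
A sketch for (4.2) using e1071's `tune()`, whose default resampling is 10-fold cross-validation (again an assumption about the package; the split is rebuilt so the chunk is self-contained):

```r
library(e1071)
data(iris)

sub <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))
set.seed(123)
train_idx <- c(sample(which(sub$Species == "setosa"), 40),
               sample(which(sub$Species == "versicolor"), 40))
train <- sub[train_idx, ]
test  <- sub[-train_idx, ]

# 10-fold CV over the given cost grid (tune() defaults to 10-fold CV)
set.seed(123)
tuned <- tune(svm, Species ~ ., data = train, kernel = "linear",
              ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
pred <- predict(tuned$best.model, newdata = test)
print(table(predicted = pred, truth = test$Species))

# If a nonlinear boundary seems warranted, tune a radial kernel as well,
# e.g. over cost and gamma (this gamma grid is a hypothetical choice):
set.seed(123)
tuned_rbf <- tune(svm, Species ~ ., data = train, kernel = "radial",
                  ranges = list(cost  = c(0.001, 0.01, 0.1, 1, 5, 10, 100),
                                gamma = c(0.5, 1, 2)))
```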

(4.3) Use the training set, use set.seed(123), and apply 5-fold cross-validation to build an optimal neural network model with 2 hidden layers of 5 and 7 hidden neurons, respectively. Apply the optimal neural network model to the test set and report classification results. Note that you need to make sure you know how the class labels are ordered by R; this is explained in the lecture video “Stat 437 Video 27b: neural network example 1”.
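
One possible sketch of the network architecture in (4.3), assuming the neuralnet package (which accepts `hidden = c(5, 7)` for two hidden layers); the lecture may use a different package, and the 5-fold cross-validation step is omitted here, so this is only the model-fitting skeleton:

```r
library(neuralnet)   # an assumption; supports multiple hidden layers
data(iris)

sub <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))
set.seed(123)
train_idx <- c(sample(which(sub$Species == "setosa"), 40),
               sample(which(sub$Species == "versicolor"), 40))
train <- sub[train_idx, ]
test  <- sub[-train_idx, ]

# Encode the binary class label as 0/1; note which level maps to 1
train$is_versicolor <- as.numeric(train$Species == "versicolor")

set.seed(123)
nn <- neuralnet(is_versicolor ~ Sepal.Length + Sepal.Width +
                  Petal.Length + Petal.Width,
                data = train, hidden = c(5, 7),
                linear.output = FALSE, stepmax = 1e5)

# Predicted probabilities on the test features, thresholded at 0.5
prob <- compute(nn, test[, 1:4])$net.result[, 1]
pred <- ifelse(prob > 0.5, "versicolor", "setosa")
print(table(predicted = pred, truth = test$Species))
```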

(4.4) Apply PCA to all features of the full data set iris. Plot the first two principal components against each other, coloring each point on the plot by its corresponding subspecies. Do these principal components reveal any systematic pattern in the features for any subspecies? Plot the cumulative percent of variation explained by all (successively ordered) principal components.
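
A sketch for (4.4) with `prcomp` (centering and scaling here are a modeling choice you should justify in your answer):

```r
library(ggplot2)
data(iris)

# PCA on the four numeric features, centered and scaled
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# First two principal components, colored by subspecies
scores <- data.frame(pc$x, Species = iris$Species)
ggplot(scores, aes(x = PC1, y = PC2, color = Species)) + geom_point()

# Cumulative percent of variation explained by the successive PCs
cum_pve <- 100 * cumsum(pc$sdev^2) / sum(pc$sdev^2)
plot(cum_pve, type = "b", xlab = "Principal component",
     ylab = "Cumulative % of variation explained")
```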
