## Description

Stat 437 HW3

Your Name (Your student ID)

General rule

Due by 11:59pm Pacific Standard Time, March 4, 2021. Please show your work and submit your computer code in order to get points. Providing correct answers without supporting details does not receive full credit. This HW covers

• K-means clustering

• Hierarchical clustering

You DO NOT have to submit your HW answers using typesetting software. However, your answers must be legible for grading. Please upload your answers to the course space.

Conceptual exercises

1. Consider the K-means clustering methodology.

1.1) Give a few examples of dissimilarity measures that can be used to measure how dissimilar two observations are. What is the main disadvantage of the squared Euclidean distance as a dissimilarity measure?

1.2) Is it true that standardization of data should be done when features are measured on very different scales? Is it true that employing more features gives more accurate clustering results? Is it true that employing standardized observations gives more accurate clustering results than employing non-standardized ones? Explain each of your answers.

1.3) Take K = 2. Provide the loss function that K-means clustering tries to minimize. You need to provide the definition and meaning of each term that appears in the loss function.

1.4) What is the “centroid” for a cluster? Is the algorithm, Algorithm 10.1 on page 388 of the Text (which is also provided in the lecture slides), guaranteed to converge to the global minimum of the loss function? Why or why not? What does the argument nstart refer to in the command kmeans? Why is nstart suggested to take a relatively large value? Why do you need to set a random seed by set.seed() before you apply kmeans?
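As a point of reference for the commands named in 1.4, the following is a minimal sketch of how set.seed() and kmeans are typically called together. The simulated matrix x and the object names fit_a, fit_b, same_fit and best_wss are assumptions made for illustration only; they are not part of the assignment data, and the sketch does not replace the explanations the question asks for.

```r
# Toy data (hypothetical): two groups of 50 points each in 2 dimensions.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# Fit K-means twice with the same seed and the same arguments.
set.seed(1)
fit_a <- kmeans(x, centers = 2, nstart = 20)
set.seed(1)
fit_b <- kmeans(x, centers = 2, nstart = 20)

# With the seed reset before each call, the cluster assignments coincide.
same_fit <- identical(fit_a$cluster, fit_b$cluster)

# Total within-cluster sum of squares of the returned fit.
best_wss <- fit_a$tot.withinss
```

Comparing fits obtained with and without resetting the seed, or with different values of nstart, is one way to explore the behavior the question refers to.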

1.5) Suppose there are 2 underlying clusters but you set the number of clusters to a value other than 2 and apply kmeans. Will you obtain good clustering results? Why or why not?

1.6) Is the true number K0 of clusters in data known? When using the command clusGap to estimate K0, what does its argument B refer to?

2. Consider hierarchical clustering.

2.1) What are some advantages of hierarchical clustering over K-means clustering? What is the relationship between the dissimilarity between two clusters and the height of these clusters in the dendrogram that represents a bottom-up tree?


2.2) Explain what is meant by saying that “the clusters obtained at different heights from a dendrogram are nested”. If a data set has two underlying clustering structures that can be obtained by two different criteria, will these two sets of clusters necessarily be nested? Explain your answer.

2.3) Why is the distance based on Pearson’s sample correlation not affected by the magnitude of observations in terms of Euclidean distance? What is the definition of average linkage? Why are average linkage and complete linkage preferred to single linkage in practice?

2.4) What does the command scale do? Does scale apply row-wise or column-wise? When scale is applied to a variable, what will happen to the observations of the variable?

2.5) What is hclust$height? How do you find the height at which to cut a dendrogram in order to obtain 5 clusters?

2.6) When creating a dendrogram, what are some advantages of the command ggdendrogram{ggdendro} over the R base command plot?

Visualizing clustering results as, e.g., done by Example 1 in “LectureNotes3_notes.pdf”.

Applied exercises

3. Please refer to the NYC flight data nycflights13 that has been discussed in the lecture notes and whose manual can be found at https://cran.r-project.org/web/packages/nycflights13/index.html. We will use flights, a tibble from nycflights13.

Select from flights the observations that are for the 3 carriers “UA”, “AA” or “DL”, for months 7 and 2, and for the 4 features dep_delay, arr_delay, distance and air_time. Let us see whether we can use the 4 features to identify whether an observation belongs to a specific carrier or a specific month. The following tasks and questions are based on the extracted observations. Note that you need to remove NAs from the extracted observations.
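The extraction step described above can be sketched as follows. This is one possible approach in base R, not a required one: the helper name extract_obs and the small data frame toy are hypothetical and exist only for illustration, and keeping the carrier and month columns alongside the 4 features is an assumption made so that the true labels remain available later. The call on nycflights13::flights runs only if that package is installed.

```r
# Hypothetical helper: keep the 3 carriers, months 2 and 7, the 4 features
# plus the label columns, and drop rows that contain any NA.
extract_obs <- function(df) {
  feats <- c("dep_delay", "arr_delay", "distance", "air_time")
  keep <- df$carrier %in% c("UA", "AA", "DL") & df$month %in% c(2, 7)
  out <- df[keep, c("carrier", "month", feats)]
  out[stats::complete.cases(out), ]
}

# Small made-up data frame with the same column names as flights.
toy <- data.frame(
  carrier   = c("UA", "AA", "DL", "B6", "UA", "DL"),
  month     = c(7, 2, 7, 2, 3, 2),
  dep_delay = c(5, NA, -2, 10, 0, 3),
  arr_delay = c(8, 1, -5, 12, 2, NA),
  distance  = c(1400, 733, 762, 1028, 2475, 502),
  air_time  = c(200, 110, 116, 150, 330, 80),
  stringsAsFactors = FALSE
)
ds_toy <- extract_obs(toy)

# On the real data, if the package is available:
if (requireNamespace("nycflights13", quietly = TRUE)) {
  ds <- extract_obs(nycflights13::flights)
}
```

An equivalent selection can be written with dplyr verbs such as filter and select if you prefer the style used in the lecture notes.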

3.1) Apply K-means with K = 2 and with K = 3, in both cases with set.seed(1) and nstart=20. For K = 3, provide visualization of the clustering results based on true clusters given by carrier, whereas for K = 2, provide visualization of the clustering results based on true clusters given by month. Summarize your findings based on the clustering results. You can use the same visualization scheme that is provided by Example 2 in “LectureNotes3_notes.pdf”. Try visualization based on different sets of 2 features if your visualization has overlaid points.

3.2) Use set.seed(123) to randomly extract 50 observations, and to these 50 observations, apply hierarchical clustering with average linkage. (i) Cut the dendrogram to obtain 3 clusters with leaves annotated by carrier names and resulting clusters colored distinctly, and report the corresponding height of cut. (ii) In addition, cut the dendrogram to obtain 2 clusters with leaves annotated by month numbers and resulting clusters colored distinctly, and report the corresponding height of cut.

Here are some hints. Suppose you save the randomly extracted 50 observations into an object ds3sd. For these observations, save their carrier names, keeping their object type, but save their month numbers as a character vector. Make sure that ds3sd is a matrix, transpose ds3sd into tmp, assign to tmp column names given by the corresponding carrier names or month numbers, and then transpose tmp and save it as ds3sd. This way, you are done assigning cluster labels to each observation in ds3sd, and you are ready to use the commands in the file Plotggdendro.r to create the desired dendrograms.
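The labeling steps in the hints can be sketched as below. The 50 simulated rows stand in for the real extracted observations and are an assumption for illustration, as are the names carrier, month_chr and hc. The sketch stops at hclust and does not reproduce the contents of Plotggdendro.r.

```r
# Stand-in data (hypothetical): 50 rows with the 4 feature columns.
set.seed(123)
n <- 50
carrier <- sample(c("UA", "AA", "DL"), n, replace = TRUE)      # character labels
month_chr <- as.character(sample(c(2L, 7L), n, replace = TRUE)) # months as character
ds3sd <- cbind(dep_delay = rnorm(n), arr_delay = rnorm(n),
               distance = rnorm(n), air_time = rnorm(n))

# Make sure ds3sd is a matrix, then label observations via the transpose.
ds3sd <- as.matrix(ds3sd)
tmp <- t(ds3sd)              # observations are now columns
colnames(tmp) <- carrier     # use month_chr here instead for part (ii)
ds3sd <- t(tmp)              # observations are rows again, now with labels

# The row labels carry through to the leaves of the tree.
hc <- hclust(dist(ds3sd), method = "average")
```

After these steps, rownames(ds3sd) and hc$labels both hold the carrier names, so any dendrogram drawn from hc shows them at the leaves.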
