## Description

Stat 437 Project 2

Your Name (Your student ID)

General rules and information

Due by 11:59 PM, April 30, 2021. You must show your work in order to get points. Please prepare your report according to the rubrics on projects that are given in the syllabus. If a project report contains only code and its output and the project has a total of 100 points, a maximum of 25 points can be taken off. Please note that you need to submit the code that was used for your data analysis. Your report can be in .doc, .docx, .html or .pdf format.

The project will assess your skills in support vector machines and dimension reduction, for which visualization techniques you have learnt will be used to illustrate your findings. This project gives you more freedom to use your knowledge and skills in data analysis.

Task A: Analysis of gene expression data

For this task, you need to use PCA and Sparse PCA.

Data set and its description

Please download the data set “TCGA-PANCAN-HiSeq-801×20531.tar.gz” from the website https://archive.ics.uci.edu/ml/machine-learning-databases/00401/. A brief description of the data set is given at https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq.

You need to decompress the data file since it is a .tar.gz file. Once uncompressed, the data files are “labels.csv”, which contains the cancer type for each sample, and “data.csv”, which contains the “gene expression profile” (i.e., expression measurements of a set of genes) for each sample. Here each sample is for a subject and is stored in a row of “data.csv”. In fact, the data set contains the gene expression profiles for 801 subjects, each with a cancer type, where each gene expression profile contains the gene expressions for the same set of 20531 genes. The cancer types are: “BRCA”, “COAD”, “KIRC”, “LUAD” and “PRAD”. In both files “labels.csv” and “data.csv”, each row name records which sample a label or observation is for.

Data processing

Please use set.seed(123) for random sampling via the command sample.

• Filter out genes (from “data.csv”) whose expressions are zero for at least 300 subjects, and save the filtered data as R object “gexp2”.

• Use the command sample to randomly select 1000 genes and their expressions from “gexp2”, and save the resulting data as R object “gexp3”.

• Use the command scale to standardize the gene expressions for each gene in “gexp3”. Save the standardized data as R object “stdgexpProj2”.

You will analyze the standardized data.
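The three processing steps above can be sketched in base R as follows. This is only an illustration on a small simulated matrix, not the real “data.csv”: the names `gexp`, `zero_cutoff` and `n_keep` are made up for the example, and the cutoffs are scaled down from the 300 subjects and 1000 genes that the project asks for. With the real file, one would presumably load the data first, for instance with `read.csv("data.csv", row.names = 1)`.

```r
# Simulated stand-in for "data.csv": 40 subjects (rows) by 200 genes (columns).
# The real data has 801 subjects and 20531 genes; all sizes here are scaled down.
set.seed(123)
n_subj <- 40
n_gene <- 200
gexp <- matrix(rexp(n_subj * n_gene), nrow = n_subj,
               dimnames = list(paste0("sample_", 1:n_subj),
                               paste0("gene_", 1:n_gene)))
# Make the first 50 genes zero for 30 subjects so the filter has work to do.
gexp[1:30, 1:50] <- 0

zero_cutoff <- 15   # hypothetical; the project uses 300 (out of 801 subjects)
n_keep <- 100       # hypothetical; the project uses 1000

# Drop genes whose expression is zero for at least zero_cutoff subjects.
n_zero <- colSums(gexp == 0)
gexp2 <- gexp[, n_zero < zero_cutoff, drop = FALSE]

# Randomly select n_keep of the remaining genes (columns).
set.seed(123)
picked <- sample(ncol(gexp2), n_keep)
gexp3 <- gexp2[, picked, drop = FALSE]

# Standardize each gene (column) to mean 0 and standard deviation 1.
stdgexpProj2 <- scale(gexp3)
dim(stdgexpProj2)
```

Note that the filter keeps the genes with strictly fewer zeros than the cutoff, which is the complement of “zero for at least” that many subjects.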

Questions to answer when doing data analysis

Please also investigate and address the following when doing data analysis:

(1.a) Are there genes for which linear combinations of their expressions explain a significant proportion of the variation of gene expressions in the data set? Note that each gene corresponds to a feature, and a sample principal component (i.e., one computed from data) is a linear combination of the expression measurements for several genes.

(1.b) Ideally, a type of cancer should have its “signature”, i.e., a pattern in the gene expressions that is specific to this cancer type. From the “labels.csv”, you will know which expression measurements belong to which cancer type. Identify the signature of each cancer type (if any) and visualize it. For this, you need to be creative and should try both PCA and Sparse PCA.

(1.c) There are 5 cancer types. Would 5 principal components, obtained either from PCA or Sparse PCA, explain a dominant proportion of variability in the data set, and serve as the signatures of the 5 cancer types? Note that the same set of genes was measured for each cancer type.

Identify patterns and low-dimensional structures

Please implement the following:

(2.a) Apply PCA, determine the number of principal components, provide visualizations of low-dimensional structures, and report your findings. Note that you need to use “labels.csv” for the task of discovering patterns, such as whether different cancer types have distinct transformed gene expressions (that are represented by principal components). For PCA or Sparse PCA, low-dimensional structures are usually represented by the linear space spanned by some principal components.
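As a rough illustration of the PCA workflow described in (2.a), the sketch below uses base R's `prcomp` on simulated data with three made-up groups in place of the five cancer types. The 80% cumulative-variance cutoff used to pick the number of components is an arbitrary choice for the example, not a rule from the project; a scree plot or another criterion may suit the real data better.

```r
set.seed(123)
# Hypothetical stand-in: 60 samples, 30 features, 3 groups of 20 samples.
labels <- factor(rep(c("A", "B", "C"), each = 20))
x <- matrix(rnorm(60 * 30), nrow = 60)
x[labels == "B", 1:5]  <- x[labels == "B", 1:5] + 3    # shift group B
x[labels == "C", 6:10] <- x[labels == "C", 6:10] - 3   # shift group C
x <- scale(x)

# x is already standardized, so prcomp does not need to center or scale it.
pca <- prcomp(x, center = FALSE, scale. = FALSE)

# Proportion of variance explained by each component, and its running total.
pve <- pca$sdev^2 / sum(pca$sdev^2)
cum_pve <- cumsum(pve)
n_pc <- which(cum_pve >= 0.8)[1]   # illustrative cutoff only

# Scree plot, then the first two score vectors colored by group label.
plot(pve, type = "b", xlab = "Principal component",
     ylab = "Proportion of variance explained")
plot(pca$x[, 1:2], col = as.integer(labels), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(labels),
       col = seq_along(levels(labels)), pch = 19)
```

Coloring the score plot by label is one simple way to see whether groups separate in the space spanned by the leading components.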

(2.b) Apply Sparse PCA, provide visualizations of low-dimensional structures, and report your findings. Note that you need to use “labels.csv” for the task of discovering patterns. Your laptop may not have sufficient computational power to implement Sparse PCA with many principal components. So, please pick a value for the sparsity controlling parameter and a value for the number of principal components to be computed that suit your computational capabilities.
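The project does not name a Sparse PCA implementation. One possibility, assumed here purely for illustration, is the `spca` function from the `elasticnet` package, which must be installed separately. The sketch below runs it on small simulated data; with `sparse = "varnum"`, the sparsity parameter `para` gives the target number of nonzero loadings for each component, and the values of `K` and `para` are arbitrary.

```r
library(elasticnet)   # assumption: installed; not prescribed by the project

set.seed(123)
# Hypothetical stand-in: 60 samples by 20 standardized "genes".
x <- scale(matrix(rnorm(60 * 20), nrow = 60,
                  dimnames = list(NULL, paste0("gene_", 1:20))))

K <- 2   # number of sparse principal components (illustrative)
fit <- spca(x, K = K, para = c(5, 5), type = "predictor", sparse = "varnum")

fit$loadings                   # 20 x 2 loading matrix
colSums(fit$loadings != 0)     # count of nonzero loadings per component
scores <- x %*% fit$loadings   # scores on the sparse components
```

Keeping `K` and the number of nonzero loadings small is what keeps the computation manageable on a laptop; the scores can then be plotted by cancer type in the same way as ordinary PCA scores.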

(2.c) Do PCA and Sparse PCA reveal different low-dimensional structures for the gene expressions for different cancer types?

Task B: Analysis of SPAM emails data set

For this task, you need to use PCA and SVM.


Data set and its description

The spam data set “SPAM.csv” is attached and can also be downloaded from https://web.stanford.edu/~hastie/CASI_files/DATA/SPAM.html. More information on this data set can be found at: https://archive.ics.uci.edu/ml/datasets/Spambase. The column “testid” in “SPAM.csv” was used to train a model when the data set was used by other analysts and hence should not be used as a feature or the response; the column “spam” contains the true status for each email; and the rest contain measurements of features. Here each email is represented by a row of features in the .csv file, and a “feature” can be regarded as a “predictor”. Also note that the first 1813 rows, i.e., observations, of the data set are for spam emails, and the rest are for non-spam emails.

Data processing

Please do the following:

• Remove rows that have missing values. For a .csv file, usually a blank cell is treated as a missing value.

• Check for highly correlated features using the absolute value of sample correlation. Think about whether you should include all or some of the highly correlated features in an SVM model. For example, “crl.ave” (average length of uninterrupted sequences of capital letters), “crl.long” (length of longest uninterrupted sequence of capital letters) and “crl.tot” (total number of capital letters in the e-mail) may be highly correlated. Whether you choose to remove some highly correlated features from subsequent analysis or not, you need to provide a justification for your choice.

Note that each feature is stored in a column of the original data set and each observation in a row.

You will analyze the processed data set.
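The two processing steps for the spam data might look like the following base R sketch. It uses a small simulated data frame that borrows a few column names from the description above; the values, the two injected missing cells, and the 0.8 correlation cutoff are all made up for illustration. When reading the real file with `read.csv`, blank cells in numeric columns come in as `NA`; for character columns, passing `na.strings = c("", "NA")` is one way to get the same treatment.

```r
set.seed(123)
# Hypothetical stand-in for some feature columns of "SPAM.csv".
n <- 200
crl.tot <- rexp(n, rate = 1 / 300)
spam_df <- data.frame(
  crl.ave  = rexp(n, rate = 1 / 5),
  crl.long = 0.3 * crl.tot + rnorm(n, sd = 5),   # built to track crl.tot
  crl.tot  = crl.tot,
  make     = runif(n)
)
spam_df$make[c(3, 17)] <- NA   # two rows with a missing value

# Step 1: remove rows that contain any missing value.
spam_clean <- na.omit(spam_df)

# Step 2: absolute sample correlations; list pairs above the cutoff.
abs_cor <- abs(cor(spam_clean))
abs_cor[upper.tri(abs_cor, diag = TRUE)] <- 0   # keep each pair once
high <- which(abs_cor > 0.8, arr.ind = TRUE)    # 0.8 is an illustrative cutoff
data.frame(feature_1 = rownames(abs_cor)[high[, "row"]],
           feature_2 = colnames(abs_cor)[high[, "col"]],
           abs_cor   = abs_cor[high])
```

The resulting table only flags candidate pairs; whether to drop one feature from each pair is the judgment call that needs to be justified in the report.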

Classification via SVM

Please do the following:

(3.a) Use set.seed(123) wherever the command sample is used or cross-validation is implemented. Randomly select without replacement 300 observations from the data set and save them as training set “train.RData”, and then randomly select without replacement 100 observations from the remaining observations and save them as “test.RData”. You need to check if the training set contains observations from both classes; otherwise, no model can be trained.
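A minimal sketch of the split in (3.a), using base R only. The data frame `spam_clean` is simulated here and stands in for the processed spam data; the object names `train` and `test` and the use of `tempdir()` as the output folder are choices made for the example.

```r
set.seed(123)
# Hypothetical stand-in for the processed spam data (1000 emails).
spam_clean <- data.frame(
  x1 = rnorm(1000),
  x2 = rnorm(1000),
  spam = factor(rep(c("spam", "nonspam"), times = c(400, 600)))
)

set.seed(123)
n <- nrow(spam_clean)
train_idx <- sample(n, 300)   # sample() draws without replacement by default
# Draw the test rows only from rows that are not already in the training set.
test_idx <- sample(setdiff(seq_len(n), train_idx), 100)

train <- spam_clean[train_idx, ]
test  <- spam_clean[test_idx, ]

# Both classes must appear in the training set, or no classifier can be fit.
table(train$spam)
stopifnot(nlevels(droplevels(train$spam)) == 2)

out_dir <- tempdir()   # swap in your own project folder
save(train, file = file.path(out_dir, "train.RData"))
save(test,  file = file.path(out_dir, "test.RData"))
```

Sampling the test indices from `setdiff(seq_len(n), train_idx)` is what guarantees that no email lands in both sets.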

(3.b) Apply PCA to the training data “train.RData” and see if you find any pattern that can be used to approximately tell a spam email from a non-spam email.

(3.c) Use “train.RData” to build an SVM model with linear kernel, whose cost parameter is determined by 10-fold cross-validation, for which the features are predictors, the status of email is the response, and cost ranges in c(0.01,0.1,1,5,10,50). Apply the obtained optimal model to “test.RData”, and report via a 2-by-2 table on spams that are classified as spams or non-spams and on non-spams that are classified as non-spams or spams.
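One way to carry out (3.c) is with `tune` and `svm` from the `e1071` package; the project text does not name a package, so this is an assumption, and the package must be installed. The sketch below runs on simulated two-class data generated by a made-up helper `make_data`, not on the spam data.

```r
library(e1071)   # assumption: installed; not prescribed by the project

set.seed(123)
# Hypothetical helper that simulates two-class data with two features.
make_data <- function(n) {
  y <- factor(sample(c("nonspam", "spam"), n, replace = TRUE))
  data.frame(x1 = rnorm(n) + 2 * (y == "spam"),
             x2 = rnorm(n) - 2 * (y == "spam"),
             spam = y)
}
train <- make_data(300)
test  <- make_data(100)

set.seed(123)   # tune() assigns the cross-validation folds at random
cv_lin <- tune(svm, spam ~ ., data = train, kernel = "linear",
               ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 50)),
               tunecontrol = tune.control(sampling = "cross", cross = 10))
cv_lin$best.parameters          # cost with the lowest cross-validation error
best_lin <- cv_lin$best.model   # model refit on all training data at that cost

# 2-by-2 table of true status against predicted status on the test set.
pred <- predict(best_lin, newdata = test)
conf_tab <- table(truth = test$spam, predicted = pred)
conf_tab
```

The diagonal of `conf_tab` holds the correctly classified emails and the off-diagonal cells hold the two kinds of misclassification the question asks about.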

(3.d) Use “train.RData” to build an SVM model with radial kernel, whose “cost” parameter is determined by 10-fold cross-validation, for which the features are predictors, the status of email is the response, cost ranges in c(0.01,0.1,1,5,10,50), and gamma=c(0.5,1,2,3,4). Report the number of support vectors. Apply the obtained optimal model to “test.RData”, and report via a 2-by-2 table on spams that are classified as spams or non-spams and on non-spams that are classified as non-spams or spams.
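For the radial kernel in (3.d), the same `e1071::tune` approach (again an assumed package choice) can search over cost and gamma jointly, and the fitted model reports its support vector counts. The simulated data below has a circular class boundary purely so that a radial kernel has something to fit; none of it comes from the spam data.

```r
library(e1071)   # assumption: installed; not prescribed by the project

set.seed(123)
# Hypothetical data: class depends on distance from the origin.
n <- 300
x <- matrix(rnorm(2 * n), ncol = 2)
train <- data.frame(x1 = x[, 1], x2 = x[, 2],
                    spam = factor(ifelse(rowSums(x^2) > 1.5,
                                         "spam", "nonspam")))

set.seed(123)   # fixes the random 10-fold split (10 folds is tune's default)
cv_rad <- tune(svm, spam ~ ., data = train, kernel = "radial",
               ranges = list(cost  = c(0.01, 0.1, 1, 5, 10, 50),
                             gamma = c(0.5, 1, 2, 3, 4)))
cv_rad$best.parameters   # selected (cost, gamma) pair
best_rad <- cv_rad$best.model

best_rad$tot.nSV   # total number of support vectors
best_rad$nSV       # support vectors per class
```

The test-set table can then be produced with `predict` and `table` exactly as for the linear kernel, which makes the comparison in (3.e) a matter of placing the two tables side by side.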

(3.e) Compare and comment on the classification results obtained by (3.c) and (3.d).
