## Description

CSE4334/5334 Data Mining Assignment 1

What to turn in:

1. Your submission should include your complete code base in an archive file (zip, tar.gz), organized by problem (q1/, q2/, and so on), and a very clear README describing how to run it.

2. A brief report (typed up and submitted as a PDF file, NO handwritten scanned copies) describing what you solved, what you implemented, and any known failure cases.

3. Submit your entire code and report to Blackboard.

Notes from instructor:

• Start early!

• You may ask the TA or instructor for suggestions, and discuss the problem with others (minimally). But all parts of the submitted code must be your own.

• Use Matlab or Python for your implementation.

• Make sure that the TA can easily run the code by plugging in our test data.

Problem 1

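(k-means, 40pts) Generate 2 sets of 2-D Gaussian random data, each set containing 500 samples, using the parameters below.

$$
\mu_1 = [1, 0], \quad \mu_2 = [0, 1.5], \quad
\Sigma_1 = \begin{bmatrix} 0.9 & 0.4 \\ 0.4 & 0.9 \end{bmatrix}, \quad
\Sigma_2 = \begin{bmatrix} 0.9 & 0.4 \\ 0.4 & 0.9 \end{bmatrix}
\tag{1}
$$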
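The sampling step can be sketched in Python with NumPy as follows. This is only a minimal illustration: the function name generate_data, the fixed seed, and the choice to stack both sets into a single array are assumptions, not requirements of the assignment.

```python
import numpy as np

# Parameters copied from equation (1).
mu1 = np.array([1.0, 0.0])
mu2 = np.array([0.0, 1.5])
sigma1 = np.array([[0.9, 0.4], [0.4, 0.9]])
sigma2 = np.array([[0.9, 0.4], [0.4, 0.9]])


def generate_data(n_per_set=500, seed=0):
    """Draw n_per_set samples from each Gaussian and stack them row-wise.

    generate_data and seed are hypothetical names used for illustration;
    the seed only makes the draw reproducible.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.multivariate_normal(mu1, sigma1, size=n_per_set)
    x2 = rng.multivariate_normal(mu2, sigma2, size=n_per_set)
    return np.vstack([x1, x2])  # rows 0..n-1 come from set 1, the rest from set 2


X = generate_data()
print(X.shape)  # (1000, 2)
```

Keeping the two sets in a known row order makes it easy to color the points by their true source in a scatter plot later.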

1. (20pts) Write a function cluster = mykmeans(X, k, c) that clusters data $X \in \mathbb{R}^{n \times p}$ (n is the number of objects and p the number of attributes) into k clusters. Here c holds the initial centers; although this is usually not necessary, we will need it to test your function. Terminate the iteration when the ℓ2-norm between a previous center and an updated center is ≤ 0.001 or the number of iterations reaches 10000.

2. (10pts) Apply your code to the data generated above with k = 2 and initial centers c1 = (10, 10) and c2 = (−10, −10). In your report, report the centers found for each cluster. How many iterations did it take? Show a scatter plot of the data and the centers of clusters found.

3. (10pts) Apply your code to the data generated above with k = 4 and initial centers c1 = (10, 10), c2 = (−10, −10), c3 = (10, −10), and c4 = (−10, 10). In your report, report the centers found for each cluster. How many iterations did it take? Show a scatter plot of the data and the centers of clusters found.

Problem 2

(Non-parametric density estimation, 60pts)

1. (30pts) Write a function [p, x] = mykde(X, h) that performs kernel density estimation on X with bandwidth h. It should return the estimated density p(x) and the domain x over which p(x) was estimated, and it should work for X in both 1-D and 2-D.

2. (10pts) Generate N = 1000 Gaussian random samples with µ1 = 5 and σ1 = 1. Test your function mykde on this data with h = {0.1, 1, 5, 10}. In your report, show the histogram of X along with the figures of the estimated densities.

3. (10pts) Generate N = 1000 Gaussian random samples with µ1 = 5 and σ1 = 1, and another set of Gaussian random samples with µ2 = 0 and σ2 = 0.2. Test your function mykde on this data with h = {0.1, 1, 5, 10}. In your report, show the histogram of X along with the figures of the estimated densities.

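4. (10pts) Generate 2 sets of 2-D Gaussian random data with N1 = 500 and N2 = 500 using the following parameters:

$$
\mu_1 = [1, 0], \quad \mu_2 = [0, 1.5], \quad
\Sigma_1 = \begin{bmatrix} 0.9 & 0.4 \\ 0.4 & 0.9 \end{bmatrix}, \quad
\Sigma_2 = \begin{bmatrix} 0.9 & 0.4 \\ 0.4 & 0.9 \end{bmatrix}.
\tag{2}
$$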

Test your function mykde on this data with h = {0.1, 1, 5, 10}. In your report, show figures of the estimated densities.
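A kernel density estimator with the interface from item 1 could be sketched as follows. The assignment does not name a kernel, so this sketch assumes an isotropic Gaussian kernel. The grid_size parameter and the choice of a regular evaluation grid padded by 3h beyond the data range are also assumptions made for illustration.

```python
import numpy as np


def mykde(X, h, grid_size=50):
    """Gaussian-kernel density estimate for 1-D or 2-D data.

    Returns (p, x): x has shape (grid_size**d, d) and p[i] is the
    estimated density at x[i]. The kernel and the grid are assumptions.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a vector as n samples with one attribute
    n, d = X.shape
    # Evaluation domain: a regular grid padded by 3h on each side of the data.
    axes = [np.linspace(X[:, j].min() - 3 * h, X[:, j].max() + 3 * h, grid_size)
            for j in range(d)]
    mesh = np.meshgrid(*axes, indexing="ij")
    x = np.stack([m.ravel() for m in mesh], axis=1)
    # Squared distance between each grid point and each sample: shape (m, n).
    d2 = ((x[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Average of Gaussian bumps of width h, normalized so each integrates to 1.
    norm = n * (np.sqrt(2 * np.pi) * h) ** d
    p = np.exp(-d2 / (2 * h ** 2)).sum(axis=1) / norm
    return p, x


# 1-D check: the estimate should integrate to roughly 1 over the grid.
rng = np.random.default_rng(0)  # seed chosen only for reproducibility
samples = rng.normal(5.0, 1.0, size=1000)
p, x = mykde(samples, h=1.0, grid_size=200)
dx = x[1, 0] - x[0, 0]
print(p.sum() * dx)  # should be close to 1
```

For 2-D input the same call returns a flattened grid; reshaping p to (grid_size, grid_size) gives an array suitable for a contour or surface plot.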