## Description

Problem Set 6

EECS 492: Machine Learning

Note that while this assignment has six tasks, several of the tasks are smaller or more straightforward than previous tasks. Start working early and you shouldn't have any issues completing this assignment by the deadline.

### Task 1: Decision Trees (20 points)

You are given a set of 8 training examples from a cupcake data set and are asked to construct a decision tree to classify them. The attributes are whether the cupcake has filling, is chocolate, has sprinkles, and is wheat. The label is a binary Yes or No as to whether Luke likes this cupcake. The following table shows each example along with its corresponding class and the values of the four abbreviated attributes.

| example | Fill | Choc | Spr | Wheat | Likes |
|---------|-------|-------|-------|-------|-------|
| x1 | True | False | True | True | Yes |
| x2 | False | False | False | False | No |
| x3 | False | True | False | True | Yes |
| x4 | False | True | True | True | Yes |
| x5 | True | False | True | False | No |
| x6 | False | False | True | False | Yes |
| x7 | True | False | False | True | No |
| x8 | True | False | False | False | Yes |

1. Using the decision-tree learning algorithm with splits determined by information gain, build a decision tree for this data. Show all information-gain calculations and draw the final (learned) decision tree. Break ties in information gain according to the order of the attributes in the table, preferring earlier columns.

2. Give a worst-case bound on the number of nodes in the decision tree learned by the method from part (1) from an arbitrarily large training set in which each example has k features (including the class) and b different values for each feature. Allow nodes to branch into b branches. [Assume nodes include both internal nodes and leaf nodes, where leaf nodes indicate the classification.]


3. Beyond the initial training examples, you are given a new set of examples, shown below; consider these examples for this problem.

| example | Fill | Choc | Spr | Wheat | Likes |
|---------|-------|-------|-------|-------|-------|
| x9 | False | True | False | False | Yes |
| x10 | False | True | True | True | No |

Does the decision tree you created correctly classify these examples?

4. (a) Does every decision tree that is consistent with a data set need to implement the same classification function? If so, briefly state why; if not, provide a counterexample.

(b) Similarly, is it possible for two decision trees to implement the same function and yet have different tree structures? If so, provide an example; if not, briefly explain why.

(c) Consider a set of T decision trees such that each tree t in the set achieves 100% classification accuracy on a set of training examples S. What fraction of all possible distinct examples needs to be in S to guarantee that all of the trees in T implement the same classification function?
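The information-gain calculations in part 1 can be sanity-checked mechanically. The sketch below (a minimal Python script, not a substitute for showing your work) encodes the table above and computes the gain of each attribute:

```python
from math import log2

# Training examples from the Task 1 table: (Fill, Choc, Spr, Wheat) -> Likes
examples = [
    ((True,  False, True,  True),  "Yes"),
    ((False, False, False, False), "No"),
    ((False, True,  False, True),  "Yes"),
    ((False, True,  True,  True),  "Yes"),
    ((True,  False, True,  False), "No"),
    ((False, False, True,  False), "Yes"),
    ((True,  False, False, True),  "No"),
    ((True,  False, False, False), "Yes"),
]
attributes = ["Fill", "Choc", "Spr", "Wheat"]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(data, attr_index):
    """Entropy of the whole set minus the weighted entropy of each split."""
    labels = [y for _, y in data]
    remainder = 0.0
    for value in {x[attr_index] for x, _ in data}:
        subset = [y for x, y in data if x[attr_index] == value]
        remainder += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - remainder

for i, name in enumerate(attributes):
    print(f"Gain({name}) = {information_gain(examples, i):.4f}")
```

The same two functions can be reused recursively on each split to verify the rest of the tree.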

### Task 2: Ensemble Learning (15 points)

1. Consider ensemble learning on the problem in Task 1, using the AdaBoost algorithm as given in R&N (page 751). Assume, as in the book, that our learning algorithm is limited to learning "decision stumps": decision trees that split on only one attribute, with each resulting leaf labeled with a class. Your ensemble learner should choose the split with the maximum information gain on the weighted examples, breaking ties by the order of the attributes in the table.

Use training examples x1 through x8 in the Task 1 table and answer the following questions. (Note: use natural log for calculating your answers.)

(a) What are the weights used for the training examples in the first iteration of the algorithm?

(b) Generate the initial decision stump (for k=1) given the AdaBoost algorithm. What attribute is chosen for splitting?

(c) What is the error for this decision-stump hypothesis? Enumerate the incorrectly classified examples.

(d) What is the weight (z) of this hypothesis?

(e) What classification would this 1-hypothesis ensemble give (using weighted majority and random tie-breaking) to the test examples from Task 1 (x9 and x10)? What is the success rate?

(f) What are the weights that should be assigned to each of the training examples for the next iteration? (Note: NOT x9 and x10.)

(g) Now generate the second decision stump (for k=2) given the AdaBoost algorithm. What attribute is chosen for splitting now?


(h) What is the error for this decision stump? Again, enumerate the incorrectly classified examples.

(i) What is the weight (z) of this second hypothesis?

(j) What classification would the resulting 2-hypothesis ensemble (using weighted majority and random tie-breaking) give to the test cases x9 and x10?
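For reference while checking parts (a)-(j), here is a minimal sketch of the R&N-style AdaBoost loop. The `learn_stump` argument is a hypothetical placeholder for your weighted-information-gain stump learner; hypothesis weights use natural log, per the note above:

```python
from math import log

def adaboost(examples, labels, learn_stump, K):
    """AdaBoost in the style of R&N. `learn_stump(examples, labels, w)` is a
    user-supplied learner (hypothetical here) returning h: example -> label."""
    n = len(examples)
    w = [1.0 / n] * n                          # (a) uniform initial weights
    hypotheses, z = [], []
    for _ in range(K):
        h = learn_stump(examples, labels, w)
        error = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        if error == 0:                         # perfect stump: give it full weight
            hypotheses.append(h)
            z.append(float("inf"))
            break
        for i, (x, y) in enumerate(zip(examples, labels)):
            if h(x) == y:
                w[i] *= error / (1.0 - error)  # down-weight correct examples
        total = sum(w)
        w = [wi / total for wi in w]           # renormalize
        hypotheses.append(h)
        z.append(log((1.0 - error) / error))   # natural log, per the note
    return hypotheses, z

def ensemble_classify(x, hypotheses, z):
    """Weighted-majority vote of the learned stumps."""
    votes = {}
    for h, zh in zip(hypotheses, z):
        votes[h(x)] = votes.get(h(x), 0.0) + zh
    return max(votes, key=votes.get)
```

Note that the update multiplies the weights of *correctly* classified examples by error/(1 - error) and then renormalizes, which is equivalent to up-weighting the mistakes.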

### Task 3: PAC-Learning (10 points)

Now, we will briefly investigate the principles of Probably Approximately Correct (PAC) learning for a Boolean classification problem. Let this problem have two binary attributes, one trinary attribute, and one 4-ary attribute. Assume a hypothesis is "bad" (i.e., not approximately correct) if it classifies more than two examples incorrectly. For the problems below, you may assume each example is equally likely to be seen.

1. What is the hypothesis space for this problem? How many hypotheses are in the initial hypothesis space (the hypothesis space before any examples are seen)? You may leave your answer in exponential form.

2. How many of the hypotheses in the initial hypothesis space are not bad (i.e., approximately correct)?

3. What is the probability that a randomly chosen hypothesis from the initial hypothesis space will be approximately correct?

4. What value of ε corresponds to the definition of approximately correct for this problem, assuming examples are all equally likely to be encountered?

5. What is the upper bound on the probability that a bad hypothesis will classify the first 12 training examples correctly for this problem?

6. After 40 (distinct) examples are observed, how large is the remaining hypothesis space?

7. What is the probability that a randomly chosen hypothesis that is consistent with the 40 (distinct) examples seen so far will be approximately correct? To better appreciate the benefit of having seen these examples, also answer the following: how many times larger is this probability than the probability from part 3 of choosing an approximately correct hypothesis without having seen any examples?
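The counting in parts 1-5 can be sanity-checked numerically. The sketch below reflects one reading of the setup (it assumes "bad" means erring on more than 2 of the distinct examples, and uses the standard (1 − ε)^N survival bound for a bad hypothesis); the justifications are still yours to write:

```python
from math import comb

# Instance space: one distinct example per joint assignment of the attributes.
num_examples = 2 * 2 * 3 * 4           # two binary, one trinary, one 4-ary
num_hypotheses = 2 ** num_examples     # every Boolean labeling of the instances

# "Not bad" = wrong on at most 2 of the distinct examples:
# choose which (0, 1, or 2) examples to misclassify.
num_not_bad = sum(comb(num_examples, k) for k in range(3))

# A bad hypothesis (error rate > eps) classifies N i.i.d. examples
# correctly with probability at most (1 - eps)**N.
eps = 3 / num_examples                 # smallest error rate that counts as bad
bound = (1 - eps) ** 12
print(num_examples, num_hypotheses, num_not_bad, bound)
```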

### Task 4: Artificial Neural Networks (15 points)

This problem asks you to design artificial neural networks that implement desired logical functions. Assume that all of the units are perceptrons and that the bias input to all units is -1. When specifying weights, you should use integers for all weights except the bias weights, which should be integers plus 0.5 (e.g., ..., -1.5, -0.5, 0.5, 1.5, ...). All of the units use a threshold function as the activation function.

The possible values for each input are 0 and 1, where 0 corresponds to false and 1 corresponds to true.

[Figure: a single perceptron unit C with inputs A and B and weights W_{A,C}, W_{B,C}, and W_{bias,C}.]

1. For the above network, find weights (W_{A,C}, W_{B,C}, W_{bias,C}) that implement the function A NOR B. That is, C will be activated if and only if the expression A NOR B is true for the values of the Boolean inputs A and B.

2. Now consider all Boolean functions of 2 variables. Which of these functions cannot be implemented by a single-layer perceptron network? Briefly justify your answer. (Hint: this shouldn't take a lot of work!)

3. Choose one of the functions you identified in part (2) that couldn't be implemented by a single-layer perceptron and was not identified in class, and use a multi-layer perceptron with one hidden layer of two nodes to implement it (using the weight conventions mentioned above). Please indicate which function you are implementing!

4. Consider a perceptron learning rule that contains a learning-rate parameter (see equation 18.7 in R&N). Give only one or two sentences for each of these questions.

(a) What is an advantage of a large learning rate?

(b) What is an advantage of a small learning rate?
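A quick way to verify a candidate weight assignment in parts 1 and 3 is to enumerate the four input combinations. The sketch below follows the conventions above (bias input fixed at -1, threshold activation); the AND example is only an illustration, not one of the required answers:

```python
def perceptron_output(w_a, w_b, w_bias, a, b):
    """Threshold unit: the bias input is fixed at -1, per the task conventions.
    With integer weights and half-integer bias weights, the activation is
    never exactly zero, so the threshold convention does not matter."""
    activation = w_a * a + w_b * b + w_bias * (-1)
    return 1 if activation >= 0 else 0

def implements(w_a, w_b, w_bias, target):
    """Check a candidate weight triple against a target truth table,
    given as target[(a, b)] for all four Boolean input pairs."""
    return all(perceptron_output(w_a, w_b, w_bias, a, b) == target[(a, b)]
               for a in (0, 1) for b in (0, 1))

# Illustration with AND (covered in class), not one of the task's answers:
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
print(implements(1, 1, 1.5, AND))   # weights W_A=1, W_B=1, W_bias=1.5
```

For part 3, the same `perceptron_output` can be composed: feed the two hidden-unit outputs into a third unit and check all four input pairs the same way.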

### Task 5: Naïve Bayes (20 points)

Suppose that you are in charge of sorting desserts before they are given to Luke. Luke is terrible at identifying pastries, so you need to build an automatic identification method. Because it must be implemented quickly, the agent will determine which type of dessert it is given using Naïve Bayes classification.

1. We need to determine the prior probability of each classification, p(C_k). The baker tells us that there are these, and only these, classifications of pastries (with their classification numbers): 20 Danishes (k=1), 27 Macaroons (k=2), 31 Strudels (k=3), and 22 Cannolis (k=4). From this information, what are the prior probabilities of each of the 4 classifications?

2. Suppose that Luke bursts through the door and demands his classified desserts immediately, so you can use only the simplest model, consisting of just the "Class" node, which can take on any of the 4 classification values (1 through 4). For each of the test examples below, how would an agent using this very simple (one node!) Bayesian network classify the example? (If any ties occur, break them in class order: 1, 2, 3, 4.) What is the accuracy of the classification (what fraction of the test cases are correctly classified)?

| Sample | Sweetness | Filled | Baker | Class |
|--------|------------|--------|--------|-------|
| x1 | Very | Yes | Ryen | 1 |
| x2 | Moderately | Yes | Ryen | 4 |
| x3 | Extremely | Yes | Mark | 3 |
| x4 | Extremely | No | Spruce | 1 |
| x5 | Extremely | Yes | Spruce | 3 |
| x6 | Very | Yes | Spruce | 3 |
| x7 | Moderately | No | Mark | 2 |

Luke wasn't particularly happy with your predictions from the last part, but he decided to give you one more chance, for he is a just and merciful TA. Now assume we have time to find better estimates for the likelihoods by creating the complete Naïve Bayes model, including CPTs.

Using earlier observations, a friendly math-loving baker has already identified the CPT entries for the attributes (Sweetness (S), Baker (B), Filled (F)) when the classification is Danishes or Macaroons, and these are given below. If an entry is not given, assume it has zero probability.

p(S=Extremely|C1)=0.4, p(S=Very|C1)=0.4, p(S=Moderately|C1)=0.2
p(F=Yes|C1)=0.4, p(F=No|C1)=0.6
p(B=Ryen|C1)=0.3, p(B=Spruce|C1)=0.5, p(B=Mark|C1)=0.2

p(S=Extremely|C2)=0.2, p(S=Very|C2)=0.2, p(S=Moderately|C2)=0.6
p(F=Yes|C2)=0.2, p(F=No|C2)=0.8
p(B=Ryen|C2)=0.2, p(B=Spruce|C2)=0.2, p(B=Mark|C2)=0.6

3. Now use the following classified examples to determine the likelihoods for Strudels and Cannolis in the same form as in the previous part.

| Sample | Sweetness | Filled | Baker | Class |
|--------|------------|--------|--------|-------|
| x10 | Moderately | No | Spruce | 3 |
| x11 | Extremely | Yes | Mark | 3 |
| x12 | Very | Yes | Spruce | 3 |
| x13 | Very | Yes | Spruce | 3 |
| x14 | Extremely | Yes | Spruce | 3 |
| x15 | Very | Yes | Mark | 3 |
| x16 | Extremely | No | Ryen | 3 |
| x17 | Extremely | Yes | Spruce | 3 |
| x18 | Extremely | Yes | Mark | 3 |
| x19 | Moderately | No | Spruce | 3 |


| Sample | Sweetness | Filled | Baker | Class |
|--------|------------|--------|--------|-------|
| x20 | Very | Yes | Ryen | 4 |
| x21 | Moderately | Yes | Spruce | 4 |
| x22 | Very | No | Ryen | 4 |
| x23 | Moderately | Yes | Ryen | 4 |
| x24 | Moderately | Yes | Mark | 4 |
| x25 | Very | Yes | Spruce | 4 |
| x26 | Moderately | Yes | Ryen | 4 |
| x27 | Very | Yes | Mark | 4 |

4. Now that we have estimates for the priors and likelihoods, we are ready to make our Naive Bayes predictions. Draw the Naive Bayes network and use your model to classify each of the pastries from part 2. Will Luke be happier with the classifications this time around?
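The part 4 predictions amount to computing, for each class c, the unnormalized posterior p(C=c) · Π_i p(x_i | C=c) and taking the argmax. The sketch below wires up the priors from part 1 and the given C1/C2 CPTs; the C3/C4 likelihoods are the ones you estimate in part 3, and are deliberately left out:

```python
# Priors from part 1: 20 + 27 + 31 + 22 = 100 pastries in total.
prior = {1: 0.20, 2: 0.27, 3: 0.31, 4: 0.22}

# Likelihoods given in the text for Danishes (C1) and Macaroons (C2).
# C3 and C4 must be estimated from the part-3 examples; omitted here.
likelihood = {
    1: {"S": {"Extremely": 0.4, "Very": 0.4, "Moderately": 0.2},
        "F": {"Yes": 0.4, "No": 0.6},
        "B": {"Ryen": 0.3, "Spruce": 0.5, "Mark": 0.2}},
    2: {"S": {"Extremely": 0.2, "Very": 0.2, "Moderately": 0.6},
        "F": {"Yes": 0.2, "No": 0.8},
        "B": {"Ryen": 0.2, "Spruce": 0.2, "Mark": 0.6}},
}

def score(c, s, f, b):
    """Unnormalized Naive Bayes posterior p(C=c) * prod_i p(x_i | C=c).
    Missing CPT entries default to zero probability, as the text specifies."""
    cpt = likelihood[c]
    return (prior[c] * cpt["S"].get(s, 0.0)
            * cpt["F"].get(f, 0.0) * cpt["B"].get(b, 0.0))

# Score test example x7 (Moderately, No, Mark) against the two given classes:
for c in (1, 2):
    print(c, score(c, "Moderately", "No", "Mark"))
```

Once you fill in the C3/C4 CPTs from part 3, classify by taking the class with the highest score (breaking ties in class order, per part 2).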

### Task 6: Q Learning (20 points)

Our reinforcement learning agent will balance exploration and exploitation, using a simple parameter to select which to do. When exploration is chosen, a random action is taken no matter what is known, allowing the agent to learn. When exploitation is chosen, the agent makes the best decision based on what it has learned, taking the action with the highest Q value in its current state; this lets the agent use the information it has learned to maximize reward, at the cost of potentially gaining less new information.

You will implement an agent that learns Q values by interacting with an environment from the previous homework assignment. One small difference: set PR = 0.8, so that the rover has only a 20 percent chance of breaking on rocks. Our Q-learning agent, however, knows substantially less than our agent from the last homework. It has no idea of the layout of the environment, nor any information other than the reward it just received, the current state, and what it has learned. We also let it know the total number of states (17), numbered 0-16, although this is not strictly required for the agent to learn the environment. Note that state 16 is the BROKEN state.

There are two key components of the program you will write. The first is a simulator that takes in an action and the current state and outputs the resulting state (in order to account for the randomness of breaking). The second is an agent that implements the following pseudocode but must ask the simulator what its actions actually result in. Using a programming language of your choice, implement the Q-learning algorithm according to the following pseudocode:

```
Initialize learning rate alpha = 0.05
Initialize discount parameter gamma = 0.95
Initialize explorationchance = 1.0
Initialize Q values to 0. This should include a Q value for each state-action
  pair. Remember that all 5 actions are available in each state.
Initialize trials = 0 and initialize iterations = 0
Initialize action = "NONE"
Initialize learning = True

While learning:
    Read newstate and reward (which can be garbage on the first round)
    if action in ["N", "S", "E", "W", "R"]:
        Find the Q value for state and action, called q here
        Find the Q value for newstate with its maximal-Q action, q'
        Update q according to: q = q + alpha * (reward + gamma * q' - q)
        set state = newstate
        if there have been 1000 trials:
            set learning = False
        choose exploration or exploitation according to explorationchance
        if exploitation:
            set action as the action with the highest Q value for newstate
        if exploration:
            set action to one of "N", "S", "E", "W", or "P" randomly
        if not learning:
            set action = "DONE_SIGNAL"
    elif newstate received was Terminal (simulator sent "16"):
        set action = "RESTART_SIGNAL"
        set action = "NONE"
        set iterations = 0
        increment trials
        explorationchance = trials^(-0.1)    # FOR PART 1
    elif iterations == 15 (the time horizon for each trial):
        set action = "RESTART_SIGNAL"
        set action = "NONE"
        set iterations = 0
        increment trials
        explorationchance = trials^(-0.1)    # FOR PART 1
    increment iterations
    commit to the action, compute information on result and reward
output the Q values learned. You may also automatically find policies, etc.
```
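The core of the agent is the single temporal-difference update in the pseudocode. The sketch below shows that update and the exploration/exploitation choice in Python. The action list assumes the five actions are "N", "S", "E", "W", and "P" (the pseudocode's exploration branch), and the state and reward values used at the bottom are placeholders, not values from the actual environment:

```python
import random

# Assumed action set, taken from the exploration branch of the pseudocode.
ACTIONS = ["N", "S", "E", "W", "P"]

def q_update(Q, state, action, reward, newstate, alpha=0.05, gamma=0.95):
    """One temporal-difference update, matching the pseudocode:
    q <- q + alpha * (reward + gamma * max_a' Q(newstate, a') - q)."""
    best_next = max(Q[(newstate, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def choose_action(Q, state, exploration_chance):
    """Explore with probability exploration_chance; otherwise exploit by
    taking the action with the highest Q value in the current state."""
    if random.random() < exploration_chance:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

# Initialize a Q value of 0 for every state-action pair (17 states, 5 actions).
Q = {(s, a): 0.0 for s in range(17) for a in ACTIONS}
# Placeholder transition: state 5, action "N", reward -1, landing in state 9.
q_update(Q, 5, "N", reward=-1.0, newstate=9)
print(Q[(5, "N")])
```

Your full agent wraps these two functions in the trial/iteration bookkeeping from the pseudocode and queries your simulator for `newstate` and `reward`.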

1. After implementing the above algorithm, report the learned policy action for each non-terminal state. Calculate the state number of a state (x, y) as x + 4y (e.g., the robot in position (1, 1) would be in state number 5, using the notation from hw5). Also report the reward obtained by your agent in the last 10 trials it runs, and the average over all trials. Pick a random non-terminal starting location for each trial. A new trial also starts whenever "RESTART_SIGNAL" is sent.

2. Replace both explorationchance updates (marked with FOR PART 1 in the pseudocode) with a new update rule (comment out the part 1 rule for now; you'll want it later) that causes your agent to explore too often. Your update should be of the form explorationchance = trials^(-x). Report the x you used. Report the learned policy action and its Q value for each non-terminal state, along with the rewards obtained in each of the last 10 trials and the average reward. Consider these results and the value iteration policy below, and justify why you believe the agent is using too much exploration and too little exploitation.


3. Now replace both explorationchance updates with a rule (comment out the part 1 and part 2 rules so we can still see them in your submission) that causes your agent to exploit too often, in the same format as part 2. Report the x you used. Report the learned policy action and its Q value for each non-terminal state, along with the rewards obtained in each of the last 10 trials and the average reward. Consider these results and the value iteration policy below, and justify why you believe the agent is using too much exploitation and too little exploration.
