## Description

Assignment 4: Solving Markov Decision Processes

Artificial Intelligence

In this assignment, you will implement methods to solve a Markov Decision Process (MDP) for an optimal policy.

In the textbook [AIMA 3e], Markov Decision Processes are defined in Section 17.1, and Section 17.2 describes the Value Iteration approach to solving an MDP. Read these two sections very carefully. Section 17.3 describes the Policy Iteration approach, which is related but is not used in this assignment.

If you have access to Sutton and Barto’s Reinforcement Learning (1998), chapters 3 and 4 describe essentially the same material from a different perspective. Reading them is also helpful for understanding how this material fits into the broader reinforcement learning picture.

There will be no provided code: you must write and debug your own code for this assignment, in either C++ or Python. However, if you need ideas, you may consult the Java code from the AIMA code repository (http://aima.cs.berkeley.edu/code.html).

## Assignment

1. Explore the impact of the “cost of living” R(s). This is based on Exercise 17.5 in the textbook. Implement an MDP model of the environment shown in Figure 17.1 (the discount factor γ is 1). Implement a solver that uses value iteration to find the optimal policy, given a value for R(s) < 0. Implement a binary search to identify the eight negative threshold values of R(s) at which the optimal policy changes (the ninth threshold is zero). In the output, print only 4 decimal places of accuracy, i.e., your solution should be within ±0.0001 of the correct value. Find the nine policies associated with the nine negative intervals between thresholds.

Note 1: Some of the values listed in Figure 17.2 and in the accompanying text may be incorrect in the least significant decimal places.

Note 2: The states with +1 and -1 reward are terminal states. The utility of these states is the same as their reward, and should not be updated with value iteration.

Note 3: Figure 17.4 contains pseudocode for value iteration. Since in our case γ = 1, we can’t use their convergence test. Instead, use δ < 0.0000001 for convergence, where δ is the maximum change in utility. Value iteration converges rapidly, and this should take no more than 30 iterations.
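The model and update described above can be sketched as follows. This is a minimal illustration, not the required implementation: all names (`STATES`, `transitions`, `value_iteration`, and so on) are my own, and it assumes the standard Figure 17.1 transition model (the intended move succeeds with probability 0.8 and slips to each perpendicular direction with probability 0.1, staying put on a bump into a wall or edge).

```python
# Sketch of the 4x3 grid MDP and value iteration (illustrative names).
# States are (column, row) pairs; (2, 2) is the obstacle.

WALL = (2, 2)
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]
ACTIONS = {'^': (0, 1), 'V': (0, -1), '<': (-1, 0), '>': (1, 0)}

def move(s, d):
    """Deterministic result of trying to move from s in direction d."""
    nxt = (s[0] + d[0], s[1] + d[1])
    return nxt if nxt in STATES else s  # bump into wall or edge: stay put

def transitions(s, a):
    """(probability, next_state) pairs: 0.8 intended, 0.1 each perpendicular."""
    dx, dy = ACTIONS[a]
    return [(0.8, move(s, (dx, dy))),
            (0.1, move(s, (dy, dx))),
            (0.1, move(s, (-dy, -dx)))]

def value_iteration(r, gamma=1.0, eps=1e-7):
    """Iterate Bellman updates until the max utility change is below eps."""
    U = {s: TERMINALS.get(s, 0.0) for s in STATES}
    while True:
        delta, new_U = 0.0, dict(U)
        for s in STATES:
            if s in TERMINALS:
                continue  # terminal utilities stay fixed (Note 2)
            best = max(sum(p * U[t] for p, t in transitions(s, a))
                       for a in ACTIONS)
            new_U[s] = r + gamma * best
            delta = max(delta, abs(new_U[s] - U[s]))
        U = new_U
        if delta < eps:  # Note 3: convergence test on max change
            return U

def policy(U):
    """Greedy policy with respect to the utilities U."""
    return {s: max(ACTIONS, key=lambda a: sum(p * U[t]
                                              for p, t in transitions(s, a)))
            for s in STATES if s not in TERMINALS}
```

With R(s) = −0.04 this reproduces utilities close to those in Figure 17.3 (e.g. roughly 0.705 for state (1, 1)), which is a useful sanity check before running the binary search.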

As an example, here’s a display of the policy for R(s) = −0.04, laying out the actions in a 4×3 array corresponding to the states in the environment (use the character ‘V’ for a down action, and make sure to have spaces between all the states in a row).

```
> > >  +1
^ X ^  -1
^ < < <
```
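The threshold search for problem 1 can be sketched as a bisection over R(s). The `policy_for` below is a hypothetical stand-in (a step function that changes at an arbitrary value, r = −0.0221, chosen for illustration only); in the real solver it would run value iteration at reward r and return the resulting optimal policy, e.g. as a tuple of actions.

```python
# Binary-search sketch for locating one policy-change threshold of R(s).

def policy_for(r):
    """Stand-in for: run value iteration at reward r, return the policy."""
    return 'A' if r > -0.0221 else 'B'  # illustrative step function

def find_threshold(lo, hi, tol=1e-5):
    """Shrink [lo, hi] around the single policy change it brackets
    (policy_for(lo) != policy_for(hi)) until it is narrower than tol."""
    p_hi = policy_for(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if policy_for(mid) == p_hi:
            hi = mid   # the change happens below mid
        else:
            lo = mid   # the change happens at or above mid
    return (lo + hi) / 2.0

t = find_threshold(-0.1, 0.0)  # well within the required ±0.0001
```

A tolerance of 1e-5 comfortably meets the ±0.0001 accuracy requirement; in the full solver you would first scan an interval of R(s) values to bracket each of the eight changes, then bisect each bracket.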

Question 1: Why does the optimal policy for R(s) = −0.04 say to move left from state (3, 1), even though the utility of state (3, 2) is higher than that of state (2, 1) in Figure 17.3? Show your calculations.

2. In the same environment as problem 1, explore the difference between expected reward (average over many runs) and actual reward (on a particular run), given that the result of an action is uncertain.

For the case of R(s) = −0.04, find and display the utilities of all the states to three decimal places. Implement a Monte Carlo simulation in which an agent starts at state (4, 1) and follows the optimal policy until it reaches a terminal state. Create three histograms of the final rewards earned by the agent for 10, 100, and 1000 runs, and compute the means and standard deviations of these runs. The final reward of a run is just the sum of the rewards of the states the agent is in at each time step.
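A single run of the simulation described above can be sketched as follows. All names are illustrative; the optimal policy for R(s) = −0.04 is hard-coded here (in your program it should come from your value-iteration solver), and the transition model is the Figure 17.1 one (0.8 intended, 0.1 each perpendicular slip, stay put on a bump).

```python
import random

# Monte Carlo sketch for problem 2 (illustrative names and hard-coded policy).
WALL, TERMINALS = (2, 2), {(4, 3): 1.0, (4, 2): -1.0}
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]
MOVES = {'^': (0, 1), 'V': (0, -1), '<': (-1, 0), '>': (1, 0)}
POLICY = {(1, 1): '^', (2, 1): '<', (3, 1): '<', (4, 1): '<',
          (1, 2): '^', (3, 2): '^',
          (1, 3): '>', (2, 3): '>', (3, 3): '>'}

def step(s, a, rng):
    """Sample the next state: 0.8 intended direction, 0.1 each perpendicular."""
    dx, dy = MOVES[a]
    dx, dy = rng.choices([(dx, dy), (dy, dx), (-dy, -dx)],
                         weights=[0.8, 0.1, 0.1])[0]
    nxt = (s[0] + dx, s[1] + dy)
    return nxt if nxt in STATES else s  # bump: stay put

def run(rng, r=-0.04, start=(4, 1)):
    """Total reward of one run: R(s) for each non-terminal state visited,
    plus the reward of the terminal state reached."""
    s, total = start, 0.0
    while s not in TERMINALS:
        total += r
        s = step(s, POLICY[s], rng)
    return total + TERMINALS[s]

rng = random.Random(0)  # seeded for reproducibility
rewards = [run(rng) for _ in range(1000)]
mean = sum(rewards) / len(rewards)
```

The sample mean should land near the expected utility of the start state, while individual runs spread widely; the histograms make that spread visible.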

Question 2: Compare these distributions with the expected utility of the initial state. Are the means similar to the expected utility, and to each other? How large a role does chance play? How spread out are the rewards in each distribution? Can you account for the shapes of the distributions?

3. Explore the impact of the discount rate γ. This is an extension of Exercise 17.8 in the textbook. Implement an MDP model of the environment shown in Figure 17.14(a) and described in Exercise 17.8. Use a (non-terminal) reward of r = +3 for the upper left square. Use your value iteration solver to find the optimal policy, given a value for the discount rate γ. Implement a binary search to identify (to five decimal places of accuracy) the five (5) threshold values of γ at which the optimal policy changes. Find the optimal policy for each of the six (6) intervals of γ values in 0 < γ < 1.
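The mechanism behind these thresholds is simple arithmetic: a reward r received k steps in the future contributes γ^k · r to the utility of the current state, so shrinking γ makes distant rewards less attractive. A tiny illustration (the step counts here are made up for intuition and are not the actual geometry or thresholds of problem 3):

```python
# Discounting intuition: present value of a reward r received k steps away.
def discounted(r, k, gamma):
    """gamma**k * r, the discounted contribution of a future reward."""
    return gamma ** k * r

# A +3 reward 4 steps away versus a +1 reward 1 step away (illustrative):
far_hi, near_hi = discounted(3.0, 4, 0.9), discounted(1.0, 1, 0.9)
far_lo, near_lo = discounted(3.0, 4, 0.5), discounted(1.0, 1, 0.5)
# With gamma = 0.9 the distant +3 dominates; with gamma = 0.5 the nearby
# +1 wins, which is the kind of flip that produces the policy thresholds.
```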

Question 3: Intuitively, what does the discount rate γ mean, and why does changing it change the solution in this problem?

## What to submit

A zipped directory named “A4-unique”, containing a solution pdf named unique-p4solution.pdf and a subdirectory named code. Replace “unique” with your own uniquename. For example, since my uniquename is rkruser, I would name the directory A4-rkruser and the pdf rkruser-p4solution.pdf.

The code subdirectory contains the source code for your program, which must compile and run in CAEN when the graders issue the following commands, after unzipping your file and changing to your code directory:

*** For C++ ***

```
g++ -std=c++11 -o main *.cpp
./main
```

*** For Python ***

```
python main.py
```

Your program should take no arguments and automatically generate all of the following text files in a subdirectory of code named generated.
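The output layout above can be set up with a few lines; the file names are from this handout, but the helper name and the values written below are placeholders, not real results.

```python
import os

# Create the required generated/ subdirectory and write one data file.
os.makedirs("generated", exist_ok=True)

def write_lines(name, values):
    """Write one value per line, as the histogram data files require."""
    with open(os.path.join("generated", name), "w") as f:
        f.write("\n".join(str(v) for v in values) + "\n")

write_lines("P2-data-10.txt", [0.68, 0.60, -1.04])  # placeholder rewards
```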


• For problem 1, generate a file P1-output.txt that contains the threshold values for R(s), each on its own line, starting with zero (0) and followed in descending order by the eight (8) negative threshold values you have found. Between each pair of values, display the 4×3 array describing the optimal policy for that interval. Put an asterisk (*) after any action that changed from the previous policy.

• For problem 2, generate the file P2-output.txt, which contains the utilities for each state, in a 4×3 grid like the one you used to print the optimal policies (print 0 for the utility of the obstacle at (2, 2)). Also generate 3 text files containing data for making the histograms, formatted as lists of numbers separated by newlines: P2-data-10.txt, P2-data-100.txt, and P2-data-1000.txt.

• For problem 3, generate a text file P3-output.txt with the threshold values for the discount rate γ, separating each pair of adjacent threshold values with a 3×3 text display of the optimal policy for that interval, and a 3×3 text display of the utilities of the states in that interval (computed with γ equal to the lower threshold defining the interval), accurate to 3 decimal places.

Your pdf should be formatted as follows:

• Page 1: Your name, and the answer to question 1.

• Page 2: The answer to question 2. Include in your answer the means and standard deviations of the histograms.

• Page 3: The answer to question 3.

• Page 4: The full output from P1-output.txt.

• Pages 5 and 6: The full output from P2-output.txt, and the three histograms you created. (If this only takes 1 page, leave page 6 blank.)

• Page 7: The full output from P3-output.txt.

The reason we ask for this format is that it is convenient to use with Gradescope. Please stick to the given page numbers. Also, label each different item on pages 4-7 with a figure number and a title (for example, the output of P1-output.txt could be labelled “Figure 1: Problem 1 Output”).

## Grading

• 10 points for correct files and directory structure, and for code compiling and running without error in CAEN.

• Problem 1: 20 points for correct R(s) thresholds and policies in P1-output.txt (about two points per threshold/policy). 10 points for answering question 1.

• Problem 2: 15 points for the histograms (5 for each) and 5 points for the correct table of utilities in P2-output.txt. 10 points for answering all parts of question 2.

• Problem 3: 20 points for the correct thresholds, policies, and utilities in P3-output.txt. 10 points for answering question 3.
