Assignment 4: Solving Markov Decision Processes




Rate this product

Assignment 4: Solving Markov Decision Processes
Artificial Intelligence

In this assignment, you will implement methods to solve a Markov Decision Process (MDP) for
an optimal policy.
In the textbook [AIMA 3e], Markov Decision Processes are defined in Section 17.1, and Section
17.2 describes the Value Iteration approach to solving an MDP. Read these two sections very
carefully. Section 17.3 describes the Policy Iteration approach, which is related, but is not used in
this assignment.
If you have access to Sutton and Barto’s Reinforcement Learning (1998), chapters 3 and 4
describe essentially the same material from a di↵erent perspective. Reading this would also be
helpful, and helps you understand how it fits into the reinforcement learning picture.
There will be no provided code: you must write and debug your own code for this assignment,
in either c++ or python. However if you need ideas, you may consult Java code from the AIMA
code repository (
1. Explore the impact of “cost of living” R(s). This is based on Exercise 17.5 in the textbook.
Implement an MDP model of the environment shown in Figure 17.1 (the discount factor is
1). Implement a solver that uses value iteration to find the optimal policy, given a value
for R(s) < 0. Implement a binary search to identify the eight negative threshold values for
R(s) at which the optimal policy changes (the ninth threshold is zero). In the output, print
only 4 decimal places of accuracy, i.e., your solution should be within ±0.0001 of the correct
value. Find the nine policies associated with the nine negative intervals between thresholds.
Note 1: Some of the values listed in Figure 17.2 and in the accompanying text
may be incorrect in the least significant decimal places.
Note 2: The states with +1 and -1 reward are terminal states. The utility of these
states is the same as their reward, and should not be updated with value iteration.
Note 3: Figure 17.4 contains pseudocode for value iteration. Since in our case, =
1, we can’t use their convergence test. Instead, use < 0.0000001 for convergence,
where is the maximum change in utility. Value iteration converges rapidly, and
this should take no more than 30 iterations.
As an example, here’s a display of the policy for R(s) = 0.04, laying the actions in a 4×3
array, corresponding to the states in the environment (use the character ‘V’ for a down action,
and make sure to have spaces between all the states in a row).
^ X ^ -1
Question 1 Why does the optimal policy for R(s) = 0.04 say to move left from state (3,
1), even though the utility for the state (3, 2) is higher than the state (2, 1) in Figure 17.3?
Show your calculations.
2. In the same environment as problem 1, explore the di↵erence between expected reward (average over many runs) and actual reward (on a particular run), given that the result of an
action is uncertain.
For the case of R(s) = 0.04, find and display the utilities for all the states to three decimal
places. Implement a Monte Carlo simulation in which an agent starts at state (4, 1) and
follows the optimal policy until it reaches a terminal state. Create three histograms of the
final rewards earned by the agent for 10, 100, and 1000 runs, and compute the means and
standard deviations of these runs. The final reward of a run is just the sum of the rewards of
the states the agent is in at each time step.
Question 2 Compare these distributions with the expected utility of the initial state. Are
the means similar to the expected utility, and to each other? How large a role does chance
play? How spread out are the rewards in each distribution? Can you account for the shapes
of the distributions?
3. Explore the impact of “discount rate” . This is an extension of Exercise 17.8 in the textbook.
Implement an MDP model of the environment shown in Figure 17.14(a) and described in
Exercise 17.8. Use a (non-terminal) reward of r = +3 for the upper left square. Use your value
iteration solver to find the optimal policy, given a value for the discount rate . Implement a
binary search to identify (out to five decimal places accuracy) five (5) threshold values for at
which the optimal policy changes. Find the optimal policy for each of these six (6) intervals
of values for 0  < 1.
Question 3 Intuitively, what does the discount rate mean, and why does changing it change
the solution in this problem?
What to submit:
A zipped directory named “A4-unique”, containing a solution pdf named unique-p4solution.pdf and
a subdirectory named code. Replace “unique” with your own uniquename. For example, since my
uniquename is rkruser, I would name the directory A4-rkruser and the pdf rkruser-p4solution.pdf.
The code subdirectory contains the source code for your program, which must compile and run
in CAEN when the graders issue the following commands, after unzipping your file and changing
to your code directory:
*** For C++***
g++ -std=c++11 -o main *.cpp
*** For python ***
Your program should take no arguments and automatically generate all of the following text files
in a subdirectory of code named generated.
• For problem 1, generate a file P1-output.txt that contains the threshold values for R(s), each
on its own line, starting with zero (0), and followed in descending order by the eight (8)
negative threshold values you have found. Between each pair of values, display the 4×3 array
describing the optimal policy for that interval. Put an asterisk (*) after any action that
changed from the previous policy.
• For problem 2, generate the file P2-output.txt, which contains the utilities for each state,
in a 3⇥4 grid like the one you used to print the optimal policies (print 0 for the utility of
the obstacle at (2,2)). Also generate 3 text files containing data for making the histograms,
formatted as lists of numbers separated by newlines: P2-data-10.txt, P2-data-100.txt, and
• For problem 3, generate a text file P3-output.txt with the threshold values for the discount
rate, , separating each pair of adjacent threshold values with a 3×3 text display of the optimal
policy for that interval, and a 3×3 text display of the utilities of the states in that interval
(computed with equal to the lower threshold defining the interval), accurate to 3 decimal
Your pdf should be formatted as follows:
• Page 1: Your name, and the answer to question 1.
• Page 2: The answer to question 2. Include in your answer the means and standard deviations
of the histograms.
• Page 3: The answer to question 3.
• Page 4: The full output from P1-output.txt.
• Pages 5 and 6: The full output from P2-output.txt, and the three histograms you created. (If
this only takes 1 page, leave page 6 blank).
• Pages 7: The full output from P3-output.txt.
The reason we ask for this format is that it is convenient to use with gradescope. Please stick
to the given page numbers. Also, label each di↵erent item on pages 4-7 with a figure number and
a title (for example, the output to P1-output.txt could be labelled “Figure 1: Problem 1 Output”).
• 10 points for correct files and directory structure, and for code compiling and running without
error in CAEN.
• Problem 1: 20 points for correct R(s) thresholds and policies in P1-output.txt (about two
points per threshold/policy). 10 points for answering question 1.
• Problem 2: 15 points for the histograms (5 for each) and 5 points for the correct table of
utilities in P2-output.txt. 10 points for answering all parts of question 2.
• Problem 3: 20 points for the correct thresholds, policies, and utilities in P3-output.txt. 10
points for answering question 3.