Assignment #2 Implement deep Q learning




5/5 - (2 votes)

Assignment #2

In this assignment we will implement deep Q learning, following DeepMind’s paper ([mnih2015human]
and [mnih-atari-2013]) that learns to play Atari from raw pixels. The purpose is to understand the effectiveness of deep neural network as well as some of the techniques used in practice to stabilize training and
achieve better performance. You’ll also have to get comfortable with Tensorflow. We will train our networks
on the Pong-v0 environment from OpenAI gym, but the code can easily be applied to any other environment.
In Pong, one player wins if the ball passes by the other player. Winning a game gives a reward of 1, while
losing gives a negative reward of -1. An episode is over when one of the two players reaches 21 wins. Thus,
the final score is between -21 (lost episode) or +21 (won episode). Our agent plays against a decent hardcoded AI player. Average human performance is −3 (reported in [mnih-atari-2013]). If you go to the end
of the homework successfully, you will train an AI agent with super-human performance, reaching at least
+10 (hopefully more!).
1 Test Environment (5 pts)
Before running our code on Pong, it is crucial to test our code on a test environment. You should be able
to run your models on CPU in no more than a few minutes on the following environment:
• 4 states: 0, 1, 2, 3
• 5 actions: 0, 1, 2, 3, 4. Action 0 ≤ i ≤ 3 goes to state i, while action 4 makes the agent stay in the same
• Rewards: Going to state i from states 0, 1, and 3 gives a reward R(i), where R(0) = 0.1, R(1) =
−0.2, R(2) = 0, R(3) = −0.1. If we start in state 2, then the rewards defind above are multiplied by
−10. See Table 1 for the full transition and reward structure.
CS 234 Winter 2019: Assignment #2
• One episode lasts 5 time steps (for a total of 5 actions) and always starts in state 0 (no rewards at the
initial state).
State (s) Action (a) Next State (s’) Reward (R)
0 0 0 0.1
0 1 1 -0.2
0 2 2 0.0
0 3 3 -0.1
0 4 0 0.1
1 0 0 0.1
1 1 1 -0.2
1 2 2 0.0
1 3 3 -0.1
1 4 1 -0.2
2 0 0 -1.0
2 1 1 2.0
2 2 2 0.0
2 3 3 1.0
2 4 2 0.0
3 0 0 0.1
3 1 1 -0.2
3 2 2 0.0
3 3 3 -0.1
3 4 3 -0.1
Table 1: Transition table for the Test Environment
An example of a path (or an episode) in the test environment is shown in Figure 1, and the trajectory can be
represented in terms of st, at, Rt as: s0 = 0, a0 = 1, R0 = −0.2, s1 = 1, a1 = 2, R1 = 0, s2 = 2, a2 = 4, R2 =
0, s3 = 2, a3 = 3, R3 = (−0.1) ∗ (−10) = 1, s4 = 3, a4 = 0, R4 = 0.1, s5 = 0.
Figure 1: Example of a path in the Test Environment
1. (written 5pts) What is the maximum sum of rewards that can be achieved in a single episode in the
test environment, assuming γ = 1?
Page 2 of 6
CS 234 Winter 2019: Assignment #2
2 Q-learning (12 pts)
Tabular setting In the tabular setting, we maintain a table Q(s, a) for each tuple state-action. Given an
experience sample (s, a, r, s0
), our update rule is
Q(s, a) = Q(s, a) + α
r + γ max
Q (s
, a0
) − Q (s, a)
, (1)
where α ∈ R is the learning rate, γ the discount factor.
Approximation setting Due to the scale of Atari environments, we cannot reasonably learn and store a Q
value for each state-action tuple. We will instead represent our Q values as a function ˆq(s, a, w) where w are
parameters of the function (typically a neural network’s weights and bias parameters). In this approximation
setting, our update rule becomes
w = w + α
r + γ max
, a0
, w) − qˆ(s, a, w)
∇wqˆ(s, a, w). (2)
In other words, we are try to minimize
L(w) = Es,a,r,s0
r + γ max
, a0
, w) − qˆ(s, a, w)
Target Network DeepMind’s paper [mnih2015human] [mnih-atari-2013] maintains two sets of parameters, w (to compute ˆq(s, a)) and w− (target network, to compute ˆq(s
, a0
)) such that our update rule
w = w + α
r + γ max

, a0
, w−
− qˆ(s, a, w)
∇wqˆ(s, a, w). (4)
The target network’s parameters are updated with the Q-network’s parameters occasionally and are kept
fixed between individual updates. Note that when computing the update, we don’t compute gradients with
respect to w− (these are considered fixed weights).
Replay Memory As we play, we store our transitions (s, a, r, s0
) in a buffer. Old examples are deleted as
we store new transitions. To update our parameters, we sample a minibatch from the buffer and perform a
stochastic gradient descent update.
?-Greedy Exploration Strategy During training, we use an ?-greedy strategy. DeepMind’s paper
[mnih2015human] [mnih-atari-2013] decreases ? from 1 to 0.1 during the first million steps. At test
time, the agent choses a random action with probability ?sof t = 0.05.
There are several things to be noted:
• In this assignment, we will update w every learning freq steps by using a minibatch of experiences sampled from the replay buffer.
• DeepMind’s deep Q network takes as input the state s and outputs a vector of size = number of actions.
In the Pong environment, we have 6 actions, thus ˆq(s, w) ∈ R
• The input of the deep Q network is the concatenation 4 consecutive steps, which results in an input
after preprocessing of shape (80 × 80 × 4).
We will now examine these assumptions and implement the epsilon-greedy strategy.
1. (written 3pts) What is one benefit of using experience replay?
Page 3 of
CS 234 Winter 2019: Assignment #2
2. (written 3pts) What is one benefit of the target network?
3. (written 3pts) What is one benefit of representing the Q function as ˆq(s, w) ∈ R
4. (coding 3pts) Implement the get action and update functions in q1 Test your
implementation by running python q1
3 Linear Approximation (26 pts)
1. (written 3pts) Show that Equations (1) and (2) from section 2 above are exactly the same when
qˆ(s, a, w) = wT x(s, a), where w ∈ R
|S||A| and x : S × A → R
such that
x(s, a)s
,a0 =
1 if s
0 = s, a0 = a
0 otherwise
for all (s, a) ∈ S×A, x(s, a) is a vector of length |S||A| where the element corresponding to s
0 ∈ S, a0 ∈ A
is 1 when s
0 = s, a0 = a and is 0 otherwise.
2. (written 3pts) Derive the gradient with regard to the value function parameter w ∈ R
n given
qˆ(s, a, w) = wT x(s, a) for any function x(s, a) 7→ x ∈ R
n and write the update rule for w.
3. (coding 15pts) We will now implement linear approximation in Tensorflow. This question will setup
the whole pipeline for the remiander of the assignment. You’ll need to implement the following functions
in q2 (pleasd read throughq2 :
• add placeholders op
• get q values op
• add update target op
• add loss op
• add optimizer op
Test your code by running python q2 locally on CPU. This will run linear approximation with Tensorflow on the test environment. Running this implementation should only take a
minute or two.
4. (written 5pts) Do you reach the optimal achievable reward on the test environment? Attach the plot
scores.png from the directory results/q2 linear to your writeup.
4 Implementing DeepMind’s DQN (15 pts)
1. (coding 10pts) Implement the deep Q-network as described in [mnih2015human] by implementing
get q values op in q3 The rest of the code inherits from what you wrote for linear
approximation. Test your implementation locally on CPU on the test environment by running
python q3 Running this implementation should only take a minute or two.
2. (written 5pts) Attach the plot of scores, scores.png, from the directory results/q3 nature to
your writeup. Compare this model with linear approximation. How do the final performances compare?
How about the training time?
Page 4 of 6
CS 234 Winter 2019: Assignment #2
5 DQN on Atari (27 pts)
The Atari environment from OpenAI gym returns observations (or original frames) of size (210×160×3), the
last dimension corresponds to the RGB channels filled with values between 0 and 255 (uint8). Following
DeepMind’s paper [mnih2015human], we will apply some preprocessing to the observations:
• Single frame encoding: To encode a single frame, we take the maximum value for each pixel color
value over the frame being encoded and the previous frame. In other words, we return a pixel-wise
max-pooling of the last 2 observations.
• Dimensionality reduction: Convert the encoded frame to grey scale, and rescale it to (80 × 80 × 1).
(See Figure 2)
The above preprocessing is applied to the 4 most recent observations and these encoded frames are stacked
together to produce the input (of shape (80×80×4)) to the Q-function. Also, for each time we decide on an
action, we perform that action for 4 time steps. This reduces the frequency of decisions without impacting
the performance too much and enables us to play 4 times as many games while training. You can refer to
the Methods Section of [mnih2015human] for more details.
(a) Original input (210 × 160 × 3) with RGB colors (b) After preprocessing in grey scale of shape (80 × 80 × 1)
Figure 2: Pong-v0 environment
1. (written 2pts) Why do we use the last 4 time steps as input to the network for playing Atari games?
2. (written 5pts) What’s the number of parameters of the DQN model (for Pong) if the input to the
Q-network is a tensor of shape (80, 80, 4) and we use ”SAME” padding? How many parameters are
required for the linear Q-network, assuming the input is still of shape (80, 80, 4)? How do the number
of parameters compare between the two models?
3. (coding and written 5pts). Now, we’re ready to train on the Atari Pong-v0 environment. First,
launch linear approximation on pong with python q4 train atari on Azure’s GPU.
This will train the model for 500,000 steps and should take approximately an hour. What do you notice
about the performance?
4. (coding and written 10 pts). In this question, we’ll train the agent with DeepMind’s architecture on
the Atari Pong-v0 environment. Run python q5 train atari on Azure’s GPU.
This will train the model for 5 million steps and should take around 12 hours. Attach the plot
scores.png from the directory results/q5 train atari nature to your writeup. You should
get a score of around 13-15 after 5 million time steps. As stated previously, the Deepmind paper claims
average human performance is −3.
As the training time is roughly 12 hours, you may want to check after a few epochs that your network
is making progress. The following are some training tips:
• If you terminate your terminal session, the training will stop. In order to avoid this, you should
use screen to run your training in the background.
Page 5 of 6
CS 234 Winter 2019: Assignment #2
• The evaluation score printed on terminal should start at -21 and increase.
• The max of the q values should also be increasing
• The standard deviation of q shouldn’t be too small. Otherwise it means that all states have
similar q values
• You may want to use Tensorboard to track the history of the printed metrics. You can monitor
your training with Tensorboard by typing the command tensorboard –logdir=results
and then connecting to ip-of-you-machine:6006. Below are our Tensorboard graphs from
one training session:
5. (written 5pts) Compare the performance of the DeepMind architecure with the linear Q-network
approximation. How can you explain the gap in performance?
6 Real world RL with neural networks (10 pts)
Given a stream of batches of n environment interactions (si
, ai
, ri
, s0
) we want to learn the optimal value
function using a neural network. The underlying MDP has a finite sized action space.
1. (written 4pts) Your friend first suggests the following approach
(a) Initialize parameters φ of neural network Vφ
(b) For each batch of k tuples (si
, ai
, ri
, s0
) do Stochastic Gradient Descent with loss function Pk
i=0 |yi−
2 where yi = maxai
[ri + γVφ(s
What is the problem with this approach? (Hint: Think about the type of data we have)
2. (written 3pts) Your friend now suggests the following
(a) Initialize parameters φ of neural network for state-action value function Qφ(s, a)
(b) For each batch of k tuples (si
, ai
, ri
, s0
) do Stochastic Gradient Descent with loss function Pk
i=0 |yi−
, ai)|
2 where yi = ri + γV (s
Now as we just have the network Qφ(s, a) how would you determine V (s) needed for the above training
3. (written 3pts) Is the above method of learning the Q network guaranteed to give us an approximation
of the optimal state action value function?
Page 6 of 6