## Description

Assignment #2

Introduction

In this assignment we will implement deep Q learning, following DeepMind’s paper ([mnih2015human]

and [mnih-atari-2013]) that learns to play Atari from raw pixels. The purpose is to understand the effectiveness of deep neural network as well as some of the techniques used in practice to stabilize training and

achieve better performance. You’ll also have to get comfortable with Tensorflow. We will train our networks

on the Pong-v0 environment from OpenAI gym, but the code can easily be applied to any other environment.

In Pong, one player wins if the ball passes by the other player. Winning a game gives a reward of 1, while

losing gives a negative reward of -1. An episode is over when one of the two players reaches 21 wins. Thus,

the final score is between -21 (lost episode) or +21 (won episode). Our agent plays against a decent hardcoded AI player. Average human performance is −3 (reported in [mnih-atari-2013]). If you go to the end

of the homework successfully, you will train an AI agent with super-human performance, reaching at least

+10 (hopefully more!).

1 Test Environment (5 pts)

Before running our code on Pong, it is crucial to test our code on a test environment. You should be able

to run your models on CPU in no more than a few minutes on the following environment:

• 4 states: 0, 1, 2, 3

• 5 actions: 0, 1, 2, 3, 4. Action 0 ≤ i ≤ 3 goes to state i, while action 4 makes the agent stay in the same

state.

• Rewards: Going to state i from states 0, 1, and 3 gives a reward R(i), where R(0) = 0.1, R(1) =

−0.2, R(2) = 0, R(3) = −0.1. If we start in state 2, then the rewards defind above are multiplied by

−10. See Table 1 for the full transition and reward structure.

1

CS 234 Winter 2019: Assignment #2

• One episode lasts 5 time steps (for a total of 5 actions) and always starts in state 0 (no rewards at the

initial state).

State (s) Action (a) Next State (s’) Reward (R)

0 0 0 0.1

0 1 1 -0.2

0 2 2 0.0

0 3 3 -0.1

0 4 0 0.1

1 0 0 0.1

1 1 1 -0.2

1 2 2 0.0

1 3 3 -0.1

1 4 1 -0.2

2 0 0 -1.0

2 1 1 2.0

2 2 2 0.0

2 3 3 1.0

2 4 2 0.0

3 0 0 0.1

3 1 1 -0.2

3 2 2 0.0

3 3 3 -0.1

3 4 3 -0.1

Table 1: Transition table for the Test Environment

An example of a path (or an episode) in the test environment is shown in Figure 1, and the trajectory can be

represented in terms of st, at, Rt as: s0 = 0, a0 = 1, R0 = −0.2, s1 = 1, a1 = 2, R1 = 0, s2 = 2, a2 = 4, R2 =

0, s3 = 2, a3 = 3, R3 = (−0.1) ∗ (−10) = 1, s4 = 3, a4 = 0, R4 = 0.1, s5 = 0.

Figure 1: Example of a path in the Test Environment

1. (written 5pts) What is the maximum sum of rewards that can be achieved in a single episode in the

test environment, assuming γ = 1?

Page 2 of 6

CS 234 Winter 2019: Assignment #2

2 Q-learning (12 pts)

Tabular setting In the tabular setting, we maintain a table Q(s, a) for each tuple state-action. Given an

experience sample (s, a, r, s0

), our update rule is

Q(s, a) = Q(s, a) + α

?

r + γ max

a0∈A

Q (s

0

, a0

) − Q (s, a)

?

, (1)

where α ∈ R is the learning rate, γ the discount factor.

Approximation setting Due to the scale of Atari environments, we cannot reasonably learn and store a Q

value for each state-action tuple. We will instead represent our Q values as a function ˆq(s, a, w) where w are

parameters of the function (typically a neural network’s weights and bias parameters). In this approximation

setting, our update rule becomes

w = w + α

?

r + γ max

a0∈A

qˆ(s

0

, a0

, w) − qˆ(s, a, w)

?

∇wqˆ(s, a, w). (2)

In other words, we are try to minimize

L(w) = Es,a,r,s0

?

r + γ max

a0∈A

qˆ(s

0

, a0

, w) − qˆ(s, a, w)

?2

(3)

Target Network DeepMind’s paper [mnih2015human] [mnih-atari-2013] maintains two sets of parameters, w (to compute ˆq(s, a)) and w− (target network, to compute ˆq(s

0

, a0

)) such that our update rule

becomes

w = w + α

?

r + γ max

a0∈A

qˆ

s

0

, a0

, w−

?

− qˆ(s, a, w)

?

∇wqˆ(s, a, w). (4)

The target network’s parameters are updated with the Q-network’s parameters occasionally and are kept

fixed between individual updates. Note that when computing the update, we don’t compute gradients with

respect to w− (these are considered fixed weights).

Replay Memory As we play, we store our transitions (s, a, r, s0

) in a buffer. Old examples are deleted as

we store new transitions. To update our parameters, we sample a minibatch from the buffer and perform a

stochastic gradient descent update.

?-Greedy Exploration Strategy During training, we use an ?-greedy strategy. DeepMind’s paper

[mnih2015human] [mnih-atari-2013] decreases ? from 1 to 0.1 during the first million steps. At test

time, the agent choses a random action with probability ?sof t = 0.05.

There are several things to be noted:

• In this assignment, we will update w every learning freq steps by using a minibatch of experiences sampled from the replay buffer.

• DeepMind’s deep Q network takes as input the state s and outputs a vector of size = number of actions.

In the Pong environment, we have 6 actions, thus ˆq(s, w) ∈ R

6

.

• The input of the deep Q network is the concatenation 4 consecutive steps, which results in an input

after preprocessing of shape (80 × 80 × 4).

We will now examine these assumptions and implement the epsilon-greedy strategy.

1. (written 3pts) What is one benefit of using experience replay?

Page 3 of

CS 234 Winter 2019: Assignment #2

2. (written 3pts) What is one benefit of the target network?

3. (written 3pts) What is one benefit of representing the Q function as ˆq(s, w) ∈ R

K

4. (coding 3pts) Implement the get action and update functions in q1 schedule.py. Test your

implementation by running python q1 schedule.py.

3 Linear Approximation (26 pts)

1. (written 3pts) Show that Equations (1) and (2) from section 2 above are exactly the same when

qˆ(s, a, w) = wT x(s, a), where w ∈ R

|S||A| and x : S × A → R

|S||A|

such that

x(s, a)s

0

,a0 =

?

1 if s

0 = s, a0 = a

0 otherwise

for all (s, a) ∈ S×A, x(s, a) is a vector of length |S||A| where the element corresponding to s

0 ∈ S, a0 ∈ A

is 1 when s

0 = s, a0 = a and is 0 otherwise.

2. (written 3pts) Derive the gradient with regard to the value function parameter w ∈ R

n given

qˆ(s, a, w) = wT x(s, a) for any function x(s, a) 7→ x ∈ R

n and write the update rule for w.

3. (coding 15pts) We will now implement linear approximation in Tensorflow. This question will setup

the whole pipeline for the remiander of the assignment. You’ll need to implement the following functions

in q2 linear.py (pleasd read throughq2 linear.py) :

• add placeholders op

• get q values op

• add update target op

• add loss op

• add optimizer op

Test your code by running python q2 linear.py locally on CPU. This will run linear approximation with Tensorflow on the test environment. Running this implementation should only take a

minute or two.

4. (written 5pts) Do you reach the optimal achievable reward on the test environment? Attach the plot

scores.png from the directory results/q2 linear to your writeup.

4 Implementing DeepMind’s DQN (15 pts)

1. (coding 10pts) Implement the deep Q-network as described in [mnih2015human] by implementing

get q values op in q3 nature.py. The rest of the code inherits from what you wrote for linear

approximation. Test your implementation locally on CPU on the test environment by running

python q3 nature.py. Running this implementation should only take a minute or two.

2. (written 5pts) Attach the plot of scores, scores.png, from the directory results/q3 nature to

your writeup. Compare this model with linear approximation. How do the final performances compare?

How about the training time?

Page 4 of 6

CS 234 Winter 2019: Assignment #2

5 DQN on Atari (27 pts)

The Atari environment from OpenAI gym returns observations (or original frames) of size (210×160×3), the

last dimension corresponds to the RGB channels filled with values between 0 and 255 (uint8). Following

DeepMind’s paper [mnih2015human], we will apply some preprocessing to the observations:

• Single frame encoding: To encode a single frame, we take the maximum value for each pixel color

value over the frame being encoded and the previous frame. In other words, we return a pixel-wise

max-pooling of the last 2 observations.

• Dimensionality reduction: Convert the encoded frame to grey scale, and rescale it to (80 × 80 × 1).

(See Figure 2)

The above preprocessing is applied to the 4 most recent observations and these encoded frames are stacked

together to produce the input (of shape (80×80×4)) to the Q-function. Also, for each time we decide on an

action, we perform that action for 4 time steps. This reduces the frequency of decisions without impacting

the performance too much and enables us to play 4 times as many games while training. You can refer to

the Methods Section of [mnih2015human] for more details.

(a) Original input (210 × 160 × 3) with RGB colors (b) After preprocessing in grey scale of shape (80 × 80 × 1)

Figure 2: Pong-v0 environment

1. (written 2pts) Why do we use the last 4 time steps as input to the network for playing Atari games?

2. (written 5pts) What’s the number of parameters of the DQN model (for Pong) if the input to the

Q-network is a tensor of shape (80, 80, 4) and we use ”SAME” padding? How many parameters are

required for the linear Q-network, assuming the input is still of shape (80, 80, 4)? How do the number

of parameters compare between the two models?

3. (coding and written 5pts). Now, we’re ready to train on the Atari Pong-v0 environment. First,

launch linear approximation on pong with python q4 train atari linear.py on Azure’s GPU.

This will train the model for 500,000 steps and should take approximately an hour. What do you notice

about the performance?

4. (coding and written 10 pts). In this question, we’ll train the agent with DeepMind’s architecture on

the Atari Pong-v0 environment. Run python q5 train atari nature.py on Azure’s GPU.

This will train the model for 5 million steps and should take around 12 hours. Attach the plot

scores.png from the directory results/q5 train atari nature to your writeup. You should

get a score of around 13-15 after 5 million time steps. As stated previously, the Deepmind paper claims

average human performance is −3.

As the training time is roughly 12 hours, you may want to check after a few epochs that your network

is making progress. The following are some training tips:

• If you terminate your terminal session, the training will stop. In order to avoid this, you should

use screen to run your training in the background.

Page 5 of 6

CS 234 Winter 2019: Assignment #2

• The evaluation score printed on terminal should start at -21 and increase.

• The max of the q values should also be increasing

• The standard deviation of q shouldn’t be too small. Otherwise it means that all states have

similar q values

• You may want to use Tensorboard to track the history of the printed metrics. You can monitor

your training with Tensorboard by typing the command tensorboard –logdir=results

and then connecting to ip-of-you-machine:6006. Below are our Tensorboard graphs from

one training session:

5. (written 5pts) Compare the performance of the DeepMind architecure with the linear Q-network

approximation. How can you explain the gap in performance?

6 Real world RL with neural networks (10 pts)

Given a stream of batches of n environment interactions (si

, ai

, ri

, s0

i

) we want to learn the optimal value

function using a neural network. The underlying MDP has a finite sized action space.

1. (written 4pts) Your friend first suggests the following approach

(a) Initialize parameters φ of neural network Vφ

(b) For each batch of k tuples (si

, ai

, ri

, s0

i

) do Stochastic Gradient Descent with loss function Pk

i=0 |yi−

Vφ(si)|

2 where yi = maxai

[ri + γVφ(s

0

i

)]

What is the problem with this approach? (Hint: Think about the type of data we have)

2. (written 3pts) Your friend now suggests the following

(a) Initialize parameters φ of neural network for state-action value function Qφ(s, a)

(b) For each batch of k tuples (si

, ai

, ri

, s0

i

) do Stochastic Gradient Descent with loss function Pk

i=0 |yi−

Qφ(si

, ai)|

2 where yi = ri + γV (s

0

i

)

Now as we just have the network Qφ(s, a) how would you determine V (s) needed for the above training

procedure?

3. (written 3pts) Is the above method of learning the Q network guaranteed to give us an approximation

of the optimal state action value function?

Page 6 of 6