## Description

Reinforcement Learning Assignment 4

1 Introduction

The goal of this assignment is to experiment with the deep Q-network (DQN), which combines the advantages of Q-learning and neural networks. In classical Q-learning methods, the action-value function Q becomes intractable as the state space and action space grow. DQN brings in the success of deep learning and has achieved a super-human level of play in Atari games. Your goal is to implement the DQN algorithm and one of its improved variants, and to evaluate them in some classical RL control scenarios.

2 Deep Q-learning

Sometimes a nonlinear function approximator is used instead, such as a neural network. We refer to a neural network function approximator with weights θ as a Q-network. A Q-network can be trained by adjusting the parameters θ_i at iteration i to reduce the mean-squared error in the Bellman equation, where the optimal target values r + γ max_{a′} Q*(s′, a′) are substituted with approximate target values

$$y = r + \gamma \max_{a'} Q(s', a'; \theta_i^-),$$

using parameters θ_i^- from some previous iteration. This leads to a sequence of loss functions L_i(θ_i) that changes at each iteration i,

$$L_i(\theta_i) = \mathbb{E}_{s,a,r}\!\left[\left(\mathbb{E}_{s'}[y \mid s, a] - Q(s, a; \theta_i)\right)^2\right] = \mathbb{E}_{s,a,r,s'}\!\left[\left(y - Q(s, a; \theta_i)\right)^2\right] + \mathbb{E}_{s,a,r}\!\left[\operatorname{Var}_{s'}[y]\right].$$

Note that the targets depend on the network weights; this is in contrast with the targets used for supervised learning, which are fixed before learning begins. At each stage of optimization, we hold the parameters from the previous iteration θ_i^- fixed when optimizing the ith loss function L_i(θ_i), resulting in a sequence of well-defined optimization problems. The final term is the variance of the targets, which does not depend on the parameters θ_i that we are currently optimizing, and may therefore be ignored. Differentiating the loss function with respect to the weights we arrive at the following gradient:

$$\nabla_{\theta_i} L(\theta_i) = \mathbb{E}_{s,a,r,s'}\!\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right)\nabla_{\theta_i} Q(s, a; \theta_i)\right].$$

Rather than computing the full expectations in the above gradient, it is often

computationally expedient to optimize the loss function by stochastic gradient

descent. The familiar Q-learning algorithm [19] can be recovered in this framework by updating the weights after every time step, replacing the expectations with single samples, and setting θ_i^- = θ_{i−1}.
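As an illustration of this single-sample update (a minimal sketch, not the paper's code: the linear Q-function, feature vectors, and step sizes below are assumptions made for the example), the sampled gradient step looks like:

```python
def q_value(theta, s, a):
    # Linear function approximator standing in for a network:
    # Q(s, a; theta) = theta[a] . s
    return sum(w * x for w, x in zip(theta[a], s))

def sgd_step(theta, theta_minus, s, a, r, s_next, done,
             gamma=0.99, alpha=0.01):
    """One sampled Q-learning update toward the target
    y = r + gamma * max_a' Q(s', a'; theta_minus)."""
    n_actions = len(theta_minus)
    if done:
        y = r  # terminal transitions bootstrap nothing
    else:
        y = r + gamma * max(q_value(theta_minus, s_next, a2)
                            for a2 in range(n_actions))
    td_error = y - q_value(theta, s, a)
    # For a linear Q, the gradient of Q(s, a; theta) in row theta[a] is s itself
    theta[a] = [w + alpha * td_error * x for w, x in zip(theta[a], s)]
    return td_error
```

Refreshing theta_minus from theta after every step recovers plain Q-learning, as noted above.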

Note that this algorithm is model-free: it solves the reinforcement learning task directly using samples from the emulator, without explicitly estimating the reward and transition dynamics P(r, s′ | s, a). It is also off-policy: it learns about the greedy policy a = argmax_{a′} Q(s, a′; θ), while following a behaviour distribution that ensures adequate exploration of the state space. In practice, the behaviour distribution is often selected by an ε-greedy policy that follows the greedy policy with probability 1 − ε and selects a random action with probability ε.
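The ε-greedy behaviour policy is only a few lines of code; this sketch assumes a discrete action set and takes the per-action Q estimates from the caller:

```python
import random

def epsilon_greedy(q_values, n_actions, epsilon):
    """Select a random action with probability epsilon, otherwise the greedy one.
    q_values: list of Q(s, a) estimates for the current state, one per action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    # argmax over actions; ties broken by the first maximal index
    return max(range(n_actions), key=lambda a: q_values[a])
```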

Training algorithm for deep Q-networks. The full algorithm for training deep

Q-networks is presented in Algorithm 1. The agent selects and executes actions

according to an e-greedy policy based on Q. Because using histories of arbitrary

length as inputs to a neural network can be difficult, our Q-function instead works

on a fixed-length representation of histories produced by the function φ described above. The algorithm modifies standard online Q-learning in two ways to make it

suitable for training large neural networks without diverging.

First, we use a technique known as experience replay [23] in which we store the agent's experiences at each time-step, e_t = (s_t, a_t, r_t, s_{t+1}), in a data set D_t = {e_1, …, e_t}, pooled over many episodes (where the end of an episode occurs when a terminal state is reached) into a replay memory. During the inner loop of the algorithm, we apply Q-learning updates, or minibatch updates, to samples of experience, (s, a, r, s′) ∼ U(D), drawn at random from the pool of stored samples.

This approach has several advantages over standard online Q-learning. First, each step of experience is potentially used in many weight updates, which allows for greater data efficiency. Second, learning directly from consecutive samples is inefficient, owing to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on. For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch. It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or even diverge catastrophically [20]. By using experience replay the behaviour distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters. Note that when learning by experience replay, it is necessary to learn off-policy (because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.

In practice, our algorithm only stores the last N experience tuples in the replay memory, and samples uniformly at random from D when performing updates. This approach is in some respects limited because the memory buffer does not differentiate important transitions and always overwrites with recent transitions owing to the finite memory size N. Similarly, the uniform sampling gives equal importance to all transitions in the replay memory. A more sophisticated sampling strategy might emphasize transitions from which we can learn the most, similar to prioritized sweeping [30].
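A replay memory holding the last N transitions with uniform sampling can be sketched with standard-library containers (the transition tuple layout here is an illustrative choice, not a fixed format):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores the last `capacity` transitions; samples minibatches uniformly."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are overwritten

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within one minibatch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```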

The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the targets y_j in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network Q̂ and use Q̂ for generating the Q-learning targets y_j for the following C updates to Q. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a) for all a and hence also increases the target y_j, possibly leading to oscillations or divergence of the policy. Generating the targets using an older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets y_j, making divergence or oscillations much more unlikely.
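The periodic clone of Q into Q̂ is just a parameter copy every C updates; sketched here for parameters kept as plain nested lists (a framework implementation would copy network weights instead, e.g. via PyTorch's load_state_dict):

```python
import copy

def maybe_sync_target(theta, theta_minus, step, C):
    """Every C gradient updates, refresh the target parameters theta_minus
    with an independent copy of the online parameters theta."""
    if step % C == 0:
        theta_minus = copy.deepcopy(theta)
    return theta_minus
```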

We also found it helpful to clip the error term from the update, r + γ max_{a′} Q(s′, a′; θ_i^-) − Q(s, a; θ_i), to be between −1 and 1. Because the absolute value loss function |x| has a derivative of −1 for all negative values of x and a derivative of 1 for all positive values of x, clipping the squared error to be between −1 and 1 corresponds to using an absolute value loss function for errors outside of the (−1, 1) interval. This form of error clipping further improved the stability of the algorithm.
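Clipping the error term is a one-liner; for a squared loss it is equivalent to switching to an absolute-value loss outside (−1, 1), as described above:

```python
def clipped_td_error(y, q):
    """Clip the error term y - Q(s, a; theta) to [-1, 1]; outside that interval
    the squared loss behaves like an absolute-value loss."""
    error = y - q
    return max(-1.0, min(1.0, error))
```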

Algorithm 1: deep Q-learning with experience replay.

Initialize replay memory D to capacity N
Initialize action-value function Q with random weights θ
Initialize target action-value function Q̂ with weights θ⁻ = θ
For episode = 1, M do
    Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
    For t = 1, T do
        With probability ε select a random action a_t,
        otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
        Execute action a_t in the emulator and observe reward r_t and image x_{t+1}
        Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
        Sample random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
        Set y_j = r_j if the episode terminates at step j+1,
        otherwise y_j = r_j + γ max_{a′} Q̂(φ_{j+1}, a′; θ⁻)
        Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))² with respect to the network parameters θ
        Every C steps reset Q̂ = Q
    End For
End For


Figure 1: Deep Q-learning with experience replay

You can refer to the original paper for the details of DQN: Mnih et al., "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529.
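As a rough skeleton of Algorithm 1 above (an illustrative sketch only: a toy stand-in environment and a linear Q-function take the place of the emulator and the deep network, and every hyperparameter value is a placeholder):

```python
import copy
import random
from collections import deque

class ToyEnv:
    """Trivial stand-in environment: the state counts up to 3, reward 1 at the end."""
    def reset(self):
        self.t = 0
        return [float(self.t), 1.0]

    def step(self, action):
        self.t += 1
        done = self.t >= 3
        reward = 1.0 if done else 0.0
        return [float(self.t), 1.0], reward, done

def q(theta, s, a):
    # Linear Q-function standing in for the deep network
    return sum(w * x for w, x in zip(theta[a], s))

def train(episodes=5, n_actions=2, gamma=0.99, alpha=0.01,
          epsilon=0.1, capacity=100, batch_size=4, C=10):
    env = ToyEnv()
    theta = [[0.0, 0.0] for _ in range(n_actions)]   # online network Q
    theta_minus = copy.deepcopy(theta)               # target network Q-hat
    memory = deque(maxlen=capacity)                  # replay memory D
    step = 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: q(theta, s, i))
            s_next, r, done = env.step(a)
            memory.append((s, a, r, s_next, done))
            if len(memory) >= batch_size:
                # minibatch Q-learning update on uniformly sampled transitions
                for sj, aj, rj, sj1, dj in random.sample(memory, batch_size):
                    y = rj if dj else rj + gamma * max(
                        q(theta_minus, sj1, i) for i in range(n_actions))
                    err = max(-1.0, min(1.0, y - q(theta, sj, aj)))  # error clipping
                    theta[aj] = [w + alpha * err * x
                                 for w, x in zip(theta[aj], sj)]
            step += 1
            if step % C == 0:
                theta_minus = copy.deepcopy(theta)   # refresh target network
            s = s_next
    return theta
```

For the assignment itself, the linear q() would be replaced by a TensorFlow or PyTorch network and the list updates by an optimizer step; the control flow stays the same.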

3 Experiment Description

• Programming language: python3

• You should compare the performance of DQN and one of its improved variants, and test them in a classical RL control environment, MountainCar. OpenAI Gym provides this environment, which is implemented in Python (https://gym.openai.com/envs/MountainCar-v0/). Gym also provides other, more complex environments such as Atari games and MuJoCo. Since the state is abstracted into the car's position and velocity, convolutional layers are not necessary in our experiment. You can get started with OpenAI Gym by referring to this link (https://gym.openai.com/docs/). Note that it is suggested to implement your neural network in TensorFlow or PyTorch.
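The Gym control loop reduces to reset() and step(). The sketch below imitates that interface with a stub environment so it runs without Gym installed; with Gym you would substitute gym.make('MountainCar-v0') for StubMountainCar() (classic Gym API, where step returns (observation, reward, done, info)):

```python
import random

class StubMountainCar:
    """Stand-in with the classic Gym interface: reset() -> observation,
    step(action) -> (observation, reward, done, info). Not the real dynamics."""
    def reset(self):
        self.steps = 0
        return (-0.5, 0.0)  # (position, velocity), roughly the valley bottom

    def step(self, action):
        self.steps += 1
        done = self.steps >= 200            # MountainCar-v0 caps episodes at 200 steps
        return (-0.5, 0.0), -1.0, done, {}  # reward is -1 per step until the goal

def run_episode(env, policy):
    """Roll out one episode, returning the total (undiscounted) reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total
```

MountainCar-v0 exposes three discrete actions (push left, no push, push right), so a random policy such as `run_episode(StubMountainCar(), lambda obs: random.randrange(3))` scores −200 per episode; that is the baseline your DQN should beat.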

4 Report and Submission

• Your report and source code should be compressed into a single archive named "studentID+name+assignment4".