The document provides an overview of deep reinforcement learning (DRL), outlining its fundamental concepts such as Q-learning, policy, value functions, and the role of agents in decision-making. It discusses various aspects including the significance of delayed feedback, the use of neural networks, and methods for improving stability in learning processes. Additionally, it highlights tools, challenges, and resources available for further exploration in the field of DRL.
Environment
import gym

env = gym.make('SpaceInvaders-v0')
state = env.reset()          # initial observation

terminal = False
while not terminal:
    action = env.action_space.sample()   # take an action (random here)
    next_state, reward, terminal, _ = env.step(action)
Expected future rewards
Any goal can be represented as a sum of intermediate rewards.
\[
\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t} \,\middle|\, S_{t}\right]
= \mathbb{E}\left[R_{0} + \gamma R_{1} + \gamma^{2} R_{2} + \dots \,\middle|\, S_{t}\right]
\]
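As a quick sketch in Python, the discounted sum above can be computed directly from a list of rewards (gamma and the rewards here are made up for illustration):

def discounted_return(rewards, gamma=0.99):
    # sum of gamma**t * R_t over one episode
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62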
Tools
1. Policy: π(a|s)
2. Value function: Q(s, a)
3. Model: (P, R)
We have to pick at least one of the three.
Policy
A policy defines how the agent behaves.
It takes as input a state and outputs an action.
It can be stochastic or deterministic.
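A minimal sketch of the two flavours (the state encoding, action count, and probabilities are illustrative, not from the slides):

import numpy as np

def deterministic_policy(state):
    # always maps the same state to the same action
    return 2 if np.sum(state) > 0 else 0

def stochastic_policy(state, n_actions=4):
    # samples an action from a distribution pi(a|s); uniform here for simplicity
    probs = np.ones(n_actions) / n_actions
    return np.random.choice(n_actions, p=probs)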
Value function
A value function estimates how much reward the agent can achieve.
It estimates a value Q(s, a) for each (state, action) pair: given a state,
it outputs one value for each possible action.
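A tabular sketch, assuming made-up sizes: Q holds one value estimate per (state, action) pair, so indexing by a state returns one value per possible action.

import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))   # Q[s, a] = estimated future reward

state = 3
values = Q[state]                     # one value for each possible action
best_action = np.argmax(values)       # the action the agent currently prefers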
Model
A model is the agent's representation of the environment.
It takes as input a state and an action, and outputs (next_state, reward).
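One way to picture it, with a hypothetical dictionary-based model the agent could learn from experience:

model = {}                       # (state, action) -> (next_state, reward)
model[(0, 1)] = (1, 0.0)         # illustrative entries
model[(1, 1)] = (2, 1.0)

def predict(state, action):
    # the agent's own guess about what the environment will do
    return model[(state, action)]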
Repeat
1. Prediction: Compute the value of the expected reward from s_t until the terminal state.
2. Control: Act greedily with respect to the predicted values (see the sketch below).
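A tabular sketch of the two steps, assuming a Q table as above (hyper-parameters are illustrative):

import numpy as np

def control(Q, state):
    # Control: act greedily with respect to the current value estimates
    return int(np.argmax(Q[state]))

def prediction(Q, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    # Prediction: move Q[s, a] toward the expected reward from s_t until the terminal state
    target = reward + gamma * np.max(Q[next_state]) * (1 - done)
    Q[state, action] += alpha * (target - Q[state, action])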
Update rule
In rabbits, humans and machines we get the same algorithm:
while True:
    # move the current estimate a step of size alpha toward the target
    Q[t] = Q[t-1] + alpha * (Q_target - Q[t-1])
Q-Learning [Watkins, 1989]
The agent does not have a model of the environment.
It acts following a behaviour policy, but predicts using the target (greedy) policy.
That makes it an "off-policy", model-free method.
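A tabular sketch of the full algorithm, assuming a discrete-state Gym-style environment and ε-greedy exploration (both are illustrative choices, not from the slides):

import numpy as np

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # behaviour policy: epsilon-greedy exploration
            a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
            s2, r, done, _ = env.step(a)
            # target policy: greedy (the max), which is what makes this off-policy
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (1 - done) - Q[s, a])
            s = s2
    return Q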
Loss function
Building on what we learned from the rabbit,
the learning goal is to minimize the following loss function.
Putting it all together we get:
Q_target = r + gamma * np.max(Q(s_next, A))      # value of the best action in the next state
Loss = 1/n * np.sum((Q_target - Q(s, a)) ** 2)   # mean squared Bellman error
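A concrete sketch with numpy, using a small Q table and a made-up batch of (s, a, r, s') transitions:

import numpy as np

Q = np.random.rand(5, 3)                       # Q[s, a]; sizes are illustrative
gamma = 0.99

s  = np.array([0, 1, 2])                       # states
a  = np.array([1, 0, 2])                       # actions taken
r  = np.array([0.0, 1.0, 0.5])                 # rewards observed
s2 = np.array([1, 2, 3])                       # next states

Q_target = r + gamma * np.max(Q[s2], axis=1)   # Bellman target, one per transition
loss = np.mean((Q_target - Q[s, a]) ** 2)      # mean squared error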
DeepMind ideas
1. Different neural networks for Q and Q_target
2. Estimate Q_target using past experiences
3. Update Q_target every C steps
4. Clip rewards between -1 and 1
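A condensed sketch of how the four ideas fit together, using a tabular stand-in for the two networks (sizes, the period C, and hyper-parameters are illustrative):

import random
import numpy as np
from collections import deque

n_states, n_actions = 16, 4
q_net    = np.zeros((n_states, n_actions))     # 1. online network
q_target = q_net.copy()                        # 1. separate target network
replay   = deque(maxlen=10_000)                # 2. buffer of past experiences
C, gamma, alpha = 500, 0.99, 0.1

def remember(s, a, r, s2, done):
    replay.append((s, a, np.clip(r, -1.0, 1.0), s2, done))   # 4. clip rewards to [-1, 1]

def train_step(step, batch_size=32):
    global q_target
    if len(replay) < batch_size:
        return
    for s, a, r, s2, done in random.sample(replay, batch_size):  # 2. replay past experiences
        target = r + gamma * np.max(q_target[s2]) * (1 - done)   # bootstrap from the target net
        q_net[s, a] += alpha * (target - q_net[s, a])
    if step % C == 0:
        q_target = q_net.copy()                # 3. sync the target network every C steps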
Network
Input: an image of shape [None, 42, 42, 4]
4 Conv2D layers, 32 filters, 4x4 kernel
1 Hidden layer of size 256
1 Fully connected layer of size action_size
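A possible Keras rendering of that architecture (activations, strides, and action_size are assumptions; only the shapes and layer counts come from the slide):

from tensorflow.keras import layers, models

action_size = 6   # e.g. number of actions in Space Invaders; illustrative

model = models.Sequential([
    layers.Conv2D(32, (4, 4), activation='relu', input_shape=(42, 42, 4)),  # stack of 4 frames in
    layers.Conv2D(32, (4, 4), activation='relu'),
    layers.Conv2D(32, (4, 4), activation='relu'),
    layers.Conv2D(32, (4, 4), activation='relu'),
    layers.Flatten(),
    layers.Dense(256, activation='relu'),        # hidden layer of size 256
    layers.Dense(action_size),                   # one Q value per action
])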