Deep Reinforcement Learning
A gentle and (almost) math-free introduction
simone@ai-academy.com
Outline
What it is, why it matters, and where it stands in ML
General framework
Q-Learning
Deep + ...
Code and demo (hopefully working...)
Atari [Nature, 2015]
AlphaGo [Nature, 2016]
"RL tries to understand the optimal way to make
decisions."
David Silver - Research Scientist, Google DeepMind
How does it work?
$H_t = (s_1, a_1, r_1, s_2, \ldots, s_t, a_t, r_t, s_T)$
Why is it different from other ML settings?
No supervisor
Delayed feedback
Time matters
Data depends on the agent's policy
What can we model?
Environment & Agent
Universe
Environment
import gym

env = gym.make('SpaceInvaders-v0')
state = env.reset()                          # start a new episode
terminal = False
while not terminal:
    action = env.action_space.sample()       # take an action (here: a random one)
    next_state, reward, terminal, _ = env.step(action)
$s_1 \to s_2 \to s_3$
Key assumptions
1. The probability of the next state depends only on the current state (the Markov property, written out below)
2. Each state contains all the relevant information
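In symbols, assumption 1 is the Markov property: once the current state is known, the earlier history adds nothing.

$$\mathbb{P}(S_{t+1} \mid S_t) = \mathbb{P}(S_{t+1} \mid S_1, S_2, \ldots, S_t)$$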
It's Me, Mario!
Mario wants to break bricks and free the princess!
Expected future rewards
Any goal can be represented as a sum of intermediate rewards.
$$\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R_t \;\Big|\; S_t\Big] = \mathbb{E}\big[R_0 + \gamma R_1 + \gamma^2 R_2 + \ldots \;\big|\; S_t\big]$$
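As a quick sanity check, here is a tiny sketch (plain Python, function name and numbers are mine) that sums a finite list of rewards with a discount factor:

def discounted_return(rewards, gamma=0.99):
    # sum of gamma^t * R_t over a finite list of rewards
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62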
Tools
1. Policy: π(a|s)
2. Value function: Q(s, a)
3. Model: (P, R)
We have to pick at least 1 of the 3.
Policy
A policy defines how the agent behaves.
It takes a state as input and outputs an action.
It can be stochastic or deterministic.
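A minimal sketch of the two flavours, with a made-up action distribution (names and numbers are my own, not from the slides):

import numpy as np

pi_s = np.array([0.1, 0.2, 0.3, 0.4])        # made-up pi(a|s) for a single state

def deterministic_policy(probs):
    # always returns the single most likely action
    return int(np.argmax(probs))

def stochastic_policy(probs):
    # samples an action according to pi(a|s)
    return int(np.random.choice(len(probs), p=probs))

print(deterministic_policy(pi_s), stochastic_policy(pi_s))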
Value function
A value function estimates how much reward the agent can achieve.
It takes a (state, action) pair as input and outputs a value; for a given state there is one value for each possible action.
Model
A model is the Agent representation of the environment.
Takes as input a state and output (next_state,reward).
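A toy illustration of what such a model could look like (a lookup table of my own invention, not from the talk):

# assumed transition table: (state, action) -> (next_state, reward)
model = {
    ('s1', 'right'): ('s2', 0.0),
    ('s2', 'jump'):  ('s3', 1.0),
}

def predict(state, action):
    # the model answers: "if I take this action here, what happens next?"
    return model[(state, action)]

print(predict('s1', 'right'))   # ('s2', 0.0)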
Design choices
Balance learning and planning
Explore new actions and exploit good ones
Assign credit for correct actions
Quick Q&A
How to solve it?
The goal is to find the optimal policy that maximizes the future expected rewards.
Repeat
1. Prediction: Compute the value of the expected reward from $s_t$ until the terminal state.
2. Control: Act greedily with respect to the predicted values (see the sketch below).
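A bare-bones sketch of the control half (toy value table and names are my assumptions):

import numpy as np

# toy value table: Q[state] -> one value per action (numbers are made up)
Q = {'s1': np.array([0.2, 0.8, 0.1])}

def act_greedily(Q, state):
    # Control: pick the action with the highest predicted value in this state
    return int(np.argmax(Q[state]))

print(act_greedily(Q, 's1'))   # 1
# Prediction then refreshes Q with better estimates (via MC or TD, next slides),
# and the two steps are repeated.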
Approximation of the value function
Monte Carlo (used in AlphaGo)
Temporal Difference (used in Atari)
Temporal Difference
Pavlovian conditioning
Update rule
In rabbits, humans and machines we get the same algorithm:
while True:
    # nudge the current estimate toward the target by a fraction alpha of the error
    Q[t] = Q[t-1] + alpha * (Q_target - Q[t-1])
Q-Learning [Watkins, 1989]
The agent does not have a model of the environment.
It performs actions following a behaviour policy.
It predicts using the target policy.
This makes it an "off-policy", model-free method.
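A compact tabular sketch of Q-learning under my own naming, assuming a gym-style environment with discrete states and actions (e.g. a gridworld, not Atari) and the same 4-tuple step API as the earlier slide:

import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    # Q-table: one row per state, one column per action
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # behaviour policy: epsilon-greedy over the current estimates
            a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # target policy: the greedy max, which is what makes this off-policy
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q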
Loss function
Building on what we learned from the rabbit.
The learning goal is to minimize the following loss function:
Putting it all together we get...
Q_target = r + gamma * np.max(Q(s_next, A))
Loss = 1/n * np.sum((Q_target - Q(s, a)) ** 2)
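For concreteness, a small NumPy sketch of the same computation over a batch (shapes, names and random numbers are my assumptions):

import numpy as np

np.random.seed(0)
n, n_actions, gamma = 32, 4, 0.99

q_next  = np.random.rand(n, n_actions)   # Q(s', a') for each sample in the batch
q_taken = np.random.rand(n)              # Q(s, a) for the action actually taken
rewards = np.random.rand(n)
dones   = np.zeros(n)                    # 1.0 where the episode ended

q_target = rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
loss = np.mean((q_target - q_taken) ** 2)
print(loss)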
Deep Q-Learning
Let's add neural networks and we are good to go, right?
Notice...
1. Data are highly correlated
2. The target values are not robust
3. Wild rewards make the value function freak out
We wish...
A stable Q_target, a robust Q and predictable rewards.
But how?
DeepMind ideas
1. Different neural networks for Q and Q_target
2. Estimate Q_target using past experiences (experience replay, sketched below)
3. Update Q_target every C steps
4. Clip rewards between -1 and 1
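A minimal sketch of ideas 1-4; the buffer size, batch size, C and all names are my own choices, not values from the talk:

import random
from collections import deque
import numpy as np

replay = deque(maxlen=100000)                # idea 2: memory of past experiences

def remember(s, a, r, s_next, done):
    r = float(np.clip(r, -1.0, 1.0))         # idea 4: clip rewards to [-1, 1]
    replay.append((s, a, r, s_next, done))

def sample_batch(n=32):
    # training on random past samples breaks the correlation between consecutive frames
    return random.sample(replay, n)

# ideas 1 and 3: a separate target network, synced with the online network every C steps
C = 10000
# if step % C == 0:                          # hypothetical training-loop snippet
#     target_net.set_weights(q_net.get_weights())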
Network
Input: an image of shape [None, 42, 42, 4]
4 Conv2D layers, 32 filters, 4x4 kernel
1 Hidden layer of size 256
1 Fully connected layer of size action_size
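A sketch of that architecture in tf.keras (the original demo may have used a different TF API; the ReLU activations, default strides and action count are my assumptions):

import tensorflow as tf

action_size = 6                                  # e.g. SpaceInvaders; adjust per environment

model = tf.keras.Sequential([
    tf.keras.Input(shape=(42, 42, 4)),           # 4 stacked 42x42 frames
    tf.keras.layers.Conv2D(32, (4, 4), activation='relu'),
    tf.keras.layers.Conv2D(32, (4, 4), activation='relu'),
    tf.keras.layers.Conv2D(32, (4, 4), activation='relu'),
    tf.keras.layers.Conv2D(32, (4, 4), activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),   # hidden layer of size 256
    tf.keras.layers.Dense(action_size)               # one Q-value per action
])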
Hyperparams
Learning rate: 0.001
Reward clip: (-1, 1)
Gradient clip: 40
Optimizer: AdamOptimizer
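Wired up with tf.keras, reusing `model` from the sketch above; `clipnorm` is my assumed mapping of "Gradient clip: 40", and the MSE loss matches the loss slide:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, clipnorm=40.0)
model.compile(optimizer=optimizer, loss='mse')   # squared error against Q_target

# reward clipping is applied to the data, not the optimizer, e.g. r = max(-1.0, min(1.0, r))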
Tools
Challenges
Demo
OpenAI
TensorFlow
General AI Challenge
Stanford
Resources:
Papers
RL - David Silver
Introduction to RL
Patacchiola Blog
Human-level control through deep RL [Nature, 2015]
Asynchronous Methods for Deep RL
Q&A
Thanks to the Machine Learning / Data Science Meetup
simone@ai-academy.com