Deep Reinforcement Learning
A gentle and (almost) math-free introduction
simone@ai-academy.com
Outline
What it is, why it matters, and where it stands in ML
General framework
Q-Learning
Deep + ...
Code and demo (hopefully working...)
Atari [Nature, 2015]
AlphaGo [Nature, 2016]
"RL tries to understand the optimal way to make
decisions."
David Silver - Research Scientist, Google DeepMind
How does it work?
$H_t = (s_1, a_1, r_1, s_2, \ldots, s_t, a_t, r_t, s_T)$
Why is it different from other ML settings?
No supervisor
Delayed feedback
Time matters
Data depends on the agent's policy
What can we model?
Environment & Agent
Universe
Environment
import gym

env = gym.make('SpaceInvaders-v0')
state = env.reset()                          # start a new episode
terminal = False
while not terminal:
    action = env.action_space.sample()       # take an action (here: a random one)
    next_state, reward, terminal, _ = env.step(action)
$s_1 \to s_2 \to s_3$
Key assumptions
1. The probability of the next state depends only on the current state (the Markov property, written out below)
2. Each state contains all the relevant information
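In symbols, assumption 1 is the Markov property: once the current state is known, the earlier history adds nothing.

$$\mathbb{P}(S_{t+1} \mid S_t) = \mathbb{P}(S_{t+1} \mid S_1, S_2, \ldots, S_t)$$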
It's Me, Mario!
Mario wants to break bricks and free the princess!
Expected future rewards
Any goal can be represented as a sum of intermediate rewards.
$$\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R_t \;\Big|\; S_t\Big] = \mathbb{E}\big[R_0 + \gamma R_1 + \gamma^2 R_2 + \ldots \;\big|\; S_t\big]$$
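As a quick sanity check, here is a tiny sketch (plain Python, function name and numbers are mine) that sums a finite list of rewards with a discount factor:

def discounted_return(rewards, gamma=0.99):
    # sum of gamma^t * R_t over a finite list of rewards
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62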
Tools
1. Policy: π(a|s)
2. Value function: Q(s, a)
3. Model: (P, R)
We have to pick at least 1 of the 3.
Policy
A policy defines how the agent behaves.
It takes a state as input and outputs an action.
It can be stochastic or deterministic.
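A minimal sketch of the two flavours, with a made-up action distribution (names and numbers are my own, not from the slides):

import numpy as np

pi_s = np.array([0.1, 0.2, 0.3, 0.4])        # made-up pi(a|s) for a single state

def deterministic_policy(probs):
    # always returns the single most likely action
    return int(np.argmax(probs))

def stochastic_policy(probs):
    # samples an action according to pi(a|s)
    return int(np.random.choice(len(probs), p=probs))

print(deterministic_policy(pi_s), stochastic_policy(pi_s))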
Value function
A value function estimates how much reward the agent can achieve.
It takes a (state, action) pair as input and outputs a value; for a given state there is one value for each possible action.
Model
A model is the Agent representation of the environment.
Takes as input a state and output (next_state,reward).
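A toy illustration of what such a model could look like (a lookup table of my own invention, not from the talk):

# assumed transition table: (state, action) -> (next_state, reward)
model = {
    ('s1', 'right'): ('s2', 0.0),
    ('s2', 'jump'):  ('s3', 1.0),
}

def predict(state, action):
    # the model answers: "if I take this action here, what happens next?"
    return model[(state, action)]

print(predict('s1', 'right'))   # ('s2', 0.0)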
Design choices
Balance learning and planning
Explore new actions and exploit good ones
Assign credit for correct actions
Quick Q&A
How to solve it?
The goal is to find the optimal policy that maximizes the future expected rewards.
Repeat
1. Prediction: Compute the value of the expected reward from $s_t$ until the terminal state.
2. Control: Act greedily with respect to the predicted values (see the sketch below).
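A bare-bones sketch of the control half (toy value table and names are my assumptions):

import numpy as np

# toy value table: Q[state] -> one value per action (numbers are made up)
Q = {'s1': np.array([0.2, 0.8, 0.1])}

def act_greedily(Q, state):
    # Control: pick the action with the highest predicted value in this state
    return int(np.argmax(Q[state]))

print(act_greedily(Q, 's1'))   # 1
# Prediction then refreshes Q with better estimates (via MC or TD, next slides),
# and the two steps are repeated.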
Approximation of the value function
Monte Carlo (used in AlphaGo)
Temporal Difference (used in Atari)
Temporal Difference
Pavlovian conditioning
Update rule
In rabbits, humans and machines we get the same algorithm:
while True:
    # nudge the current estimate toward the target by a fraction alpha of the error
    Q[t] = Q[t-1] + alpha * (Q_target - Q[t-1])
Q-Learning [Watkins, 1989]
The agent does not have a model of the environment.
It performs actions following a behaviour policy.
It predicts using the target policy.
This makes it an "off-policy", model-free method.
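A compact tabular sketch of Q-learning under my own naming, assuming a gym-style environment with discrete states and actions (e.g. a gridworld, not Atari) and the same 4-tuple step API as the earlier slide:

import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    # Q-table: one row per state, one column per action
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # behaviour policy: epsilon-greedy over the current estimates
            a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # target policy: the greedy max, which is what makes this off-policy
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q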
Loss function
Building on what we learned from the rabbit.
The learning goal is to minimize the following loss function:
Putting it all together we get...
Q_target = r + gamma * np.max(Q(s_next, A))
Loss = 1/n * np.sum((Q_target - Q(s, a)) ** 2)
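For concreteness, a small NumPy sketch of the same computation over a batch (shapes, names and random numbers are my assumptions):

import numpy as np

np.random.seed(0)
n, n_actions, gamma = 32, 4, 0.99

q_next  = np.random.rand(n, n_actions)   # Q(s', a') for each sample in the batch
q_taken = np.random.rand(n)              # Q(s, a) for the action actually taken
rewards = np.random.rand(n)
dones   = np.zeros(n)                    # 1.0 where the episode ended

q_target = rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
loss = np.mean((q_target - q_taken) ** 2)
print(loss)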
Deep Q-Learning
Let's add neural networks and we are good to go, right?
Notice...
1. Data are highly correlated
2. The target values are not robust
3. Wild rewards make the value function freak out
We wish...
A stable Q_target, a robust Q and predictable rewards.
But how?
DeepMind ideas
1. Different neural networks for Q and Q_target
2. Estimate Q_target using past experiences (experience replay, sketched below)
3. Update Q_target every C steps
4. Clip rewards between -1 and 1
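A minimal sketch of ideas 1-4; the buffer size, batch size, C and all names are my own choices, not values from the talk:

import random
from collections import deque
import numpy as np

replay = deque(maxlen=100000)                # idea 2: memory of past experiences

def remember(s, a, r, s_next, done):
    r = float(np.clip(r, -1.0, 1.0))         # idea 4: clip rewards to [-1, 1]
    replay.append((s, a, r, s_next, done))

def sample_batch(n=32):
    # training on random past samples breaks the correlation between consecutive frames
    return random.sample(replay, n)

# ideas 1 and 3: a separate target network, synced with the online network every C steps
C = 10000
# if step % C == 0:                          # hypothetical training-loop snippet
#     target_net.set_weights(q_net.get_weights())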
Network
Input: an image of shape [None, 42, 42, 4]
4 Conv2D layers, 32 filters, 4x4 kernel
1 Hidden layer of size 256
1 Fully connected layer of size action_size
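A sketch of that architecture in tf.keras (the original demo may have used a different TF API; the ReLU activations, default strides and action count are my assumptions):

import tensorflow as tf

action_size = 6                                  # e.g. SpaceInvaders; adjust per environment

model = tf.keras.Sequential([
    tf.keras.Input(shape=(42, 42, 4)),           # 4 stacked 42x42 frames
    tf.keras.layers.Conv2D(32, (4, 4), activation='relu'),
    tf.keras.layers.Conv2D(32, (4, 4), activation='relu'),
    tf.keras.layers.Conv2D(32, (4, 4), activation='relu'),
    tf.keras.layers.Conv2D(32, (4, 4), activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),   # hidden layer of size 256
    tf.keras.layers.Dense(action_size)               # one Q-value per action
])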
Hyperparams
Learning rate: 0.001
Reward clip: (-1, 1)
Gradient clip: 40
Optimizer: AdamOptimizer
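Wired up with tf.keras, reusing `model` from the sketch above; `clipnorm` is my assumed mapping of "Gradient clip: 40", and the MSE loss matches the loss slide:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, clipnorm=40.0)
model.compile(optimizer=optimizer, loss='mse')   # squared error against Q_target

# reward clipping is applied to the data, not the optimizer, e.g. r = max(-1.0, min(1.0, r))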
Tools
Challenges
Demo
OpenAI
TensorFlow
General AI Challenge
Stanford
Resources:
Papers
RL - David Silver
Introduction to RL
Patacchiola Blog
Human-level control through deep RL [Nature, 2015]
Asynchronous Methods for Deep RL
Q&A
Thanks to the Machine Learning / Data Science Meetup
simone@ai-academy.com