An Introduction to Deep Reinforcement Learning
Vishal A. Bhalla
Technical University of Munich (TUM), Germany
Talk @ Big Data & Data Science Meetup | Bogotá, Colombia, 4th Sep ‘17.
1
About Me
● Masters Student in Informatics (CS) at Technical University of Munich (TUM)
○ Major focus in Artificial Intelligence (AI) & Natural Language Understanding (NLU)
○ Applied a wide range of Machine Learning (ML) algorithms in the Automotive, Robotics, Medical Imaging & Security domains
● Interested in exploring Deep Reinforcement Learning (RL) methods for NLU & Dialogue
Systems
● Happy to connect for collaborations on novel and challenging projects
2
Agenda
● Introduction
● Theory & Concepts
● Approaches
● Key Players & Toolkits
● Research considerations
● Envoi
3
Introduction
4
Motivation
5
● Goes beyond input-output pattern recognition
● Synergy of
Deep Neural Networks + Reinforcement Learning
● ‘Mapping’ sensors to actions
● Build new applications
Image courtesy: OpenAI Blog on Evolution Strategies
Major breakthrough!
● AlphaGo defeating the Go World Champion
6
Image courtesy: The Guardian; Twitter - DeepMind AI
Applications
● Learning to play
Atari games
from raw pixels
7
Video courtesy: YouTube @DeepMind - DQN Breakout
Applications (2)
● Games
● Robotics
● Energy Conservation
● Healthcare
● Dialogue Systems
● Marketing
8
Video courtesy: Bipedal Walker - Evolution Strategy Variant + OpenAI Gym
Applications (3)
● Producing flexible behaviours in simulated environments
9
GIF courtesy: DeepMind Blog
Applications (4)
● AI research in the real-time strategy game StarCraft II & DOTA 2
10
Image courtesy: (L) SC2LE - an RL environment based on StarCraft II from DeepMind & Blizzard and (R) A bot which beats the world’s top professionals at 1v1 matches of Dota 2 under standard tournament rules
RL Theory &
Concepts
11
Reinforcement Learning (RL)
● Inspired by research into animal learning
● Correct input/label pairs are never presented
● Focus is on online performance
● Used in environments where:
○ No analytic solution exists
○ Only a simulation model is available
○ Interaction is the only way to gather information
● E.g. making a robot learn how to walk
○ Reward: head position
12
Typical RL scenario
13
Diagram: the Agent performs an Action on the Environment and receives back the new State and a Reward.
Markov Decision Processes (MDPs)
14
● State transition model p(s_{t+1} | s_t, a_t), where s is the state and a the action
● Reward model p(r_{t+1} | s_t, a_t)
○ Depends on the current state and the action performed
● Discount factor γ ∈ [0,1]
○ Controls the importance of future rewards
A simple MDP
Image courtesy: Wikipedia
Policy
● Agent - chooses which action to perform
● Policy - a function of the current environment state
● Action - the policy returns the best one
● Deterministic vs Stochastic environment
15
Rewards
● Agent’s goal: Pick best policy that maximises total reward
● Naive approach - sum up the rewards at each time step (see the formulas below),
where T is the horizon (episode length), which can be infinite
● Discount factor importance
○ The return does not blow up to infinity, as 0 ≤ γ ≤ 1
○ Expresses a preference for immediate rewards
16
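Written out in standard notation, the naive return and its discounted counterpart over a horizon T are:

```latex
R = \sum_{t=0}^{T} r_{t+1}
\qquad \text{vs.} \qquad
R = \sum_{t=0}^{T} \gamma^{t}\, r_{t+1}, \quad 0 \le \gamma \le 1
```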
Brute force
● 2 main steps
○ Sample returns after following each policy
○ Choose the one with the largest expected return
● Issues
○ Large or infinite number of possible policies
○ Large no. of samples required to handle variance of returns
● Solutions
○ Give some structure
○ Allow samples from one policy to influence the estimates for others
17
Types
18
● Model based
1. Agent knows the MDP model
2. Agent uses it to (offline) plan
actions before any interactions
with environment
3. Eg: Value-iteration &
policy-iteration
● Model Free
1. Initial knowledge about possible
state-actions but not MDP model
2. Improves (online) through
learning from the interactions
with the environment
3. Eg: Q-Learning
Value Function
● Goodness of a state
● Expected total reward from start state s
● Depends on the policy
● There exists an optimal value function with the highest value
● Optimal policy π*
19
Value Iteration
● Iteratively compute optimal state value function V(s)
● Guaranteed to converge to optimal values
20
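A minimal value-iteration sketch in Python; the 3-state MDP, its rewards and γ = 0.9 below are made-up illustrations, not taken from the talk:

```python
import numpy as np

# Illustrative 3-state, 2-action MDP: P[s][a] = list of (prob, next_state, reward)
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(0.9, 2, 2.0), (0.1, 1, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
gamma = 0.9

V = np.zeros(len(P))
for _ in range(1000):
    # Bellman optimality backup: V(s) <- max_a sum_{s'} p(s'|s,a) * (r + gamma * V(s'))
    V_new = np.array([
        max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
        for s in P
    ])
    converged = np.max(np.abs(V_new - V)) < 1e-6
    V = V_new
    if converged:
        break
print(V)
```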
Policy Iteration
● Re-define the policy at each step
● Compute value function for this new policy until the policy converges
● Guaranteed to converge
21
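A matching policy-iteration sketch on the same illustrative MDP (repeated here so the snippet stands alone); it alternates policy evaluation and greedy policy improvement until the policy stops changing:

```python
import numpy as np

# Same illustrative MDP layout as in the value-iteration sketch above
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(0.9, 2, 2.0), (0.1, 1, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
gamma = 0.9
policy = {s: 0 for s in P}            # start from an arbitrary policy

while True:
    # Policy evaluation: iteratively compute V under the current policy
    V = np.zeros(len(P))
    for _ in range(1000):
        V = np.array([sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                      for s in P])
    # Policy improvement: act greedily with respect to V
    new_policy = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in P
    }
    if new_policy == policy:          # policy has converged
        break
    policy = new_policy
print(policy, V)
```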
Value vs Policy Iteration
● Used for Offline planning
○ Prior knowledge about MDP
● Policy Iteration is more computationally efficient than Value Iteration
○ Takes fewer iterations to converge
○ However, each iteration is computationally expensive
22
Q Learning
● Model free
● Quality of a certain action in a given state
● Q(s_t, a_t) = max_π R_{t+1}, such that π(s) = argmax_a Q(s, a)
● Bellman equation
○ Q(s, a) = r + γ · max_{a'} Q(s′, a′)
● Iterative algorithm (see the update sketch below)
● The Q-function converges and then represents the true Q-value
23
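A tabular Q-learning update sketch; the learning rate α, the states/actions and the sample transition are illustrative assumptions, not the talk's exact setup:

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)                      # Q[(state, action)] -> value, defaults to 0.0

def q_update(s, a, r, s_next, actions):
    """One Bellman-style update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Illustrative usage with a dummy transition:
q_update(s=0, a=1, r=1.0, s_next=2, actions=[0, 1])
print(Q[(0, 1)])
```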
Going Deep (RL)!
24
Deep Q-Learning
● Q-Learning uses tables to store data
● Combine function approximation with Neural Networks
● Eg: Deep RL for Atari Games
● 10^67970 rows in our imaginary Q-table, more than the number of atoms in the known universe!
● Other variants
○ Double DQN to correct over-estimated action values
○ Online version: Delayed Q-Learning with PAC
○ Greedy, Speedy Q-Learning, etc.
25
Deep Q Network
● Only game screens (and action) as input
● Output Q-value for each possible action
● One Forward pass
● CNN - No pooling
26
Diagrams: the naive formulation of the deep Q-network maps (State, Action) through a Neural Network to a single Q-Value; the optimized architecture (first used in the DeepMind paper) maps the State to one Q-Value per possible action.
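A sketch of the "optimized" architecture in PyTorch, mapping a stack of game screens to one Q-value per action with convolutions and no pooling; the layer sizes follow the published DQN setup, but treat the whole snippet as illustrative rather than the talk's exact model:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions: int, in_frames: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),  # no pooling layers
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),                # one Q-value per possible action
        )

    def forward(self, x):                             # x: (batch, frames, 84, 84)
        return self.net(x)

q_net = DQN(n_actions=4)
q_values = q_net(torch.zeros(1, 4, 84, 84))           # one forward pass -> shape (1, 4)
```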
Policy Gradients
● The policy π has a set of n real-valued parameters θ = {θ_1, θ_2, …, θ_n}
● Calculate the reward gradient ∂R/∂θ_i and update θ_i ← θ_i + α · ∂R/∂θ_i, ∀ i
● The training step resembles Supervised Learning
● Safe exploration and faster than value-based methods
● Converges to a locally best parameter setting
● Works with parameterised policies in high-dimensional spaces
● Advantage-weighted objective: ∑_i A_i log p(y_i ∣ x_i) (see the sketch below)
27
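A minimal advantage-weighted policy-gradient (REINFORCE-style) step in PyTorch; the tiny softmax policy, the sampled actions and the advantage values are dummy placeholders:

```python
import torch

# Tiny softmax policy over 3 actions with learnable parameters theta
theta = torch.zeros(3, requires_grad=True)
probs = torch.softmax(theta, dim=0)

# Dummy sampled batch: the actions that were taken and their (pre-computed) advantages A_i
actions = torch.tensor([0, 2, 2])
advantages = torch.tensor([1.0, -0.5, 2.0])

# Policy-gradient objective: maximise sum_i A_i * log pi(a_i), i.e. minimise its negative
log_probs = torch.log(probs[actions])
loss = -(advantages * log_probs).sum()
loss.backward()                        # d(loss)/d(theta) is the (negative) policy gradient

with torch.no_grad():
    theta += 0.01 * -theta.grad        # one ascent step: theta <- theta + alpha * dR/dtheta
    theta.grad.zero_()
```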
Actor-Critic Algorithms
● Agent uses the Value estimate (critic) to
update the Policy (actor)
● Value function as a baseline for policy gradients
● Utilise a learned value function.
28
Diagram: Actor-Critic architecture
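A rough sketch of how the critic's value estimate serves as a baseline; the linear critic, the 4-dimensional state and the single transition are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Toy critic: maps a 4-dim state to a scalar value estimate V(s)
critic = nn.Linear(4, 1)

s, s_next = torch.randn(4), torch.randn(4)
r, gamma = 1.0, 0.99

# Advantage estimate using the critic as a baseline: A = r + gamma * V(s') - V(s)
with torch.no_grad():
    advantage = r + gamma * critic(s_next) - critic(s)

# The actor update would weight log pi(a|s) by this advantage (see the policy-gradient
# sketch above), while the critic is regressed towards the target r + gamma * V(s').
critic_loss = ((critic(s) - (r + gamma * critic(s_next).detach())) ** 2).mean()
critic_loss.backward()
```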
Asynchronous Advantage Actor-Critic (A3C)
● A3C utilizes multiple Worker agents
● Speedup & Diverse Experience
● Combines benefits of Value & Policy Iteration
● Continuous & Discrete action spaces
29
Images (L-R): A3C training workflow of each worker agent (L) and high-level architecture (R)
Break
30
Examples
31
Dialogue Systems: Interactive RL
32
● Conversational flow.
● The concept of delayed reward fits Dialogue well
ICLR 2017 by FAIR: Learning Through Dialogue Interactions By Asking Questions
Dialogue Systems: Deep RL
33
● Actor-Critic method
● 2 Stage training → Supervised Learning + RL
○ Supervised → Mimic human behaviour
○ RL → Handle unforeseen situations
● User simulations for training
● Infinite state space of probability distributions
● Dialogue act-slot type combinations
Image courtesy: Maluuba: Applying Deep Reinforcement Learning to Dialogue Management
Key Players &
Toolkits
34
Key Players
35
Labs & Groups
● Berkeley Artificial Intelligence Research (BAIR) Lab
○ UC Berkeley EE Department
● Univ. of Alberta, Edmonton, Canada
○ DeepMind’s first international office
36
Richard Sutton, Michael Bowling and Patrick Pilarski @Univ of Alberta
Image courtesy: DeepMind Blog
Researchers
● Prof. Pieter Abbeel, Sergey Levine & Chelsea Finn
○ BAIR, UC Berkeley EE Dept.
● Rich Sutton
○ Univ of Alberta
● David Silver, Oriol Vinyals & Vlad Mnih
○ Google DeepMind
● Ilya Sutskever, Rocky Duan & John Schulman
○ Open AI
● Jason Weston
○ Facebook AI Research
(FAIR)
37
Chelsea Finn, Sergey Levine & Pieter Abbeel from UC Berkeley.
Image courtesy: The New York Times
Tools
● High-quality implementations of reinforcement learning algorithms
○ OpenAI Baselines
○ ChainerRL
● Environments with a set of test problems to write & evaluate RL algorithms
○ OpenAI Gym
○ RLLab
38
Research Frontiers
39
Experience Replay
● Problem:
○ Approximate Q-functions using a CNN
○ The non-linear approximation is not stable and takes time to converge
● Trick:
○ Store all experiences < s, a, r, s’ > in a replay memory
○ Use random mini-batches from it
○ Avoids local minima by breaking the similarity between subsequent training samples
○ Makes it similar to Supervised Learning
40
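A minimal replay-memory sketch; the capacity, batch size and dummy transitions are illustrative:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores <s, a, r, s'> transitions and hands back random mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted automatically

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory()
for t in range(1000):                          # dummy interaction loop
    memory.push(s=t, a=t % 4, r=0.0, s_next=t + 1)
batch = memory.sample(32)
```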
Exploration vs Exploitation?
● Should the agent,
○ Trust the learnt Q values for every action? Or
○ Try other actions which might give a better reward
● The plain Q-learning algorithm is greedy with respect to its current Q-values
● Fix: ε-greedy approach! (sketch below)
○ Pick a random action (explore) with probability ε, or
○ Select an action according to the current Q-values with probability (1 - ε)
○ Decrease ε over time as the agent becomes more confident
41
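A small sketch of ε-greedy action selection with a linear decay schedule; the Q-values, decay horizon and ε range are illustrative:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore; otherwise exploit the current Q estimates."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit

def epsilon_at(step, start=1.0, end=0.1, decay_steps=100_000):
    # Decay epsilon linearly from `start` to `end`, then keep it fixed
    return max(end, start - (start - end) * step / decay_steps)

action = epsilon_greedy([0.1, 0.5, 0.2], epsilon_at(step=50_000))
```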
Genetic Algorithm
● Part of the Evolutionary Computation family of AI
● Meta-heuristic optimization method
● Requirements
○ Represent as string of chromosomes (array of bits)
○ Fitness function to evaluate solutions
● Steps (see the sketch below)
○ Generation - a pool of candidate solutions
○ Next generation - candidate solutions with a higher fitness value, produced via
■ Selection
■ Crossover
■ Mutation
○ Iterate until a solution with the goal fitness value is found
42
Image courtesy: The Genetic Algorithm - Explained
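A toy genetic-algorithm sketch following those steps, maximising the number of 1-bits in a chromosome; the fitness function, population size and mutation rate are illustrative choices:

```python
import random

GENES, POP, GENERATIONS, MUTATION = 20, 30, 100, 0.02
fitness = sum                                          # fitness of a bit-string = number of 1s

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for gen in range(GENERATIONS):
    # Selection: keep the fitter half as parents
    population.sort(key=fitness, reverse=True)
    parents = population[:POP // 2]
    # Crossover + mutation to refill the population
    children = []
    while len(parents) + len(children) < POP:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, GENES)
        child = a[:cut] + b[cut:]                      # single-point crossover
        child = [g ^ 1 if random.random() < MUTATION else g for g in child]  # bit-flip mutation
        children.append(child)
    population = parents + children
    if fitness(population[0]) == GENES:                # goal fitness reached
        break
print(gen, population[0], fitness(population[0]))
```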
Evolution Strategies
● Black-box stochastic optimization
● Fit n parameters against a single reward function
● Guess and tweak the parameters iteratively
● Tradeoffs vs RL
○ No need for backpropagation
○ Highly parallelizable
○ Higher robustness
○ Structured exploration
○ Credit assignment over long time scales
● https://blog.openai.com/evolution-strategies/
43
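A minimal evolution-strategies loop in the spirit of the OpenAI blog post linked above; the toy reward function and the hyperparameters (population size, noise scale, learning rate) are illustrative:

```python
import numpy as np

def reward(theta):                      # toy black-box reward: peak at theta = [3, -1, 0.5]
    return -np.sum((theta - np.array([3.0, -1.0, 0.5])) ** 2)

npop, sigma, lr = 50, 0.1, 0.02
theta = np.zeros(3)
for it in range(300):
    noise = np.random.randn(npop, theta.size)           # guesses = perturbations of theta
    rewards = np.array([reward(theta + sigma * n) for n in noise])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Move theta towards the better-scoring perturbations; no backpropagation needed
    theta += lr / (npop * sigma) * noise.T @ advantages
print(theta)
```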
Exploration with Parameter noise
● Traditional RL uses action space noise
● Parameter space noise injects randomness
directly into the parameters of the agent
● A middle ground between
Evolution Strategies & Deep RL
44
Image courtesy: Better Exploration with Parameter Noise
Current Research & Other Challenges
● Model-based RL
● Inverse RL & Imitation Learning - make use of GANs
● Hierarchical (of policies) RL
● Multi-agent RL (MARL)
● Memory & Attention
● Transfer Learning
● Benchmarks
45
Envoi
46
Summary
● Stable and scalable RL is possible
● Deep networks represent value, policy and model
● Applications - Games, Robotics, Dialogue Systems, etc.
● Lots of hacks and advanced Deep RL paradigms are still required
● Observing the agent is a rewarding experience!
47
References
● Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al.
Human-level control through deep reinforcement learning. [MnihDQN16]
In Nature 518, no. 7540 (2015): 529-533.
● Mnih, Volodymyr, Adria P. Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, & Koray Kavukcuoglu.
Asynchronous methods for deep reinforcement learning. [MnihA3C16]
In International Conference on Machine Learning, pp. 1928-1937. 2016.
● Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage and Anil Anthony Bharath.
A Brief Survey of Deep Reinforcement Learning. [KaiDeepRLSurvey17]
In IEEE Signal Processing Magazine, Special Issue on Deep Learning for Image Understanding.
● Wang, Ziyu, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas.
Sample efficient actor-critic with experience replay. [WangACExpReplay17]
In arXiv preprint arXiv:1611.01224 (2016).
48
Additional Links
● Blogs
○ Deep RL (Episode 0-2) blog series by Moustafa Alzantot
○ Demystifying Deep RL guest post by Tambet Matiisen at Intel-Nervana Systems
○ Maluuba’s blog on Deep RL for Dialogue Systems
○ Simple Reinforcement Learning with Tensorflow 8 Part Series by Arthur Juliani
○ Deep Reinforcement Learning: Pong from Pixels by Andrej Karpathy
● Tutorials
○ David Silver's Deep RL video-lectures
○ Tutorial on Deep RL by Sergey Levine & Chelsea Finn at ICML 2017
○ Deep RL Bootcamp in Berkeley, California USA
49
Questions?
Image courtesy: travelblogadvice
50
Image courtesy: bethratzlaff
51
Backup Slides
52
The End
53
