Make Decisions by Reinforcement Learning

1. OpenAI

1.1. Gym

1.1.1. Atari

1.1.1.1. Freeway

1.1.1.2. Seaquest

1.2. Universe

2. How Far Apart Are Humans and Machines?

2.1. Machine Theory of Mind

2.2. Building Machines That Learn and Think Like People

3. Challenges / Issues

3.1. Memory

3.1.1. Gradient Episodic Memory for Continual Learning (Machine learning has long struggled with one thing: learning new problems without forgetting previously solved tasks. In this paper, the authors propose new learning metrics for evaluating how a model transfers knowledge across a sequence of learning tasks.)
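A minimal sketch of the gradient-projection step at the heart of GEM, assuming flattened NumPy gradients (the names and the single-constraint simplification are illustrative, not the paper's code): if the new-task gradient would increase the loss on transitions kept in episodic memory, remove the conflicting component.

```python
import numpy as np

def gem_project(g_current, g_memory):
    """Project the current-task gradient so it cannot increase the loss
    measured on the episodic memory of a past task (single-constraint case)."""
    dot = g_current @ g_memory
    if dot >= 0:
        # No conflict: the update also (weakly) decreases the memory loss.
        return g_current
    # Remove the conflicting component so the projected gradient is
    # orthogonal to the memory gradient.
    return g_current - (dot / (g_memory @ g_memory)) * g_memory

# Toy usage with random gradients.
g_new, g_mem = np.random.randn(10), np.random.randn(10)
g_safe = gem_project(g_new, g_mem)
assert g_safe @ g_mem >= -1e-9
```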

3.2. Agents

3.2.1. Multi-agent learning

3.3. Environment / Application

3.3.1. Quick response (real time)

3.4. Reward

3.4.1. Sparse rewards

3.4.1.1. For tasks with only success/failure outcomes, if "almost succeeding" is hard to define, the agent may never receive any reward (see the sketch after the reference below)

3.4.1.1.1. Reverse Curriculum Generation for Reinforcement Learning Agents
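A rough sketch of the reverse-curriculum idea under a hypothetical interface (`env.reset_to`, `env.random_backward_step`, and `agent.rollout_and_train` are assumptions, not the paper's API): start episodes from states close to the goal, where reward is easy to reach, and push the start states further back as the success rate grows.

```python
import random

def reverse_curriculum(env, agent, goal_state, n_rounds=50, expand_above=0.8):
    """Grow a pool of start states backwards from the goal (sketch).

    Hypothetical interface: env.reset_to(state) restarts an episode from a
    given state, env.random_backward_step(state) perturbs a state slightly
    further from the goal, and agent.rollout_and_train(env) runs one
    training episode and returns 1 on success, 0 on failure.
    """
    starts = [goal_state]
    for _ in range(n_rounds):
        sampled = random.sample(starts, min(len(starts), 32))
        successes = []
        for s in sampled:
            env.reset_to(s)
            successes.append(agent.rollout_and_train(env))
        if sum(successes) / len(successes) > expand_above:
            # The current starts are easy enough: push new ones further back.
            starts += [env.random_backward_step(random.choice(starts))
                       for _ in range(10)]
    return starts
```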

3.4.2. Reward function not available

3.5. Action / States

3.5.1. Continuous (actions and/or states)

3.5.2. Large decision spaces

3.6. Attention

3.7. Others

3.7.1. Time dependencies

3.7.1.1. The agent's observations depend on its own actions and can contain strong temporal correlations (experience replay is the standard fix; see the sketch below)

3.7.1.2. Temporal credit assignment problem - long-range time dependencies
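A minimal replay-buffer sketch in Python: sampling old transitions uniformly at random breaks the temporal correlation between consecutive observations.

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay buffer."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling decorrelates the training batch.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```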

3.7.2. Learning models of game

3.7.3. Learning Results

3.7.3.1. Multiple levels

3.7.4. Focus on salient parts - Imbalance

3.8. General AI

3.8.1. Lifetime adaption / Continual learning

3.8.1.1. Learning to Compose Skills

3.8.1.2. Policy and Value Transfer in Lifelong Reinforcement Learning

3.8.1.3. Catastrophic Forgetting (needs to incorporate prior knowledge)

3.8.1.3.1. Representation

3.8.1.3.2. Teacher model

3.8.2. Human

3.8.2.1. Investigating Human Priors for Playing Video Games

3.8.2.2. Playing hard exploration games by watching YouTube

3.8.3. Adapt rapidly to new tasks

3.8.3.1. Transfer Learning

3.8.3.1.1. Symbolic

3.8.3.2. Multi-Task

3.8.3.2.1. Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning

3.8.4. HRL - Hierarchical RL

3.8.4.1. Hierarchies of policies

3.8.4.1.1. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, NIPS'16

3.8.4.1.2. The Option-Critic Architecture, AAAI'17

3.8.4.1.3. Deep Successor Reinforcement Learning, NIPS'16 Workshop

3.8.4.1.4. Strategic Attentive Writer for Learning Macro-Actions, NIPS'16

3.8.4.1.5. FeUdal Networks for Hierarchical Reinforcement Learning, ICML'17

3.8.4.1.6. Meta learning shared hierarchies

3.8.4.1.7. ComposeNet: allows an agent to compose simple skills into a hierarchy to solve complicated tasks

3.8.4.1.8. Multi-Level Discovery of Deep Options

3.8.4.1.9. Context-Aware Policy Reuse

3.8.4.2. Others

3.8.4.2.1. Learning Parameterized Skills

3.8.4.2.2. Jointly Learning What and How from Instructions and Goal-States

3.8.4.2.3. Hierarchical Imitation and Reinforcement Learning

3.8.4.3. Applications

3.8.4.3.1. Feature Control as Intrinsic Motivation for Hierarchical Reinforcement Learning

4. Topics & Related works

4.1. Goal, Direction

4.1.1. Action model learning

4.1.2. Skill Chaining

4.1.2.1. Implementing Cst in Learning Layer of Csia for Higher Level of Intelligence

4.1.3. Automated Planning and Scheduling

4.1.4. Action selection

5. Cool components from other works

5.1. Compression

5.1.1. Prune away unused weights or reduce unneeded dimensions; conversely, what survives pruning is the important part (a skill?)
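A minimal magnitude-pruning sketch (NumPy, illustrative only): keep only the largest-magnitude weights and read the surviving mask as the "important" part of the network.

```python
import numpy as np

def magnitude_prune(weights, keep_ratio=0.1):
    """Zero out all but the top `keep_ratio` fraction of weights by magnitude."""
    flat = np.abs(weights).ravel()
    threshold = np.quantile(flat, 1.0 - keep_ratio)
    mask = np.abs(weights) >= threshold   # mask marks the surviving weights
    return weights * mask, mask
```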

5.1.2. Other Issues

5.1.2.1. Hierarchies / Transfer

5.1.2.1.1. Poincaré Embeddings for Learning Hierarchical Representations (Representation learning has become invaluable for modeling symbolic data such as text and graphs. Symbolic data often exhibits an implicit hierarchical structure; for example, all dolphins are mammals, all mammals are animals, all animals are living things, and so on. Capturing this hierarchy would benefit many core problems in AI, such as reasoning about inheritance or modeling complex relations. In this paper, the authors propose a new approach to representation learning that captures hierarchy and similarity at the same time, by changing the geometry of the underlying embedding space and giving an efficient algorithm to learn these hierarchical embeddings.)
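The geometric ingredient is the distance of the Poincaré ball model; a small NumPy sketch of that distance (the embedding training itself, i.e. Riemannian SGD, is not shown):

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Distance in the Poincare ball model of hyperbolic space.

    Points must lie strictly inside the unit ball; points near the origin
    behave like general concepts, points near the boundary like specific ones.
    """
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq / (denom + eps))
```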

5.1.2.2. Imbalance

5.1.2.2.1. Focal Loss -- deals with class imbalance (examples the model already classifies correctly are progressively down-weighted, so training focuses on the hard cases)
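A minimal NumPy sketch of the binary focal loss to make the down-weighting explicit (gamma and alpha follow the paper's common defaults; the function name is mine):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-9):
    """Binary focal loss: the (1 - p_t)^gamma factor shrinks the contribution
    of confidently correct examples, keeping the focus on hard / rare cases."""
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -np.mean(alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps))
```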

5.1.2.3. Model-based decision-making

5.1.2.3.1. Visual interaction networks: Learning a physics simulator from video (Can infer the states of multiple physical objects from just a few video frames and use them to predict object positions. It can also infer the positions of invisible objects and learn dynamics that depend on object properties such as mass.)

5.1.2.4. Memory

5.1.2.5. Using planning-based methods to generate more training data for model-free methods
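A hedged sketch of one way to do this, roughly AlphaGo-style distillation (`planner.best_action` and `policy.train_supervised` are hypothetical helpers; a Gym-style `reset`/`step` environment is assumed):

```python
def distill_planner_into_policy(env, planner, policy, n_episodes=100):
    """Use an expensive planner to label states with actions, then train a
    cheap model-free policy on those (state, action) pairs."""
    dataset = []
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = planner.best_action(state)        # expensive search
            dataset.append((state, action))
            state, reward, done, _ = env.step(action)  # Gym-style step
    policy.train_supervised(dataset)                   # cheap at run time
    return policy
```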

5.1.2.6. Robust imitation of diverse behaviors (NIPS'17) (Can encode a single observed behavior and create a new, similar motion from that one demonstration. It can also switch between different types of behavior, such as different walking styles, even though it has never seen such transitions before.)

5.2. Other ALGs

5.2.1. Evolutionary Algorithms

5.2.1.1. Evolution Strategies as a Scalable Alternative to Reinforcement Learning

5.2.2. RNNs

5.3. Other RL Topics

5.3.1. Robotic simulators

5.3.2. Real room for a wheeled robot

5.3.3. A natural image being captioned by a neural network that uses reinforcement learning to choose where to look

5.4. Other Goals

5.4.1. Transfer Learning

5.4.2. Learning to Reason

5.4.2.1. Relation Finding

5.4.2.1.1. Visual relation (between objects)

5.4.2.1.2. Predicate learner?

5.4.2.2. QA

5.4.2.2.1. VQA

5.4.2.2.2. Question Generation

5.4.3. Learning to Predict

5.4.3.1. Learning to Act by Predicting the Future

5.4.4. Learning to Learn (Meta Learning) -- equip the AI with a set of core priors so that it can learn new tasks quickly

5.4.4.1. Memory-based methods -- learn from past experience

5.4.4.1.1. Meta-learning with memory-augmented neural networks, 2016

5.4.4.1.2. Meta Networks, 2017

5.4.4.2. Gradient-prediction methods -- The goal of meta learning is fast learning, and a key to fast learning is that the network's gradient-descent steps be accurate and fast. So could a neural network use previous tasks to learn how to predict gradients, so that on a new task, learning is faster as long as the predicted gradients are accurate? (see the sketch after the reference below)

5.4.4.2.1. Learning to learn by gradient descent by gradient descent, 2016
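A compact PyTorch sketch of the idea (the class and its sizes are illustrative; the full method also meta-trains the LSTM by unrolling it over many optimization steps, which is omitted here):

```python
import torch
import torch.nn as nn

class LearnedOptimizer(nn.Module):
    """Per-parameter LSTM optimizer in the spirit of 'learning to learn by
    gradient descent by gradient descent': it maps gradients to parameter
    updates and is itself trained across many tasks."""

    def __init__(self, hidden_size=20):
        super().__init__()
        self.cell = nn.LSTMCell(1, hidden_size)
        self.to_update = nn.Linear(hidden_size, 1)

    def forward(self, grad, state=None):
        # Each parameter coordinate is treated as one element of the batch.
        g = grad.reshape(-1, 1)
        h, c = self.cell(g, state)
        update = self.to_update(h).reshape(grad.shape)
        return update, (h, c)

# Illustrative use: replace `theta = theta - lr * grad` with a learned step.
opt = LearnedOptimizer()
theta, grad = torch.randn(5), torch.randn(5)
step, state = opt(grad)          # carry `state` across iterations
theta = theta + step.detach()
```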

5.4.4.3. Loss-prediction methods -- Basic idea: besides better gradients, a better loss also makes learning faster, so could we build a model that uses previous tasks to learn how to predict the loss?

5.4.4.3.1. Learning to Learn: Meta-Critic Networks for Sample Efficient Learning, 2017

5.4.4.4. Attention-based methods -- Basic idea: human attention improves with experience; when we look at an image, we naturally focus on the key regions. So could we train an attention model on previous tasks so that, facing a new task, it attends directly to the most important parts? (see the sketch after the reference below)

5.4.4.4.1. Matching networks for one shot learning, 2016
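A NumPy sketch of the attention-based classification rule at the core of Matching Networks (the embeddings are assumed to come from some trained encoder, which is not shown):

```python
import numpy as np

def matching_network_predict(query, support_embeddings, support_labels, n_classes):
    """One-shot classification by attention over a labelled support set:
    the query's label distribution is a softmax over its cosine similarities
    to the support examples. `support_labels` are ints in range(n_classes)."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    sims = np.array([cosine(query, s) for s in support_embeddings])
    attn = np.exp(sims) / np.exp(sims).sum()   # attention weights
    probs = np.zeros(n_classes)
    for a, y in zip(attn, support_labels):
        probs[y] += a                          # attention-weighted label votes
    return probs
```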

5.4.4.5. Meta learning methods for RL -- feed the reward and the previous action as extra inputs, forcing the network to learn task-level information (see the sketch after the references below)

5.4.4.5.1. Learning to reinforcement learn. 2016

5.4.4.5.2. Rl2: Fast reinforcement learning via slow reinforcement learning, 2016
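A minimal sketch of the input augmentation these papers rely on (function name and shapes are illustrative):

```python
import numpy as np

def rl2_input(observation, prev_action, prev_reward, n_actions):
    """Concatenate the observation with a one-hot of the previous action and
    the previous reward, so a recurrent policy can infer the task from its
    own interaction history."""
    action_onehot = np.zeros(n_actions)
    if prev_action is not None:
        action_onehot[prev_action] = 1.0
    return np.concatenate([observation, action_onehot, [prev_reward]])
```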

5.4.4.6. Methods that train a good base model, applied to both supervised learning and reinforcement learning (see the sketch after the reference below)

5.4.4.6.1. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, 2017
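A first-order sketch of the MAML update (the exact method uses second-order gradients; `grad_fn` is a hypothetical helper that returns the loss gradient of the model parameters on a given task):

```python
import numpy as np

def fomaml_step(theta, tasks, grad_fn, inner_lr=0.01, outer_lr=0.001):
    """First-order MAML: adapt a copy of the shared parameters to each task
    with one gradient step, then move the shared parameters toward the
    average post-adaptation gradient."""
    meta_grad = np.zeros_like(theta)
    for task in tasks:
        theta_adapted = theta - inner_lr * grad_fn(theta, task)   # inner step
        meta_grad += grad_fn(theta_adapted, task)                 # first-order outer grad
    return theta - outer_lr * meta_grad / len(tasks)
```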

5.4.4.7. WaveNet-style methods -- Basic idea: a WaveNet makes use of all of its previous data at every step; could the WaveNet approach be carried over directly to meta learning, i.e., fully exploit past data?

5.4.4.7.1. Meta-Learning with Temporal Convolutions. 2017

5.4.5. Decrease the data needed for training (reduce the number of samples required)

5.4.5.1. LSTM-based methods -- Basic idea: the LSTM's internal update closely resembles a gradient-descent update, so could the LSTM structure be used to train an update mechanism that takes the current network parameters as input and directly outputs the updated parameters? A very clever idea.

5.4.5.1.1. Optimization as a model for few-shot learning

5.4.5.2. Imitation Learning

5.4.5.3. Few-Shot Learning

6. Methods / Algorithms

6.1. Value Functions

6.1.1. DQN

6.1.1.1. Dueling DQN

6.1.1.2. Double-Q learning
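A sketch of the Double Q-learning target, which is what distinguishes it from the plain DQN max-target (`q_online` / `q_target` stand for the two value networks; the names are illustrative):

```python
import numpy as np

def double_q_target(q_online, q_target, next_state, reward, done, gamma=0.99):
    """The online network selects the action, the target network evaluates it,
    which reduces the over-estimation bias of the plain max operator."""
    if done:
        return reward
    best_action = int(np.argmax(q_online(next_state)))          # selection
    return reward + gamma * q_target(next_state)[best_action]   # evaluation
```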

6.2. Policy Search

6.2.1. Policy gradients

6.2.1.1. DPGs - Deterministic Policy Gradients

6.2.2. GPS - Guided Policy Search

6.2.3. Actor-Critic Methods

6.2.3.1. A3C

6.2.3.2. A2C
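A small sketch of the advantage estimates that actor-critic methods like A2C/A3C feed into the policy-gradient update (illustrative; real implementations usually add bootstrapping from the critic and an entropy bonus):

```python
import numpy as np

def a2c_returns_and_advantages(rewards, values, gamma=0.99):
    """Discounted returns and advantages for one rollout: the actor weights
    grad log pi(a|s) by the advantage, and the critic regresses its value
    estimates onto the returns."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    advantages = returns - np.asarray(values, dtype=float)
    return returns, advantages
```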

6.3. Model-based RL

6.3.1. Learning models and policies from pixel information

6.3.1.1. From Pixels to Torques: Policy Learning with Deep Dynamical Models, ICML'15 Workshop

6.3.1.2. Deep Dynamical Models from Image Pixels, IFAC SYSID 2015

6.3.1.3. Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images, NIPS'15

6.3.1.4. Action-Conditional Video Prediction using Deep Networks in Atari Games, NIPS'15

6.3.1.5. Deep Spatial Autoencoders for Visuomotor Learning, ICRA'16

6.3.1.6. Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models, NIPS'15 Workshop

6.4. IRL - Inverse RL

6.5. Others

6.5.1. Guided by Natural Language

6.5.1.1. Beating Atari with Natural Language Guided Reinforcement Learning

7. Platform

7.1. TORCS car racing simulator

7.2. ELF: An Extensive, Lightweight and Flexible Research Platform for Real-time Strategy Games

7.3. Benchmarks