1. Applications of RL
1.1. Control
1.2. Robotics
1.3. Business
1.4. Manufacturing
1.5. Finance sector
1.6. Chemistry
1.7. Game playing
2. RL platforms
2.1. OpenAI Gym and Universe
2.2. DeepMind Lab
2.3. RL-Glue
2.4. Project Malmo
2.5. ViZDoom
3. Markov Decision Process
3.1. Almost all RL problems can be modeled as an MDP
3.2. A mathematical framework for modeling reinforcement learning problems
3.3. Sequential decision making under uncertainty
3.4. Markov property
3.4.1. The future depends only on the present state, not on the past
3.5. Markov Chain
3.5.1. A probabilistic model
3.5.2. Strictly follows the Markov property
3.5.3. It has only states and transition probabilities, with no actions or rewards (see the sketch below)
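A minimal sketch of a Markov chain as a table of transition probabilities; the state names and numbers here are illustrative assumptions, not from the source.

```python
import random

# Illustrative two-state Markov chain; the states and probabilities are made up.
states = ["sunny", "rainy"]
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},  # P(next state | current = sunny)
    "rainy": {"sunny": 0.4, "rainy": 0.6},  # P(next state | current = rainy)
}

def step(state):
    # Markov property: the next state depends only on the current state.
    return random.choices(states, weights=[P[state][s] for s in states])[0]

state = "sunny"
for _ in range(5):
    state = step(state)
    print(state)
```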
3.6. An MDP has states, feedback (rewards), and decision making (actions)
3.7. MDP transition graph
3.7.1. types
3.7.1.1. state node
3.7.1.2. action node
3.8. Elements
3.8.1. Set of states
3.8.2. Set of Actions
3.8.3. Transition Probability
3.8.4. Reward probability
3.8.5. Discount Factor
3.8.5.1. A discount factor of 0 considers only immediate rewards
3.8.5.2. A discount factor of 1 weights future rewards fully, so the return may grow to infinity
3.8.5.3. In practice, the optimal discount factor usually lies between 0.2 and 0.8 (see the toy MDP sketch below)
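A minimal sketch of how the elements above could be written down for a toy MDP; every state name, action, and number below is an illustrative assumption.

```python
# Toy MDP: all names and numbers are illustrative assumptions.
states = ["s0", "s1"]
actions = ["left", "right"]

# Transition probability P(s' | s, a)
P = {
    ("s0", "left"):  {"s0": 1.0},
    ("s0", "right"): {"s1": 1.0},
    ("s1", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s1": 1.0},
}

# Expected reward R(s, a)
R = {
    ("s0", "left"): 0.0, ("s0", "right"): 1.0,
    ("s1", "left"): 0.0, ("s1", "right"): 2.0,
}

gamma = 0.8  # discount factor, at the upper end of the 0.2-0.8 range mentioned above
```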
4. Value Functions
4.1. Types
4.1.1. State value
4.1.2. Action value
4.1.2.1. Softmax action selection
4.1.2.2. Epsilon greedy action selection
4.1.2.3. Tracking a nonstationary problem
4.1.2.4. Optimistic initial values
4.1.2.5. Upper confidence bound action selection
4.2. Both are expectations of the return (action-selection sketch below)
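Hedged sketches of the two most common action-selection rules listed above, epsilon-greedy and softmax, applied to a table of estimated action values; the function and variable names are assumptions for illustration.

```python
import math
import random

def epsilon_greedy(Q, epsilon=0.1):
    # With probability epsilon explore a random action, otherwise exploit the best estimate.
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

def softmax_action(Q, tau=1.0):
    # Choose actions with probability proportional to exp(Q/tau); tau controls exploration.
    prefs = [math.exp(q / tau) for q in Q]
    return random.choices(range(len(Q)), weights=prefs)[0]

Q = [0.2, 0.5, 0.1]  # illustrative action-value estimates
print(epsilon_greedy(Q), softmax_action(Q))
```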
5. Elements of RL
5.1. Policy
5.1.1. What to do
5.2. Reward
5.2.1. What is good
5.3. Value
5.3.1. What is good because it predicts reward
5.4. Model
5.4.1. What follows what
6. Key features
6.1. Learner is not told which actions to take
6.2. Trial-and-Error search
6.3. need to explore and exploit
6.4. Possibility of delayed reward
7. n-armed bandit problem
7.1. Non-Associative Learning
7.2. Goal is to find the best action
7.3. Evaluative Feedback
7.4. The reward depends entirely on the action taken
8. Definition: learning the best actions based on reward or punishment
9. Types of RL Environment
9.1. Deterministic environment
9.1.1. The outcome can be determined from the current state
9.2. Stochastic environment
9.2.1. Cannot determine the outcome based on the current state
9.3. Fully observable environment
9.3.1. Agent can determine state of the system at all times
9.4. Partially observable environment
9.4.1. Agent cannot determine state of the system at all times
9.5. Episodic
9.5.1. Non-sequential environment; an agent's current action will not affect future actions
9.6. Non-episodic
9.6.1. Sequential environment; an agent's current action will affect future actions
9.7. Discrete Environment
9.7.1. There is only a finite set of actions available for moving from one state to another
9.8. Continuous environment
9.8.1. There is an infinite set of actions available for moving from one state to another
9.9. Single-agent and multi-agent environments
9.9.1. A single-agent environment has only one agent
9.9.2. A multi-agent environment has multiple agents; it is used extensively for complex tasks and is stochastic because of its greater level of uncertainty
10. Rewards and returns
10.1. The return is the sum of all rewards (see the sketch below)
10.2. Rewards can be positive or negative
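A small sketch of computing the discounted return as the sum of rewards; the reward sequence and discount value are made-up illustrations.

```python
def discounted_return(rewards, gamma=0.9):
    # G = r1 + gamma*r2 + gamma^2*r3 + ...; rewards may be positive or negative.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, -0.5, 2.0]))  # 1.0 - 0.45 + 1.62 = 2.17
```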
11. Bandit problem
11.1. Explore-Exploit Dilemma
11.1.1. Exploration
11.1.1.1. Finding information about an environment
11.1.1.2. Improve knowledge for long-term benefit
11.1.1.3. E.g. play an experimental move
11.1.2. Exploitation
11.1.2.1. Exploiting already known information to maximize the rewards
11.1.2.2. Use existing knowledge for short-term benefit
11.1.2.3. E.g. play the move you believe is best
11.2. Learning Automata
11.3. Exploration Schemes
11.4. Types
11.4.1. Multi-armed bandit
11.4.1.1. Classical problem in RL
11.4.1.2. Reward as soon as an action is performed
11.4.1.3. A single slot machine is a one-armed bandit; a row of slot machines is called a multi-armed or k-armed bandit (see the simulation sketch after this section)
11.4.1.4. Applications
11.4.1.4.1. A/B testing (one of the commonly used classical testing methods)
11.4.1.4.2. Website optimisation
11.4.1.4.3. Maximizing conversion rate
11.4.1.4.4. Online advertisements
11.4.1.4.5. Campaigning
11.4.2. One armed bandit
11.4.2.1. A single slot machine
11.4.2.2. Non-associative: the state is not considered
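A minimal epsilon-greedy simulation of a k-armed bandit using incremental sample-average estimates; the arm reward means and the epsilon value are illustrative assumptions.

```python
import random

true_means = [0.2, 0.5, 0.8]   # assumed mean payout of each slot-machine arm
Q = [0.0] * len(true_means)    # estimated action values
N = [0] * len(true_means)      # number of times each arm was pulled
epsilon = 0.1

for _ in range(1000):
    # Epsilon-greedy: explore occasionally, otherwise exploit the current best estimate.
    if random.random() < epsilon:
        a = random.randrange(len(Q))
    else:
        a = max(range(len(Q)), key=lambda i: Q[i])
    reward = random.gauss(true_means[a], 1.0)  # reward arrives as soon as the arm is pulled
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]             # incremental sample-average update

print(Q)  # the estimates should approach the true means
```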
12. Approximation
12.1. The agent can only approximate the ideal to varying degrees
12.2. A large amount of memory is required to build an approximation
13. Optimality
13.1. A well-defined notion of optimality organizes the approaches to learning
13.2. Bellman equation
13.2.1. Decomposes the value function into two parts: the immediate reward plus the discounted future values (written out below)
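Written out in its standard textbook form for the state-value function under a policy π, using the transition probability, reward, and discount factor defined above:

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a) + \gamma\, V^{\pi}(s') \bigr]
```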