1. Applications of RL
1.1. Control
1.2. Robotics
1.3. Business
1.4. Manufacturing
1.5. Finance sector
1.6. Chemistry
1.7. Game playing
2. RL platforms
2.1. OpenAI Gym and Universe
2.2. DeepMind Lab
2.3. RL-Glue
2.4. Project Malmo
2.5. ViZDoom
3. Markov Decision Process
3.1. Almost all RL problems can be modeled as an MDP
3.2. A mathematical framework for modeling reinforcement learning problems
3.3. Sequential decision making under uncertainty
3.4. Markov property
3.4.1. The future depends only on the present state, not on the past
3.5. Markov Chain
3.5.1. A probabilistic model
3.5.2. Strictly follows the Markov property
3.5.3. It has only states and transition probabilities, with no actions or rewards (see the sketch below)
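A minimal sketch of a Markov chain as a table of transition probabilities; the state names and numbers here are illustrative assumptions, not from the source.

```python
import random

# Illustrative two-state Markov chain; the states and probabilities are made up.
states = ["sunny", "rainy"]
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},  # P(next state | current = sunny)
    "rainy": {"sunny": 0.4, "rainy": 0.6},  # P(next state | current = rainy)
}

def step(state):
    # Markov property: the next state depends only on the current state.
    return random.choices(states, weights=[P[state][s] for s in states])[0]

state = "sunny"
for _ in range(5):
    state = step(state)
    print(state)
```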
3.6. An MDP has states, feedback (rewards), and decision making (actions)
3.7. MDP transition graph
3.7.1. types
3.7.1.1. state node
3.7.1.2. action node
3.8. Elements
3.8.1. Set of states
3.8.2. Set of Actions
3.8.3. Transition Probability
3.8.4. Reward probability
3.8.5. Discount Factor
3.8.5.1. A discount factor of 0 considers only immediate rewards
3.8.5.2. A discount factor of 1 weights future rewards fully, so the return may grow to infinity
3.8.5.3. In practice, the optimal discount factor usually lies between 0.2 and 0.8 (see the toy MDP sketch below)
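A minimal sketch of how the elements above could be written down for a toy MDP; every state name, action, and number below is an illustrative assumption.

```python
# Toy MDP: all names and numbers are illustrative assumptions.
states = ["s0", "s1"]
actions = ["left", "right"]

# Transition probability P(s' | s, a)
P = {
    ("s0", "left"):  {"s0": 1.0},
    ("s0", "right"): {"s1": 1.0},
    ("s1", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s1": 1.0},
}

# Expected reward R(s, a)
R = {
    ("s0", "left"): 0.0, ("s0", "right"): 1.0,
    ("s1", "left"): 0.0, ("s1", "right"): 2.0,
}

gamma = 0.8  # discount factor, at the upper end of the 0.2-0.8 range mentioned above
```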
4. Value Functions
4.1. Types
4.1.1. State value
4.1.2. Action value
4.1.2.1. Softmax action selection
4.1.2.2. Epsilon greedy action selection
4.1.2.3. Tracking a nonstationary problem
4.1.2.4. Optimistic initial values
4.1.2.5. Upper confidence bound action selection
4.2. Both are expectations of the return (action-selection sketch below)
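Hedged sketches of the two most common action-selection rules listed above, epsilon-greedy and softmax, applied to a table of estimated action values; the function and variable names are assumptions for illustration.

```python
import math
import random

def epsilon_greedy(Q, epsilon=0.1):
    # With probability epsilon explore a random action, otherwise exploit the best estimate.
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

def softmax_action(Q, tau=1.0):
    # Choose actions with probability proportional to exp(Q/tau); tau controls exploration.
    prefs = [math.exp(q / tau) for q in Q]
    return random.choices(range(len(Q)), weights=prefs)[0]

Q = [0.2, 0.5, 0.1]  # illustrative action-value estimates
print(epsilon_greedy(Q), softmax_action(Q))
```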
5. Elements of RL
5.1. Policy
5.1.1. What to do
5.2. Reward
5.2.1. What is good
5.3. Value
5.3.1. What is good because it predicts reward
5.4. Model
5.4.1. What follows what
6. Key features
6.1. Learner is not told which actions to take
6.2. Trial-and-Error search
6.3. need to explore and exploit
6.4. Possibility of delayed reward
7. n-armed bandit problem
7.1. Non-Associative Learning
7.2. Goal is to find the best action
7.3. Evaluative Feedback
7.4. The reward depends entirely on the action taken
8. Definition: learning the best actions based on reward or punishment
9. Types of RL Environment
9.1. Deterministic environment
9.1.1. The outcome can be determined from the current state
9.2. Stochastic environment
9.2.1. Cannot determine the outcome based on the current state
9.3. Fully observable environment
9.3.1. Agent can determine state of the system at all times
9.4. Partially observable environment
9.4.1. Agent cannot determine state of the system at all times
9.5. Episodic
9.5.1. Non-sequential environment; an agent's current action will not affect future actions
9.6. Non-episodic
9.6.1. Sequential environment; an agent's current action will affect future actions
9.7. Discrete Environment
9.7.1. There is only a finite set of actions available for moving from one state to another
9.8. Continuous environment
9.8.1. There is an infinite set of actions available for moving from one state to another
9.9. Single-agent and multi-agent environments
9.9.1. A single-agent environment has only one agent
9.9.2. A multi-agent environment has multiple agents; it is used extensively for complex tasks and is stochastic because of its greater level of uncertainty
10. Rewards and returns
10.1. The return is the sum of all rewards (see the sketch below)
10.2. Rewards can be positive or negative
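A small sketch of computing the discounted return as the sum of rewards; the reward sequence and discount value are made-up illustrations.

```python
def discounted_return(rewards, gamma=0.9):
    # G = r1 + gamma*r2 + gamma^2*r3 + ...; rewards may be positive or negative.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, -0.5, 2.0]))  # 1.0 - 0.45 + 1.62 = 2.17
```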
11. Bandit problem
11.1. Explore-Exploit Dilemma
11.1.1. Exploration
11.1.1.1. Finding information about an environment
11.1.1.2. Improve knowledge for long-term benefit
11.1.1.3. E.g. play an experimental move
11.1.2. Exploitation
11.1.2.1. Exploiting already known information to maximize the rewards
11.1.2.2. Use existing knowledge for short-term benefit
11.1.2.3. E.g. play the move you believe is best
11.2. Learning Automata
11.3. Exploration Schemes
11.4. Types
11.4.1. Multi-armed bandit
11.4.1.1. Classical problem in RL
11.4.1.2. Reward as soon as an action is performed
11.4.1.3. A single slot machine is a one-armed bandit; a row of slot machines is called a multi-armed or k-armed bandit (see the simulation sketch after this section)
11.4.1.4. Applications
11.4.1.4.1. A/B testing (one of the commonly used classical testing methods)
11.4.1.4.2. Website optimisation
11.4.1.4.3. Maximizing conversion rate
11.4.1.4.4. Online advertisements
11.4.1.4.5. Campaigning
11.4.2. One armed bandit
11.4.2.1. A single slot machine
11.4.2.2. Non-associative: the state is not considered
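A minimal epsilon-greedy simulation of a k-armed bandit using incremental sample-average estimates; the arm reward means and the epsilon value are illustrative assumptions.

```python
import random

true_means = [0.2, 0.5, 0.8]   # assumed mean payout of each slot-machine arm
Q = [0.0] * len(true_means)    # estimated action values
N = [0] * len(true_means)      # number of times each arm was pulled
epsilon = 0.1

for _ in range(1000):
    # Epsilon-greedy: explore occasionally, otherwise exploit the current best estimate.
    if random.random() < epsilon:
        a = random.randrange(len(Q))
    else:
        a = max(range(len(Q)), key=lambda i: Q[i])
    reward = random.gauss(true_means[a], 1.0)  # reward arrives as soon as the arm is pulled
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]             # incremental sample-average update

print(Q)  # the estimates should approach the true means
```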
12. Approximation
12.1. The agent can only approximate the ideal to varying degrees
12.2. A large amount of memory is required to build an approximation
13. Optimality
13.1. A well-defined notion of optimality organizes the approaches to learning
13.2. Bellman equation
13.2.1. Decomposes the value function into two parts: the immediate reward plus the discounted future values (written out below)
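Written out in its standard textbook form for the state-value function under a policy π, using the transition probability, reward, and discount factor defined above:

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a) + \gamma\, V^{\pi}(s') \bigr]
```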