The Mathematics of Q-learning

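The following derivations use the symbols defined in the prerequisite article.
Q-learning is based on the Bellman equation, which for a policy $\pi$ reads:

$$v_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\right]$$

$\mathbb{E}$: expectation
$t + 1$: the next time step, so $S_{t+1}$ is the next state
$\gamma$: discount factor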

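Rephrasing the above equation in terms of the Q-value of a policy $\pi$:

$$q_\pi(s, a) = \mathbb{E}_\pi\left[R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\right]$$

The optimal Q-value is defined as

$$q_*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a\right]$$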
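
Q-learning estimates the optimal Q-value from sampled transitions instead of the full expectation. The sketch below shows one tabular update of the form Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]. The learning rate `alpha`, the function name `q_learning_update`, and the toy numbers are assumptions made for illustration; they do not come from the derivation above.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update for the transition (s, a, r, s_next).

    Q is a mapping from (state, action) pairs to estimated values.
    alpha (learning rate) and gamma (discount factor) are illustrative defaults.
    """
    # Sampled version of the max over next actions in the optimal Q-value.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_target = r + gamma * best_next
    # Move the current estimate a fraction alpha towards the target.
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical example: two actions, unseen entries default to 0.0.
Q = defaultdict(float)
actions = [0, 1]
Q[(1, 0)] = 2.0  # pretend state 1, action 0 is already known to be worth 2.0

new_value = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=actions)
```

Here the target is 1.0 + 0.9 * 2.0, and the estimate for (0, 1) moves halfway from 0.0 towards it because `alpha` is 0.5.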

Policy iteration is the process of determining the optimal policy for the model. It consists of the following two steps:

  1. Policy evaluation: this step estimates the long-term reward (value) function V for the greedy policy obtained from the last policy improvement step.
  2. Policy improvement: this step updates the policy, for each state, with the action that maximizes the expected return computed from V. The two steps are repeated until convergence is achieved.

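The steps involved are:

  • Initialization:

    $V(s)$ = any random real number
    $\pi(s)$ = any action in $A(s)$, chosen arbitrarily

  • Policy evaluation (with $\theta$ a small positive threshold and $p(s', r \mid s, a)$ the probability of reaching state $s'$ with reward $r$):
        while ($\Delta > \theta$) {
            $\Delta = 0$
            for each s in S {
                $v = V(s)$
                $V(s) = \sum_{s', r} p(s', r \mid s, \pi(s)) \, [r + \gamma V(s')]$
                $\Delta = \max(\Delta, |v - V(s)|)$
            }
        }
  • Policy improvement:
        while (true) {
            stable = true
            for each s in S {
                $a_{old} = \pi(s)$
                $\pi(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \, [r + \gamma V(s')]$
                if ($a_{old} \neq \pi(s)$) stable = false
            }
            if (stable) break from both loops
            otherwise repeat policy evaluation
        }
        return $V$, $\pi$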
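
The policy iteration steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the two-state MDP `P`, its `(probability, next_state, reward)` tuple layout, the helper names, and the defaults `gamma=0.9` and `theta=1e-10` are all hypothetical. V is initialized to zeros rather than random values to keep the run deterministic.

```python
# Hypothetical MDP: P[state][action] = [(probability, next_state, reward), ...]
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}


def lookahead(P, V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def policy_evaluation(P, policy, V, gamma, theta):
    """Sweep over all states until the largest change drops below theta."""
    while True:
        delta = 0.0
        for s in P:
            v_old = V[s]
            V[s] = lookahead(P, V, s, policy[s], gamma)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V


def policy_iteration(P, gamma=0.9, theta=1e-10):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    V = {s: 0.0 for s in P}                      # assumption: zeros, not random
    policy = {s: next(iter(P[s])) for s in P}    # arbitrary initial action
    while True:
        policy_evaluation(P, policy, V, gamma, theta)
        stable = True
        for s in P:
            old_action = policy[s]
            # Greedy action with respect to the current value estimate.
            policy[s] = max(P[s], key=lambda a: lookahead(P, V, s, a, gamma))
            if old_action != policy[s]:
                stable = False
        if stable:
            return V, policy


V_pi, pi = policy_iteration(P)
```

In this toy MDP, staying in state 1 pays 2.0 per step, so the values approach 20 for state 1 and 19 for state 0, and the resulting policy moves from state 0 to state 1 and then stays there.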

Value iteration: this process updates the function V according to the Bellman optimality equation:

$$v_*(s) = \max_a \mathbb{E}\left[R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s, A_t = a\right]$$

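Working steps:

  • Initialization: initialize the array V with arbitrary real numbers.
  • Optimal value calculation (with $\theta$ a small positive threshold):
        while ($\Delta > \theta$) {
            $\Delta = 0$
            for each s in S {
                $v = V(s)$
                $V(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \, [r + \gamma V(s')]$
                $\Delta = \max(\Delta, |v - V(s)|)$
            }
        }
        return $V$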
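
The value iteration loop can be sketched in Python in the same way. As before, the two-state MDP `P`, its `(probability, next_state, reward)` layout, the function names, and the defaults `gamma=0.9` and `theta=1e-10` are assumptions for illustration. The sketch also derives a greedy policy from the final V, which goes one step beyond the working steps listed above.

```python
# Hypothetical MDP: P[state][action] = [(probability, next_state, reward), ...]
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}


def backup(P, V, s, a, gamma):
    """Expected one-step return of action a in state s under values V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma=0.9, theta=1e-10):
    """Apply the Bellman optimality backup until V changes by less than theta."""
    V = {s: 0.0 for s in P}  # assumption: zeros instead of random values
    while True:
        delta = 0.0
        for s in P:
            v_old = V[s]
            # Maximize over actions instead of following a fixed policy.
            V[s] = max(backup(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    # Extra step (not in the listed working steps): read off a greedy policy.
    policy = {s: max(P[s], key=lambda a: backup(P, V, s, a, gamma)) for s in P}
    return V, policy


V_star, pi_star = value_iteration(P)
```

Unlike policy iteration, this loop never evaluates an intermediate policy in full; each sweep applies the max over actions directly, and the policy is extracted only once at the end.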