Q-Learning in Python



Reinforcement Learning, briefly, is a paradigm of the learning process in which a learning agent learns, over time, to behave optimally in a certain environment by interacting with it continuously. During the course of learning, the agent experiences various situations in the environment it is in; these are called states. The agent, while in a given state, may choose from a set of allowable actions, which may fetch different rewards (or penalties). Over time, the learning agent learns to maximize these rewards so as to behave optimally in whatever state it finds itself.

Q-Learning is a basic form of Reinforcement Learning that uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.

  1. Q-values or action values: Q-values are defined for states and actions. Q(S, A) is an estimate of how good it is to take the action A in the state S. This estimate is updated iteratively using the TD-Update rule, which we will see in the following sections.
  2. Rewards and episodes: during the course of its life cycle, the agent starts from an initial state and makes a number of transitions from its current state to a next state, based on its choice of action and on the environment it interacts with. At every step of a transition, the agent takes an action from its state, observes a reward from the environment, and then transitions to another state. If, at any point in time, the agent ends up in one of the terminating states, no further transition is possible; this is said to be the completion of an episode.
  3. Temporal Difference or TD-Update:

    The Temporal Difference rule, or TD-Update, can be represented as follows:

    Q(S, A) ← Q(S, A) + α · (R + γ · Q(S', A') − Q(S, A))

    This update rule for estimating the Q-value is applied at every time step of the agent's interaction with the environment. The terms used are explained below:

    • S: the current state of the agent.
    • A: the current action, picked according to some policy.
    • S': the next state the agent ends up in.
    • A': the next best action, chosen using the current Q-value estimate, i.e. the action with the highest Q-value in the next state.
    • R: the current reward observed from the environment in response to the current action.
    • γ (> 0 and <= 1): the discount factor for future rewards. Future rewards are less valuable than current rewards, so they must be discounted. Since the Q-value is an estimate of the expected reward from a state, the discounting rule applies here as well.
    • α: the step length (learning rate) taken to update the estimate of Q(S, A).
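
    As a quick worked illustration (with made-up numbers): suppose α = 0.5, γ = 0.9, the current estimate is Q(S, A) = 2.0, the observed reward is R = 1, and the highest Q-value in the next state is Q(S', A') = 3.0. The TD target is then 1 + 0.9 · 3.0 = 3.7, the TD error is 3.7 − 2.0 = 1.7, and the updated estimate is Q(S, A) = 2.0 + 0.5 · 1.7 = 2.85.
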
  4. Choosing an action using the ε-greedy policy:

    The ε-greedy policy is a very simple policy for choosing actions using the current Q-value estimates. It works as follows:

    • With probability (1 − ε), select the action that has the highest Q-value.
    • With probability ε, select an action at random.
  5. Now that the whole theory is in hand, let's look at an example. We will be using the OpenAI gym to train our Q-Learning model.

    Command to install gym

     pip install gym 

    Before starting with the example, you will need some helper code to visualize the algorithm. Download the two auxiliary files into your working directory. You can find the files here.

    Step #1: Import the required libraries.

    import gym
    import itertools
    import matplotlib
    import matplotlib.style
    import numpy as np
    import pandas as pd
    import sys

    from collections import defaultdict
    from windy_gridworld import WindyGridworldEnv
    import plotting

    matplotlib.style.use('ggplot')

    Step #2: Create the gym environment.

    env = WindyGridworldEnv()
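
    As a quick sanity check (a small optional sketch, assuming WindyGridworldEnv exposes the standard gym Discrete spaces), you can inspect how many states and actions the environment defines before training:

    # Optional: inspect the environment's state and action spaces
    print(env.observation_space)   # set of grid positions (states)
    print(env.action_space)        # set of available actions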

    Step #3: Create the ε-greedy policy.

    def createEpsilonGreedyPolicy(Q, epsilon, num_actions):
        """
        Creates an epsilon-greedy policy based on
        a given Q-function and epsilon.

        Returns a function that takes a state
        as input and returns the probabilities
        of each action as a numpy array whose length
        equals the size of the action space (the set of possible actions).
        """
        def policyFunction(state):

            Action_probabilities = np.ones(num_actions,
                    dtype=float) * epsilon / num_actions

            best_action = np.argmax(Q[state])
            Action_probabilities[best_action] += (1.0 - epsilon)
            return Action_probabilities

        return policyFunction
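
    To see what the policy function returns, here is a minimal usage sketch (the Q table, state label and action count below are made up purely for illustration; it relies on the imports from Step #1):

    # Hypothetical Q table with 4 actions per state, for illustration only
    Q_demo = defaultdict(lambda: np.zeros(4))
    Q_demo['some_state'][2] = 5.0     # pretend action 2 currently looks best

    demo_policy = createEpsilonGreedyPolicy(Q_demo, epsilon=0.1, num_actions=4)
    print(demo_policy('some_state'))
    # -> [0.025 0.025 0.925 0.025]: the greedy action gets
    #    (1 - epsilon) + epsilon / num_actions, all others epsilon / num_actions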

    Step #4: Build the Q-Learning model.

    def qLearning(env, num_episodes, discount_factor=1.0,
                  alpha=0.6, epsilon=0.1):
        """
        Q-Learning algorithm: off-policy TD control.
        Finds the optimal greedy policy while improving it
        by following an epsilon-greedy policy.
        """

        # Action value function.
        # A nested dictionary that maps
        # state -> (action -> action-value).
        Q = defaultdict(lambda: np.zeros(env.action_space.n))

        # Keeps track of useful statistics
        stats = plotting.EpisodeStats(
            episode_lengths=np.zeros(num_episodes),
            episode_rewards=np.zeros(num_episodes))

        # Create an epsilon-greedy policy function
        # appropriate for the environment's action space
        policy = createEpsilonGreedyPolicy(Q, epsilon, env.action_space.n)

        # For every episode
        for ith_episode in range(num_episodes):

            # Reset the environment and get the starting state
            state = env.reset()

            for t in itertools.count():

                # get the probabilities of all actions from the current state
                action_probabilities = policy(state)

                # choose an action according to
                # the probability distribution
                action = np.random.choice(np.arange(
                    len(action_probabilities)),
                    p=action_probabilities)

                # take the action, get the reward and transit to the next state
                next_state, reward, done, _ = env.step(action)

                # Update statistics
                stats.episode_rewards[ith_episode] += reward
                stats.episode_lengths[ith_episode] = t

                # TD Update
                best_next_action = np.argmax(Q[next_state])
                td_target = reward + discount_factor * Q[next_state][best_next_action]
                td_delta = td_target - Q[state][action]
                Q[state][action] += alpha * td_delta

                # done is True if the episode has terminated
                if done:
                    break

                state = next_state

        return Q, stats

    Step #5: Train the model.

    Q, stats = qLearning(env, 1000)
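
    Once training has finished, the learned behavior can be read straight out of the Q table. The short sketch below (using only names defined above) derives the greedy action for every state visited during training:

    # Derive the greedy policy from the learned Q table: for each visited state,
    # the greedy action is simply the index of the largest Q-value.
    greedy_policy = {state: int(np.argmax(action_values))
                     for state, action_values in Q.items()}
    print(len(greedy_policy))    # number of distinct states visited during training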

    Step #6: Plot important statistics.

    plotting.plot_episode_stats(stats)

    Output:
    We can see in the episode-reward-over-time plot that the episode rewards gradually increase over time and ultimately level off at a high value per episode, which indicates that the agent has learned to maximize the total reward earned in an episode by behaving optimally in every state.