
Q-Learning in Python

Reinforcement Learning, briefly, is a learning paradigm in which a learning agent learns over time to behave optimally in a certain environment by interacting with it continuously. During the learning process the agent experiences various situations in the environment; these are called states. While in a state, the agent can choose from a set of allowable actions, which can yield different rewards (or penalties). Over time, the learning agent learns to maximize these rewards so as to behave optimally in any given state.

Q-Learning is a basic form of Reinforcement Learning that uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.

  1. Q-values or action values: Q-values are defined for states and actions. Q(S, A) is an estimate of how good it is to take the action A in the state S. This estimate is iteratively refined using the TD-update rule, which we will see in the following sections.
  2. Rewards and episodes: During its life cycle, an agent starts from an initial state and makes a series of transitions from its current state to a next state, based on its choice of action and on the environment it interacts with. At every step of a transition, the agent takes an action from its state, observes a reward from the environment, and then moves to another state. If at any point in time the agent ends up in one of the terminal states, no further transition is possible. This is said to be the end of the episode.
  3. Temporal Difference or TD update:

    The Temporal Difference rule, or TD update, can be represented as follows:

    Q(S, A) ← Q(S, A) + α (R + γ Q(S′, A′) − Q(S, A))

    This update rule for the Q-value estimate is applied at every time step of the agent's interaction with the environment. The terms used are explained below:

    • S: the current state of the agent.
    • A: the current action, selected by some policy.
    • S′: the next state the agent ends up in.
    • A′: the next best action, chosen using the current Q-value estimates, i.e. the action with the highest Q-value in the next state.
    • R: the current reward observed from the environment in response to the current action.
    • γ (> 0 and ≤ 1): the discount factor for future rewards. Future rewards are less valuable than current rewards, so they must be discounted. Since the Q-value is an estimate of the expected discounted reward from a state, discounting applies here as well.
    • α: the step size (learning rate) used to update the estimate Q(S, A).
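To make the rule concrete, here is a single TD update worked through with made-up numbers; α = 0.5, γ = 0.9 and the Q-values are arbitrary choices for illustration only:

```python
# Hypothetical values chosen purely for illustration.
alpha = 0.5        # step size
gamma = 0.9        # discount factor
reward = 1.0       # R, observed after taking A in S
q_sa = 0.0         # current estimate Q(S, A)
q_next_best = 2.0  # Q(S', A'), the best Q-value in the next state

td_target = reward + gamma * q_next_best  # 1.0 + 0.9 * 2.0 = 2.8
td_delta = td_target - q_sa               # 2.8 - 0.0 = 2.8
q_sa = q_sa + alpha * td_delta            # 0.0 + 0.5 * 2.8 = 1.4
print(q_sa)  # 1.4
```

Note how the estimate moves only halfway (α = 0.5) toward the TD target, which keeps the update stable when rewards are noisy.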
  4. Choosing an action with the ε-greedy policy:

    The ε-greedy policy is a very simple policy for choosing actions using the current Q-value estimates. It works as follows:

    • With probability (1 − ε), select the action with the highest Q-value.
    • With probability ε, select an action at random.
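A minimal sketch of this selection rule; the function name and the use of NumPy's `Generator` API are my own choices for illustration, not part of the article's code:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=None):
    # With probability epsilon, explore: pick a uniformly random action;
    # otherwise exploit: pick the action with the highest Q-value.
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

q = [0.1, 0.5, 0.2]
print(epsilon_greedy_action(q, epsilon=0.0))  # epsilon = 0 is purely greedy -> 1
```

With ε = 0 the policy always exploits; with ε = 1 it always explores. Typical values such as ε = 0.1 strike a balance between the two.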
  5. Now that we have all the theory in hand, let's look at an example. We will use OpenAI Gym to train our Q-Learning model.

    Command to install gym

     pip install gym 

    Before starting with the example, you will need some supporting code to visualize the algorithm. There are two auxiliary files to download into the working directory. You can find the files here.

    Step # 1: Import required libraries.

    import gym
    import itertools
    import matplotlib
    import numpy as np
    import pandas as pd
    import sys

    from collections import defaultdict
    from windy_gridworld import WindyGridworldEnv
    import plotting

    matplotlib.style.use('ggplot')

    Step # 2: Create a gym environment.

    env = WindyGridworldEnv()

    Step # 3: Make the epsilon-greedy policy.

    def createEpsilonGreedyPolicy(Q, epsilon, num_actions):
        """
        Creates an epsilon-greedy policy based on
        a given Q-function and epsilon.

        Returns a function that takes a state
        as input and returns the probabilities
        of each action as an array whose length
        equals the size of the action space
        (the set of possible actions).
        """
        def policyFunction(state):
            action_probabilities = np.ones(num_actions,
                                           dtype=float) * epsilon / num_actions
            best_action = np.argmax(Q[state])
            action_probabilities[best_action] += (1.0 - epsilon)
            return action_probabilities

        return policyFunction
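As a quick sanity check, the policy function can be exercised on a toy Q-table. The snippet below repeats the same logic so it runs on its own; the state name and Q-values are illustrative:

```python
import numpy as np
from collections import defaultdict

def create_epsilon_greedy_policy(Q, epsilon, num_actions):
    # Same logic as createEpsilonGreedyPolicy above, repeated
    # here so the snippet is self-contained.
    def policy_function(state):
        probs = np.ones(num_actions, dtype=float) * epsilon / num_actions
        probs[np.argmax(Q[state])] += 1.0 - epsilon
        return probs
    return policy_function

Q = defaultdict(lambda: np.zeros(4))
Q["s0"] = np.array([0.0, 1.0, 0.0, 0.0])

policy = create_epsilon_greedy_policy(Q, epsilon=0.1, num_actions=4)
probs = policy("s0")
print(probs)        # roughly [0.025 0.925 0.025 0.025]
print(probs.sum())  # the probabilities always sum to 1
```

Each action keeps a base probability of ε / num_actions, and the greedy action receives the remaining (1 − ε) mass on top.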

    Step # 4: Build the Q-Learning model.

    def qLearning(env, num_episodes, discount_factor=1.0,
                  alpha=0.6, epsilon=0.1):
        """
        Q-Learning algorithm: off-policy TD control.
        Finds the optimal greedy policy while
        following an epsilon-greedy policy.
        """

        # Action value function.
        # A nested dictionary that maps
        # state -> (action -> action value).
        Q = defaultdict(lambda: np.zeros(env.action_space.n))

        # Keeps track of useful statistics
        stats = plotting.EpisodeStats(
            episode_lengths=np.zeros(num_episodes),
            episode_rewards=np.zeros(num_episodes))

        # Create an epsilon-greedy policy function
        # appropriate for the environment's action space
        policy = createEpsilonGreedyPolicy(Q, epsilon, env.action_space.n)

        # For every episode
        for ith_episode in range(num_episodes):

            # Reset the environment and pick the first state
            state = env.reset()

            for t in itertools.count():

                # Get the probabilities of all actions from the current state
                action_probabilities = policy(state)

                # Choose an action according to
                # the probability distribution
                action = np.random.choice(np.arange(
                    len(action_probabilities)),
                    p=action_probabilities)

                # Take the action, observe the reward and the next state
                next_state, reward, done, _ = env.step(action)

                # Update statistics
                stats.episode_rewards[ith_episode] += reward
                stats.episode_lengths[ith_episode] = t

                # TD update
                best_next_action = np.argmax(Q[next_state])
                td_target = reward + discount_factor * Q[next_state][best_next_action]
                td_delta = td_target - Q[state][action]
                Q[state][action] += alpha * td_delta

                # done is True if the episode has ended
                if done:
                    break

                state = next_state

        return Q, stats

    Step # 5: Train the model.

    Q, stats = qLearning(env, 1000)

    Step # 6: Plot important statistics.

    plotting.plot_episode_stats (stats)

    In the episode-reward-over-time plot, we can see that the per-episode reward gradually increases and eventually flattens out at a high value, indicating that the agent has learned to maximize the total reward collected in an episode by behaving optimally in every state.
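Once training is done, one often wants the final greedy policy itself rather than just the plots. A possible sketch, using a toy Q-table in place of the trained one returned by qLearning (state names and values are made up), is:

```python
import numpy as np
from collections import defaultdict

# Toy Q-table standing in for the learned one; with a real trained Q
# from qLearning you would iterate over Q.items() in the same way.
Q = defaultdict(lambda: np.zeros(2))
Q["left_state"] = np.array([0.2, 0.8])
Q["right_state"] = np.array([0.9, 0.1])

# The greedy policy simply picks the argmax action in every visited state.
greedy_policy = {state: int(np.argmax(values)) for state, values in Q.items()}
print(greedy_policy)  # {'left_state': 1, 'right_state': 0}
```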

