  # Q-Learning in Python


Reinforcement Learning, briefly, is a paradigm of the learning process in which a learning agent learns, over time, to behave optimally in a certain environment by interacting with it continuously. During the learning process, the agent experiences various situations in the environment it finds itself in; these are called states. While in a given state, the agent can choose from a set of allowable actions, which may yield different rewards (or penalties). Over time, the learning agent learns to maximize these rewards so as to behave optimally in any given state.

Q-Learning is a basic form of Reinforcement Learning that uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.

1. Q-values or action values: Q-values are defined for states and actions. Q(S, A) is an estimate of how good it is to take action A in state S. This estimate is iteratively refined using the TD-update rule, which we will see in the following sections.
2. Rewards and episodes: Over its lifetime, an agent starts from an initial state and makes a series of transitions from its current state to a next state, based on its choice of action and on the environment it interacts with. At every step of a transition, the agent takes an action from its state, observes a reward from the environment, and then moves to another state. If, at any point in time, the agent lands in one of the terminal states, no further transitions are possible; this is called the end of an episode.
3. Temporal Difference or TD update:

The Temporal Difference or TD-update rule can be written as follows:

Q(S, A) ← Q(S, A) + α * (R + γ * Q(S', A') - Q(S, A))

This update rule for estimating the Q-value is applied at every time step of the agent's interaction with the environment. The terms used are explained below:

• S: the current state of the agent.
• A: the current action, picked according to some policy.
• S': the next state the agent ends up in.
• A': the next best action, chosen using the current Q-value estimate, i.e. the action with the highest Q-value in the next state.
• R: the current reward observed from the environment in response to the current action.
• γ (> 0 and <= 1): the discount factor for future rewards. Future rewards are less valuable than current rewards, so they must be discounted. Since the Q-value is an estimate of the expected reward from a state, the discounting rule applies here as well.
• α: the step length (learning rate) taken to update the estimate of Q(S, A).
4. Choosing an action with the ε-greedy policy: the ε-greedy policy is a very simple policy for choosing actions using the current Q-value estimates. It works as follows:

• With probability (1 - ε), choose the action with the highest Q-value.
• With probability ε, choose any action uniformly at random.
5. Now that the whole theory is in hand, let's look at an example. We will use the OpenAI Gym toolkit to train our Q-Learning model.
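Before moving to the full example, here is a quick sanity check of the TD-update rule from point 3, with made-up numbers (the states, Q-values, and reward below are purely illustrative, not from any real environment):

```python
import numpy as np

alpha = 0.6     # step length (learning rate)
gamma = 1.0     # discount factor
reward = -1.0   # reward observed after taking action A in state S

# Toy Q-table: current estimates for state S and next state S'
Q = {
    "S":  np.array([0.0, 0.0]),
    "S'": np.array([2.0, 3.0]),
}
action = 0  # the action A that was taken in S

# TD update: Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a Q(S',a) - Q(S,A))
td_target = reward + gamma * np.max(Q["S'"])   # -1 + 1.0 * 3 = 2.0
td_delta = td_target - Q["S"][action]          # 2.0 - 0.0 = 2.0
Q["S"][action] += alpha * td_delta             # 0.0 + 0.6 * 2.0 = 1.2

print(Q["S"][action])  # 1.2
```

Note how the estimate moves only a fraction α of the way toward the TD target, which keeps the updates stable when rewards are noisy.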

Command to install `gym`:

```shell
pip install gym
```

Before starting with the example, you will need some helper code to visualize the algorithms. Two auxiliary files, `windy_gridworld.py` and `plotting.py`, must be downloaded to the working directory. You can find the files here.

Step #1: Import the required libraries.

```python
import gym
import itertools
import matplotlib
import matplotlib.style
import numpy as np
import pandas as pd
import sys

from collections import defaultdict
from windy_gridworld import WindyGridworldEnv
import plotting

matplotlib.style.use('ggplot')
```

Step #2: Create the gym environment.

```python
env = WindyGridworldEnv()
```

Step #3: Make the epsilon-greedy policy.

```python
def createEpsilonGreedyPolicy(Q, epsilon, num_actions):
    """
    Creates an epsilon-greedy policy based
    on a given Q-function and epsilon.

    Returns a function that takes the state
    as input and returns the probabilities
    for each action as an array whose length is
    the size of the action space (set of possible actions).
    """
    def policyFunction(state):
        # Spread probability mass epsilon uniformly over all actions
        Action_probabilities = np.ones(num_actions,
                                       dtype=float) * epsilon / num_actions

        # Put the remaining (1 - epsilon) mass on the best action
        best_action = np.argmax(Q[state])
        Action_probabilities[best_action] += (1.0 - epsilon)
        return Action_probabilities

    return policyFunction
```
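To see what the epsilon-greedy policy from Step #3 actually returns, here is a small self-contained check; the Q-values and the state key `"s0"` are made up for illustration:

```python
import numpy as np
from collections import defaultdict

# Same structure as the Step #3 helper: epsilon mass spread
# uniformly, the remaining (1 - epsilon) on the best action
def createEpsilonGreedyPolicy(Q, epsilon, num_actions):
    def policyFunction(state):
        probs = np.ones(num_actions, dtype=float) * epsilon / num_actions
        probs[np.argmax(Q[state])] += 1.0 - epsilon
        return probs
    return policyFunction

Q = defaultdict(lambda: np.zeros(4))
Q["s0"] = np.array([0.1, 0.5, 0.2, 0.0])   # action 1 currently looks best
policy = createEpsilonGreedyPolicy(Q, epsilon=0.1, num_actions=4)

print(policy("s0"))  # [0.025 0.925 0.025 0.025]
```

With ε = 0.1 and 4 actions, each action gets a base probability of 0.025 and the current best action gets the extra 0.9, so exploration never fully stops.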

Step #4: Build the Q-Learning model.

```python
def qLearning(env, num_episodes, discount_factor=1.0,
              alpha=0.6, epsilon=0.1):
    """
    Q-Learning algorithm: Off-policy TD control.
    Finds the optimal greedy policy while improving,
    following an epsilon-greedy policy.
    """

    # Action value function
    # A nested dictionary that maps
    # state -> (action -> action-value)
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    # Keeps track of useful statistics
    stats = plotting.EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes))

    # Create an epsilon-greedy policy function
    # appropriate for the environment's action space
    policy = createEpsilonGreedyPolicy(Q, epsilon, env.action_space.n)

    # For every episode
    for ith_episode in range(num_episodes):

        # Reset the environment and pick the first state
        state = env.reset()

        for t in itertools.count():

            # Get probabilities of all actions from the current state
            action_probabilities = policy(state)

            # Choose an action according to
            # the probability distribution
            action = np.random.choice(
                np.arange(len(action_probabilities)),
                p=action_probabilities)

            # Take the action and observe the reward and next state
            next_state, reward, done, _ = env.step(action)

            # Update statistics
            stats.episode_rewards[ith_episode] += reward
            stats.episode_lengths[ith_episode] = t

            # TD Update
            best_next_action = np.argmax(Q[next_state])
            td_target = reward + discount_factor * Q[next_state][best_next_action]
            td_delta = td_target - Q[state][action]
            Q[state][action] += alpha * td_delta

            # done is True if the episode has terminated
            if done:
                break

            state = next_state

    return Q, stats
```

Step #5: Train the model.

```python
Q, stats = qLearning(env, 1000)
```
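Once training has finished, the learned Q-table can be collapsed into a deterministic greedy policy by taking the argmax over actions in each visited state. A minimal sketch with toy values (the state keys and Q-values below are hypothetical, not the actual WindyGridworld states):

```python
import numpy as np
from collections import defaultdict

# Suppose Q is the table returned by qLearning (here filled with toy values)
Q = defaultdict(lambda: np.zeros(4))
Q[(0, 0)] = np.array([0.0, 1.5, -0.5, 0.2])
Q[(0, 1)] = np.array([2.0, 0.1, 0.0, 0.0])

# Greedy policy: in each state, pick the action with the highest Q-value
greedy_policy = {state: int(np.argmax(values)) for state, values in Q.items()}

print(greedy_policy)  # {(0, 0): 1, (0, 1): 0}
```

This is the "optimal greedy policy" the docstring in Step #4 refers to: epsilon-greedy exploration is only needed during training.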

Step #6: Plot important statistics.

```python
plotting.plot_episode_stats(stats)
```

Output:
In the episode-reward-over-time graph, we can see that the reward per episode gradually increases and eventually levels off at a high value, indicating that the agent has learned to maximize the total reward earned in an episode by behaving optimally in every state.