Reinforcement Learning, briefly, is a paradigm of the learning process in which a learning agent learns, over time, to behave optimally in a certain environment by interacting with it continuously. During the course of learning, the agent experiences various situations in the environment it is in; these situations are called states. While in a given state, the agent may choose from a set of allowable actions, which may fetch different rewards (or penalties). Over time, the learning agent learns to maximize these rewards so as to behave optimally in whatever state it finds itself in.
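The interaction loop described above can be sketched with a toy, self-contained environment (the corridor, its actions, and its rewards are all invented here for illustration and are unrelated to the gym example that follows):

```python
import random

class CorridorEnv:
    """Toy environment: the agent starts at cell 0 and must reach cell 4.
    Actions: 0 = left, 1 = right. Every step costs a reward of -1."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # move left or right, staying inside the corridor [0, 4]
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        reward = -1
        done = (self.state == 4)  # episode ends when the goal cell is reached
        return self.state, reward, done

env = CorridorEnv()
state = env.reset()
total_reward = 0
done = False
while not done:
    action = random.choice([0, 1])          # a random policy, for now
    state, reward, done = env.step(action)  # environment returns next state and reward
    total_reward += reward
```

A learning agent would replace the random policy with one that, over many episodes, maximizes `total_reward`; that is exactly what Q-Learning does below.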
Q-Learning is a basic form of Reinforcement Learning that uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.
The Temporal Difference rule, or TD-Update, can be represented as follows:

Q(S, A) ← Q(S, A) + α (R + γ max_a Q(S', a) − Q(S, A))

This update rule for estimating the value of Q is applied at every time step of the agent's interaction with the environment. The terms used are explained below:
- S: the current state of the agent.
- A: the current action, picked according to some policy.
- S': the next state the agent ends up in.
- R: the reward observed from the environment in response to the action.
- α: the learning rate, i.e. the step length taken to update the estimate of Q(S, A).
- γ: the discount factor for future rewards.
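A single TD-Update worked through with made-up numbers (purely illustrative, not taken from the gym example):

```python
alpha = 0.5        # learning rate
gamma = 0.9        # discount factor
reward = 2.0       # reward R observed after taking action A in state S
q_sa = 1.0         # current estimate Q(S, A)
max_q_next = 3.0   # best action value in the next state, max over a of Q(S', a)

td_target = reward + gamma * max_q_next  # 2.0 + 0.9 * 3.0 = 4.7
td_delta = td_target - q_sa              # 4.7 - 1.0 = 3.7
q_sa += alpha * td_delta                 # 1.0 + 0.5 * 3.7 = 2.85
```

The estimate is nudged a fraction α of the way toward the TD target; with α = 1 it would jump all the way, with α = 0 it would never learn.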
ε-greedy policy is a very simple policy for choosing actions using the current Q-value estimates. It works as follows: the action with the highest estimated Q-value is chosen with probability (1 − ε) + ε/|A|, and every other action with probability ε/|A|, where |A| is the number of possible actions. In other words, the agent exploits the best known action most of the time, but explores a random action with probability ε.
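As a quick numeric illustration of these probabilities (the Q-values below are made up):

```python
import numpy as np

epsilon = 0.1
num_actions = 4
q_values = np.array([0.0, 2.0, 1.0, -1.0])  # illustrative Q-values for one state

# every action receives a probability mass of epsilon / num_actions...
probs = np.ones(num_actions) * epsilon / num_actions
# ...and the current best (greedy) action receives the remaining 1 - epsilon
probs[np.argmax(q_values)] += 1.0 - epsilon

print(probs)  # the greedy action (index 1) gets 0.925, the others 0.025 each
```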
Now that the whole theory is in hand, let's look at an example. We will be using the OpenAI gym to train our Q-Learning model.
Command to install gym:

pip install gym
Before starting with the example, you will need some supporting code to visualize the algorithms. There are two auxiliary files to download into the working directory. You can find the files here.
Step # 1: Import required libraries.
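The original import block for this step did not survive, so here is a reconstruction inferred from the names used in the later steps. The module names windy_gridworld and plotting are assumptions; they must match the two auxiliary files downloaded into the working directory.

```python
import gym
import itertools
import numpy as np

from collections import defaultdict

# the two auxiliary files from the working directory
# (module names assumed from the code below)
from windy_gridworld import WindyGridworldEnv
import plotting
```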

Step # 2: Create a gym environment.

env = WindyGridworldEnv()
Step # 3: Make the epsilon-greedy policy.

def createEpsilonGreedyPolicy(Q, epsilon, num_actions):
    """
    Creates an epsilon-greedy policy based
    on a given Q-function and epsilon.

    Returns a function that takes the state
    as input and returns the probabilities
    for each action as an array of length
    of the action space (the set of possible actions).
    """
    def policyFunction(state):

        Action_probabilities = np.ones(num_actions,
                                       dtype=float) * epsilon / num_actions

        best_action = np.argmax(Q[state])
        Action_probabilities[best_action] += (1.0 - epsilon)
        return Action_probabilities

    return policyFunction
Step # 4: Build the Q-Learning model.

def qLearning(env, num_episodes, discount_factor=1.0,
              alpha=0.6, epsilon=0.1):
    """
    Q-Learning algorithm: Off-policy TD control.
    Finds the optimal greedy policy while improving
    by following an epsilon-greedy policy.
    """
    # Action value function.
    # A nested dictionary that maps
    # state -> (action -> action-value).
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    # Keeps track of useful statistics
    stats = plotting.EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes))

    # Create an epsilon-greedy policy function
    # appropriate for the environment's action space
    policy = createEpsilonGreedyPolicy(Q, epsilon, env.action_space.n)

    # For every episode
    for ith_episode in range(num_episodes):

        # Reset the environment and pick the first action
        state = env.reset()

        for t in itertools.count():

            # get probabilities of all actions from the current state
            action_probabilities = policy(state)

            # choose an action according to
            # the probability distribution
            action = np.random.choice(np.arange(
                len(action_probabilities)),
                p=action_probabilities)

            # take the action and get the reward, transit to the next state
            next_state, reward, done, _ = env.step(action)

            # Update statistics
            stats.episode_rewards[ith_episode] += reward
            stats.episode_lengths[ith_episode] = t

            # TD Update
            best_next_action = np.argmax(Q[next_state])
            td_target = reward + discount_factor * Q[next_state][best_next_action]
            td_delta = td_target - Q[state][action]
            Q[state][action] += alpha * td_delta

            # done is True if the episode terminated
            if done:
                break

            state = next_state

    return Q, stats
Step # 5: Train the model.

Q, stats = qLearning(env, 1000)
Step # 6: Compile important statistics.
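The code for this step was not preserved. Assuming the auxiliary plotting file (which already supplied the EpisodeStats record used in Step # 4) also provides a plot_episode_stats helper, the statistics gathered during training can be visualized as:

```python
plotting.plot_episode_stats(stats)
```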

Output:
We can see from the episode-reward-over-time plot that the episode rewards gradually increase over time and ultimately level off at a high value per episode, indicating that the agent has learned to maximize its total reward earned in an episode by behaving optimally in every state.