Reinforcement Learning, briefly: it is a paradigm of the learning process in which a learning agent learns, over time, to behave optimally in a certain environment by continuously interacting with that environment. During the course of learning, the agent experiences various situations in the environment it is in; these are called states. While in a state, the agent may choose from a set of allowed actions, which may yield different rewards (or penalties). Over time, the learning agent learns to maximise these rewards in order to behave optimally in any given state.
Q-Learning is a basic form of Reinforcement Learning that uses Q-values (also called Action Values) to iteratively improve the behavior of the learning agent.
- Q-Values or Action Values: Q-values are defined for states and actions. Q(S, A) is an estimate of how good it is to take the action A in the state S. This estimate of Q(S, A) will be iteratively computed using the TD-Update rule, which we will see in the following sections.
- Rewards and episodes: An agent, over the course of its lifetime, starts from a start state and makes a number of transitions from its current state to a next state, based on its choice of action and also on the environment it is interacting with. At every step of the transition, the agent takes an action from a state, observes a reward from the environment, and then transits to another state. If at any point in time the agent ends up in one of the terminating states, no further transitions are possible. This is said to be the completion of an episode.
- Temporal Difference or TD-Update:
The Temporal Difference rule, or TD-Update, can be represented as follows:

Q(S, A) ← Q(S, A) + α · (R + γ · Q(S′, A′) − Q(S, A))

This update rule for estimating the value of Q is applied at every time step of the agent's interaction with the environment. The terms used are explained below; a small worked example follows the list:
- S: the current state of the agent.
- A: the current action, picked according to some policy.
- S′: the next state where the agent ends up.
- A′: the next best action to be picked using the current Q-value estimate, i.e. the action with the highest Q-value in the next state.
- R: the current reward observed from the environment in response to the current action.
- γ (> 0 and ≤ 1): the discounting factor for future rewards. Future rewards are less valuable than current rewards, so they must be discounted. Since the Q-value is an estimate of the expected reward from a state, the discounting rule applies here as well.
- α: the length of the step taken to update the estimate of Q(S, A).
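To make the update rule concrete, here is a minimal sketch of a single TD-Update step in Python with purely illustrative numbers (the Q-values, α, γ and reward below are made up for the example, not taken from this article):

```python
import numpy as np

# Illustrative Q-table for two states and two actions (values are made up).
Q = {
    "S":  np.array([0.0, 0.5]),   # Q(S, a0), Q(S, a1)
    "S'": np.array([0.2, 1.0]),   # Q(S', a0), Q(S', a1)
}

alpha, gamma = 0.6, 0.9   # step size and discount factor (example values)
action, reward = 1, 1.0   # action taken in S and reward observed from the environment

# TD-Update: Q(S, A) <- Q(S, A) + alpha * (R + gamma * max_a Q(S', a) - Q(S, A))
td_target = reward + gamma * np.max(Q["S'"])
Q["S"][action] += alpha * (td_target - Q["S"][action])

print(Q["S"][action])   # 0.5 + 0.6 * (1.0 + 0.9 * 1.0 - 0.5) = 1.34
```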
- Choosing the action to take using the ε-greedy policy:
The ε-greedy policy is a very simple policy for choosing actions using the current Q-value estimates. It goes as follows:
  - With probability (1 − ε), choose the action that has the highest Q-value.
  - With probability ε, choose any action at random.
Now that the whole theory is in hand, let’s look at an example. We will be using the OpenAI gym to train our Q-Learning model.
Command to install gym:
pip install gym
Before starting with the example, you will need some helper code to visualise the algorithms. There will be two helper files that need to be downloaded to the working directory. You can find the files here.
Step # 1: Import required libraries.
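The original listing is not reproduced here; a plausible set of imports, assuming the two helper files are a grid-world environment (windy_gridworld.py) and a plotting utility (plotting.py), both of which are assumed names rather than names given in this article, might look like this:

```python
import itertools
from collections import defaultdict

import gym
import matplotlib
import matplotlib.style
import numpy as np

# Helper files assumed to be in the working directory (file names are assumptions):
# windy_gridworld.py defines the grid-world environment,
# plotting.py provides EpisodeStats and plot_episode_stats utilities.
from windy_gridworld import WindyGridworldEnv
import plotting

matplotlib.style.use('ggplot')
```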
Step # 2: Create a gym environment.
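A minimal sketch, assuming the environment class comes from the helper file imported above:

```python
# Create an instance of the grid-world environment (a gym-style environment).
env = WindyGridworldEnv()
```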
Step # 3: Make the ε-greedy policy.
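One way to implement the ε-greedy policy described in the theory section is as a factory function that returns, for a given state, a probability distribution over the actions; the function name and signature below are illustrative choices:

```python
def create_epsilon_greedy_policy(Q, epsilon, num_actions):
    """Create an epsilon-greedy policy based on a given Q-function and epsilon.

    Returns a function that takes a state as input and returns a
    probability distribution over the actions.
    """
    def policy_function(state):
        # Every action gets a base probability of epsilon / num_actions ...
        action_probabilities = np.ones(num_actions, dtype=float) * epsilon / num_actions
        # ... and the greedy (highest Q-value) action receives the remaining 1 - epsilon.
        best_action = np.argmax(Q[state])
        action_probabilities[best_action] += (1.0 - epsilon)
        return action_probabilities

    return policy_function
```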
Step # 4: Build the Q-Learning model.
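Below is a sketch of the Q-Learning routine built around the TD-Update rule from the theory section. It assumes the classic gym step API (next_state, reward, done, info) and the EpisodeStats helper from the assumed plotting module; the function name and default hyperparameters are illustrative choices:

```python
def q_learning(env, num_episodes, discount_factor=1.0, alpha=0.6, epsilon=0.1):
    """Q-Learning: off-policy TD control.

    Finds the optimal greedy policy while behaving according to an
    epsilon-greedy policy.
    """
    # Action-value function: maps a state to an array of action values, initialised to zeros.
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    # Keeps track of episode lengths and rewards for later plotting
    # (EpisodeStats is assumed to be provided by the plotting helper).
    stats = plotting.EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes))

    # Behaviour policy: epsilon-greedy with respect to the current Q estimates.
    policy = create_epsilon_greedy_policy(Q, epsilon, env.action_space.n)

    for i_episode in range(num_episodes):
        state = env.reset()

        for t in itertools.count():
            # Sample an action from the epsilon-greedy distribution.
            action_probabilities = policy(state)
            action = np.random.choice(
                np.arange(len(action_probabilities)), p=action_probabilities)

            # Take the action and observe the reward and the next state.
            next_state, reward, done, _ = env.step(action)

            # Update statistics for this episode.
            stats.episode_rewards[i_episode] += reward
            stats.episode_lengths[i_episode] = t

            # TD-Update: Q(S, A) <- Q(S, A) + alpha * (R + gamma * Q(S', A') - Q(S, A))
            best_next_action = np.argmax(Q[next_state])
            td_target = reward + discount_factor * Q[next_state][best_next_action]
            Q[state][action] += alpha * (td_target - Q[state][action])

            if done:
                break
            state = next_state

    return Q, stats
```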
Step # 5: Train the model.
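Training then amounts to calling the routine for a number of episodes; 1000 episodes is an illustrative figure, not one taken from this article:

```python
# Train the agent; Q is the learned action-value table, stats holds the per-episode data.
Q, stats = q_learning(env, 1000)
```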
Step # 6: Compile important statistics.
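Assuming the plotting helper exposes a plot_episode_stats function (an assumption about the helper file rather than something stated in this article), the statistics gathered during training can be visualised like this:

```python
# Plot the episode length and the episode reward over time.
plotting.plot_episode_stats(stats)
```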
Output:
We can see in the Episode Reward over Time plot that the episode rewards progressively increase over time and eventually level off at a high value per episode, which indicates that the agent has learned to maximise the total reward earned in an episode by behaving optimally in every state.