SARSA Reinforcement Learning

The SARSA algorithm is a small variation of the popular Q-Learning algorithm. In any reinforcement learning algorithm, the agent's learning can be of one of two types:

  1. On-policy: the agent learns the value function from actions chosen according to the policy it is currently following.
  2. Off-policy: the agent learns the value function from actions chosen according to a different policy.

Q-Learning is an off-policy technique and uses a greedy target to learn the Q-value. SARSA, on the other hand, is an on-policy technique and uses the action actually taken by the current policy to update the Q-value.

This difference is visible in the update rule for each method:

  1. Q-Learning: Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
  2. SARSA: Q(s, a) <- Q(s, a) + alpha * [r + gamma * Q(s', a') - Q(s, a)]

The update equation for SARSA depends on the current state, the current action, the reward received, the next state, and the next action. This gives the technique its name: SARSA stands for State-Action-Reward-State-Action, corresponding to the tuple (s, a, r, s', a').
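To see what a single SARSA update does numerically, here is a small sketch using made-up values for the reward and the two Q-estimates (alpha = 0.85 and gamma = 0.95, matching the parameters used later in this article):

```python
# One SARSA update with illustrative (made-up) numbers
alpha, gamma = 0.85, 0.95  # learning rate and discount factor
q_sa = 0.5                 # current estimate Q(s, a)
reward = 1.0               # reward r received after taking a in s
q_s2a2 = 0.2               # estimate Q(s', a') for the action the policy actually chose

target = reward + gamma * q_s2a2       # r + gamma * Q(s', a') = 1.19
q_sa = q_sa + alpha * (target - q_sa)  # move Q(s, a) toward the target
print(q_sa)                            # 0.5 + 0.85 * 0.69 = 1.0865
```

Note that the estimate moves only a fraction (alpha) of the way toward the target, which keeps learning stable under noisy rewards.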

The following Python code demonstrates how to implement the SARSA algorithm, using the OpenAI gym module to load the environment.

Step 1: Import the required libraries

import numpy as np
import gym

Step 2: Create Environment

Here we will use the FrozenLake-v0 environment preloaded in gym. You can read about the environment in the gym documentation.

# Create the environment
env = gym.make('FrozenLake-v0')

Step 3: Initialize various parameters

# Define the different parameters
epsilon = 0.9           # exploration rate
total_episodes = 10000  # number of training episodes
max_steps = 100         # maximum steps per episode
alpha = 0.85            # learning rate
gamma = 0.95            # discount factor

# Q-matrix initialization
Q = np.zeros((env.observation_space.n, env.action_space.n))
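For FrozenLake-v0, `env.observation_space.n` is 16 (a 4x4 grid of states) and `env.action_space.n` is 4 (left, down, right, up), so the same table can be built without the gym dependency as a quick sanity check:

```python
import numpy as np

# FrozenLake-v0: 16 states (a 4x4 grid), 4 actions (left, down, right, up).
# The Q-matrix starts as a 16x4 array of zeros, one row per state.
Q = np.zeros((16, 4))
print(Q.shape)  # (16, 4)
```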

Step 4: Define the utility functions to be used in the learning process

# Function to choose the next action (epsilon-greedy)
def choose_action(state):
    action = 0
    if np.random.uniform(0, 1) < epsilon:
        action = env.action_space.sample()   # explore: random action
    else:
        action = np.argmax(Q[state, :])      # exploit: best known action
    return action

# Function to update the Q-value
def update(state, state2, reward, action, action2):
    predict = Q[state, action]
    target = reward + gamma * Q[state2, action2]
    Q[state, action] = Q[state, action] + alpha * (target - predict)
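For contrast, the off-policy Q-Learning update described earlier replaces Q(s', a') in the target with the maximum Q-value over all actions in the next state. A hypothetical `update_q_learning` (not part of the original code) might look like:

```python
import numpy as np

# Hypothetical Q-Learning counterpart of update(): the target uses the
# max over next-state actions instead of the action the policy chose.
def update_q_learning(Q, state, state2, reward, action, alpha=0.85, gamma=0.95):
    predict = Q[state, action]
    target = reward + gamma * np.max(Q[state2, :])  # greedy target
    Q[state, action] = predict + alpha * (target - predict)
```

Because the target ignores which action the exploring policy actually took next, Q-Learning learns the value of the greedy policy even while behaving exploratorily.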

Step 5: Train the Learning Agent

# Initializing the total reward
reward = 0

# Starting the SARSA learning
for episode in range(total_episodes):
    t = 0
    state1 = env.reset()
    action1 = choose_action(state1)

    while t < max_steps:
        # Visualizing the training
        env.render()

        # Getting the next state
        state2, r, done, info = env.step(action1)

        # Choosing the next action
        action2 = choose_action(state2)

        # Learning the Q-value
        update(state1, state2, r, action1, action2)

        state1 = state2
        action1 = action2

        # Updating the respective values
        t += 1
        reward += r

        # If at the end of the episode
        if done:
            break

In the output rendered by env.render(), the red mark shows the agent's current position in the environment, while the direction shown in parentheses indicates the move the agent will take next. Note that the agent stays in place if it tries to move off the grid.

Step 6: Evaluate performance

# Evaluating the performance
print("Performance:", reward / total_episodes)

# Visualizing the Q-matrix
print(Q)
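Beyond the average reward above, the learned Q-matrix can also be evaluated by acting purely greedily (always argmax, no exploration). A minimal sketch, assuming the `env` and `Q` from the previous steps and the old gym step API used in this article:

```python
import numpy as np

# Run the agent greedily (no exploration) and report the success rate.
def evaluate_greedy(env, Q, episodes=100, max_steps=100):
    successes = 0
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            action = np.argmax(Q[state, :])            # always exploit
            state, reward, done, info = env.step(action)
            if done:
                successes += reward  # FrozenLake: reward 1 only at the goal
                break
    return successes / episodes
```

Evaluating without exploration separates the quality of the learned values from the noise that epsilon-greedy behavior adds during training.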