Explanation of the fundamental functions involved in the A3C algorithm



While any implementation of the Asynchronous Advantage Actor-Critic (A3C) algorithm is necessarily complex, all implementations have one thing in common: the presence of a global network and a Worker class.

  1. The global network class: contains all the TensorFlow operations needed to build the neural networks.
  2. The Worker class: simulates the training process of a single worker, each of which has its own copy of the environment and a "personal" neural network.

The following implementation requires these modules:

  1. Numpy
  2. Tensorflow
  3. Multiprocessing

    The following lines of code show the basic functionality required to create the corresponding classes.

    The global network class:

    # Network class definition

    class AC_Network():

    The following lines describe the member functions of the class defined above.

    Class initialization:

    # Class initialization

    def __init__(self, s_size, a_size, scope, trainer):
        with tf.variable_scope(scope):

            # Input and visual encoding layers
            self.inputs = tf.placeholder(shape=[None, s_size],
                                         dtype=tf.float32)
            self.imageIn = tf.reshape(self.inputs, shape=[-1, 84, 84, 1])
            self.conv1 = slim.conv2d(activation_fn=tf.nn.elu,
                                     inputs=self.imageIn, num_outputs=16,
                                     kernel_size=[8, 8],
                                     stride=[4, 4], padding='VALID')
            self.conv2 = slim.conv2d(activation_fn=tf.nn.elu,
                                     inputs=self.conv1, num_outputs=32,
                                     kernel_size=[4, 4],
                                     stride=[2, 2], padding='VALID')
            hidden = slim.fully_connected(slim.flatten(self.conv2),
                                          256, activation_fn=tf.nn.elu)

    • tf.placeholder() - inserts a placeholder for a tensor that will always be fed
    • tf.reshape() - reshapes the input tensor
    • slim.conv2d() - adds a 2-D convolutional layer
    • slim.fully_connected() - adds a fully connected layer

    Note the following definitions:

    • Filter: a small matrix used to apply various effects to a given image.
    • Padding: the process of adding extra rows or columns at the edges of an image so that the filter convolution values can be fully computed.
    • Stride: the number of pixels by which the filter moves in a given direction between applications.
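    The output sizes of the two 'VALID'-padded convolutions above can be checked with a small helper (plain Python; the helper name is ours, not part of the original code):

```python
def conv_output_size(input_size, kernel_size, stride):
    # 'VALID' padding adds no extra rows or columns, so the
    # filter only visits positions it fully covers.
    return (input_size - kernel_size) // stride + 1

# conv1: 84x84 input, 8x8 kernel, stride 4 -> 20x20 (x16 channels)
conv1_size = conv_output_size(84, 8, 4)

# conv2: 20x20 input, 4x4 kernel, stride 2 -> 9x9 (x32 channels)
conv2_size = conv_output_size(conv1_size, 4, 2)

# Flattened size fed into the 256-unit fully connected layer
flat_size = conv2_size * conv2_size * 32
print(conv1_size, conv2_size, flat_size)  # 20 9 2592
```

    This is why slim.flatten(self.conv2) produces a 2592-dimensional vector before the 256-unit hidden layer.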

    Constructing the recurrent network:

    def __init__(self, s_size, a_size, scope, trainer):
        with tf.variable_scope(scope):
            ...
            ...
            ...

            # Building a recurrent network for temporal dependencies
            lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(256, state_is_tuple=True)
            c_init = np.zeros((1, lstm_cell.state_size.c), np.float32)
            h_init = np.zeros((1, lstm_cell.state_size.h), np.float32)
            self.state_init = [c_init, h_init]
            c_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.c])
            h_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.h])
            self.state_in = (c_in, h_in)
            rnn_in = tf.expand_dims(hidden, [0])
            step_size = tf.shape(self.imageIn)[:1]
            state_in = tf.nn.rnn_cell.LSTMStateTuple(c_in, h_in)
            lstm_outputs, lstm_state = tf.nn.dynamic_rnn(lstm_cell, rnn_in,
                                                         initial_state=state_in,
                                                         sequence_length=step_size,
                                                         time_major=False)
            lstm_c, lstm_h = lstm_state
            self.state_out = (lstm_c[:1, :], lstm_h[:1, :])
            rnn_out = tf.reshape(lstm_outputs, [-1, 256])

    • tf.nn.rnn_cell.BasicLSTMCell() - builds a basic LSTM recurrent network cell
    • tf.expand_dims() - inserts a dimension of 1 at the given axis of the input's shape
    • tf.shape() - returns the shape of the tensor
    • tf.nn.rnn_cell.LSTMStateTuple() - creates the (c, h) state tuple used by LSTM cells for state_size, zero_state and output state
    • tf.nn.dynamic_rnn() - builds a recurrent network from the given recurrent network cell
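    As a rough illustration of what BasicLSTMCell computes at each step, here is a single LSTM update written in NumPy. The weights are random stand-ins, not trained parameters, and the gate layout is one common convention:

```python
import numpy as np

def lstm_step(x, c, h, W, b):
    """One LSTM step: x is the input, (c, h) the carried state.
    W maps the concatenated [x, h] to four stacked gate pre-activations."""
    z = np.concatenate([x, h], axis=-1) @ W + b
    i, g, f, o = np.split(z, 4, axis=-1)
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update cell state
    h_new = sigmoid(o) * np.tanh(c_new)               # emit hidden state
    return c_new, h_new

rng = np.random.default_rng(0)
size = 256                             # matches the 256-unit cell above
x = rng.standard_normal((1, size)).astype(np.float32)
c = np.zeros((1, size), np.float32)    # analogous to c_init
h = np.zeros((1, size), np.float32)    # analogous to h_init
W = rng.standard_normal((2 * size, 4 * size)).astype(np.float32) * 0.01
b = np.zeros(4 * size, np.float32)
c, h = lstm_step(x, c, h, W, b)
```

    The zero-initialized (c, h) pair plays the same role as state_init: the state the cell starts from at the beginning of each episode.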

    Generating the value and policy output layers:

    def __init__(self, s_size, a_size, scope, trainer):
        with tf.variable_scope(scope):
            ...
            ...
            ...

            # Create output layers for the value and policy estimates
            self.policy = slim.fully_connected(rnn_out, a_size,
                activation_fn=tf.nn.softmax,
                weights_initializer=normalized_columns_initializer(0.01),
                biases_initializer=None)
            self.value = slim.fully_connected(rnn_out, 1,
                activation_fn=None,
                weights_initializer=normalized_columns_initializer(1.0),
                biases_initializer=None)
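    The two heads can be mimicked in NumPy: a softmax layer that turns the 256-dimensional recurrent output into a probability distribution over a_size actions, and a single linear unit for the value estimate. The weights here are random stand-ins for the initializers above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
a_size, rnn_dim = 3, 256
rnn_out = rng.standard_normal((1, rnn_dim)).astype(np.float32)

# Policy head: softmax over actions (cf. the 0.01-scaled initializer,
# which keeps the initial policy close to uniform)
W_pi = rng.standard_normal((rnn_dim, a_size)).astype(np.float32) * 0.01
policy = softmax(rnn_out @ W_pi)

# Value head: one linear output, no activation
W_v = rng.standard_normal((rnn_dim, 1)).astype(np.float32)
value = rnn_out @ W_v
```

    The policy rows always sum to 1, while the value output is an unbounded scalar per state.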

    Building the master network and deploying the workers:

    The following code runs at script level, outside the AC_Network class:

    with tf.device("/cpu:0"):

        # Global network generation
        master_network = AC_Network(s_size, a_size, 'global', None)

        # Set the number of workers to
        # the number of available CPU threads
        num_workers = multiprocessing.cpu_count()

        # Create and deploy workers
        workers = []
        for i in range(num_workers):
            workers.append(Worker(DoomGame(), i, s_size, a_size,
                                  trainer, saver, model_path))
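    The deployment loop can be exercised without Doom or TensorFlow by substituting stubs; StubGame and StubWorker below are placeholders for the real DoomGame and Worker classes:

```python
import multiprocessing

class StubGame:
    """Stands in for DoomGame(): every worker gets its own instance."""
    pass

class StubWorker:
    """Stands in for Worker: holds a personal environment copy and a name."""
    def __init__(self, game, number, s_size, a_size):
        self.env = game                      # personal copy of the environment
        self.name = "worker_%d" % number

# One worker per available CPU thread, as in the snippet above
num_workers = multiprocessing.cpu_count()
workers = [StubWorker(StubGame(), i, s_size=7056, a_size=3)
           for i in range(num_workers)]
```

    The key point is that each worker receives a fresh environment instance, so the workers can step their simulations independently.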

    Running the TensorFlow operations in parallel:

    This code also runs at script level, after the workers have been created:

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        if load_model == True:
            ckpt = tf.train.get_checkpoint_state(model_path)
            saver.restore(sess, ckpt.model_checkpoint_path)
        else:
            sess.run(tf.global_variables_initializer())

        worker_threads = []
        for worker in workers:
            # Bind worker as a default argument so each thread
            # keeps its own worker rather than the loop variable
            worker_work = lambda worker=worker: worker.work(
                max_episode_length, gamma, master_network, sess, coord)
            t = threading.Thread(target=worker_work)
            t.start()
            worker_threads.append(t)
        coord.join(worker_threads)

    • tf.Session() - a class for running TensorFlow operations
    • tf.train.Coordinator() - returns a coordinator for multiple threads
    • tf.train.get_checkpoint_state() - returns a valid checkpoint state from the "checkpoint" file
    • saver.restore() - restores a previously saved model
    • sess.run() - runs the session and returns the resulting tensors and metadata
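    Stripped of TensorFlow, the thread fan-out and join pattern looks like this in plain Python; work() is a stub for worker.work() that simply records that it ran:

```python
import threading

results = []
lock = threading.Lock()

def work(worker_id):
    # Stub for worker.work(...): the real version runs training episodes
    with lock:
        results.append(worker_id)

worker_threads = []
for i in range(4):
    # Bind i per thread, mirroring the lambda default-argument trick
    t = threading.Thread(target=lambda i=i: work(i))
    t.start()
    worker_threads.append(t)

# The equivalent of coord.join(worker_threads): wait for every thread
for t in worker_threads:
    t.join()
```

    tf.train.Coordinator adds cooperative stopping and error propagation on top of this basic start/join pattern.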

    Updating the global network parameters:

    def __init__(self, s_size, a_size, scope, trainer):
        with tf.variable_scope(scope):
            ...
            ...
            ...

            if scope != 'global':
                self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
                self.actions_onehot = tf.one_hot(self.actions,
                                                 a_size, dtype=tf.float32)
                self.target_v = tf.placeholder(shape=[None], dtype=tf.float32)
                self.advantages = tf.placeholder(shape=[None], dtype=tf.float32)

                self.responsible_outputs = tf.reduce_sum(self.policy *
                                                         self.actions_onehot, [1])

                # Loss calculation
                self.value_loss = 0.5 * tf.reduce_sum(tf.square(self.target_v -
                                                      tf.reshape(self.value, [-1])))
                self.entropy = -tf.reduce_sum(self.policy * tf.log(self.policy))
                self.policy_loss = -tf.reduce_sum(tf.log(self.responsible_outputs)
                                                  * self.advantages)
                self.loss = 0.5 * self.value_loss + \
                            self.policy_loss - self.entropy * 0.01

                # Get gradients from the local network
                local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                               scope)
                self.gradients = tf.gradients(self.loss, local_vars)
                self.var_norms = tf.global_norm(local_vars)
                grads, self.grad_norms = tf.clip_by_global_norm(self.gradients,
                                                                40.0)

                # Apply local gradients to the global network
                global_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                                'global')
                self.apply_grads = trainer.apply_gradients(
                    zip(grads, global_vars))
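    The loss terms and the global-norm clipping can be reproduced numerically. The batch below is a tiny hand-made example (2 timesteps, 3 actions), not real rollout data:

```python
import numpy as np

# Tiny batch: 2 timesteps, 3 actions
policy = np.array([[0.2, 0.5, 0.3],
                   [0.1, 0.8, 0.1]], np.float32)
value = np.array([1.0, 2.0], np.float32)
actions_onehot = np.array([[0, 1, 0],
                           [1, 0, 0]], np.float32)
target_v = np.array([1.5, 1.0], np.float32)
advantages = np.array([0.5, -1.0], np.float32)

# Probability the policy assigned to the action actually taken
responsible_outputs = (policy * actions_onehot).sum(axis=1)

# The three loss terms, mirroring the TensorFlow expressions above
value_loss = 0.5 * np.sum((target_v - value) ** 2)
entropy = -np.sum(policy * np.log(policy))
policy_loss = -np.sum(np.log(responsible_outputs) * advantages)
loss = 0.5 * value_loss + policy_loss - entropy * 0.01

# Global-norm clipping, as in tf.clip_by_global_norm(..., 40.0):
# all gradients share a single scale factor, preserving their direction
def clip_by_global_norm(grads, clip_norm):
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, clip_norm / global_norm)
    return [g * scale for g in grads], global_norm
```

    Note the signs: minimizing the combined loss pushes the value toward target_v, increases the log-probability of advantageous actions, and (via the small entropy bonus) discourages the policy from collapsing prematurely.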