The parts of the algorithm's name break down as follows:
- Asynchronous: Unlike other popular deep reinforcement learning algorithms such as Deep Q-Learning, which use a single agent and a single environment, this algorithm uses multiple agents, each with its own network parameters and its own copy of the environment. The agents interact with their respective environments asynchronously, learning with each interaction. Each agent is coordinated by a global network: as an agent gains knowledge, it contributes to the knowledge of the global network. Because every agent feeds the same global network, the training data it sees is more varied. The scheme loosely mirrors how people learn, since each person also gains knowledge from the experiences of others, which makes the whole “global network” better.
- Actor-Critic: Unlike simpler techniques based on either value iteration or policy gradient methods alone, the A3C algorithm combines the best parts of both: it predicts both the value function V(s) and the optimal policy function π(s). The learning agent uses the value function (the Critic) to update the optimal policy function (the Actor). Note that the policy function here means the probability distribution over the action space. More precisely, the learning agent determines the conditional probability P(a | s; θ), that is, the parameterized probability that the agent chooses action a in state s.
- Advantage: Typically, in a policy gradient implementation, the discounted return tells the agent which of its actions were rewarded and which were penalized. By using the advantage value instead, the agent also learns how much better the reward was than expected. This gives the agent a better understanding of the environment, so the learning process improves. The advantage metric is defined by the following expression:
Advantage: A = Q(s, a) - V(s)
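As a small illustration of the Actor-Critic and Advantage ideas above, the sketch below turns raw actor outputs into a probability distribution over actions and computes an advantage from a Q estimate and a critic value. The function names, the three-action setup, and all numbers are made up for illustration; a real agent would obtain the logits and V(s) from its neural network.

```python
import math

def softmax(logits):
    """Convert raw actor outputs into probabilities P(a | s) that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def advantage(q_value, state_value):
    """A = Q(s, a) - V(s): how much better the action was than expected."""
    return q_value - state_value

# Hypothetical actor output for a state with three possible actions.
probs = softmax([2.0, 1.0, 0.1])

# Hypothetical critic estimate V(s) and two Q estimates.
adv_good = advantage(q_value=2.0, state_value=1.5)  # better than expected
adv_bad = advantage(q_value=1.0, state_value=1.5)   # worse than expected

# Policy-gradient loss term for action 0: -log P(a | s) * A.
# Minimizing it raises P(a | s) when A > 0 and lowers it when A < 0.
loss_good = -math.log(probs[0]) * adv_good
loss_bad = -math.log(probs[0]) * adv_bad
```

The sign of the advantage is what carries the extra information: the same action is reinforced when it beat the critic's expectation and discouraged when it fell short.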
The following pseudocode is from the research paper referenced above.
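    Define global shared parameter vectors θ and θ_v
    Define global shared counter T = 0
    Define thread-specific parameter vectors θ' and θ'_v
    Define thread step counter t = 1
    while (T <= T_max) {
        dθ = 0, dθ_v = 0
        θ' = θ, θ'_v = θ_v
        t_start = t
        while (s_t is not terminal and t - t_start < t_max) {
            Simulate action a_t according to π(a_t | s_t; θ')
            Receive reward r_t and next state s_{t+1}
            t++
            T++
        }
        if (s_t is terminal) { R = 0 } else { R = V(s_t; θ'_v) }
        for (i = t-1; i >= t_start; i--) {
            R = r_i + γR
            dθ = dθ + ∇_θ' log π(a_i | s_i; θ') (R - V(s_i; θ'_v))
            dθ_v = dθ_v + ∂(R - V(s_i; θ'_v))² / ∂θ'_v
        }
        Asynchronously update θ using dθ and θ_v using dθ_v
    }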
Where,
T_max: maximum number of iterations
dθ: change in the global parameter vector
R: total reward (return)
π: policy function
V: value function
γ: discount factor
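The backward loop of the pseudocode, which builds the return R step by step and compares it with the value estimate, can be sketched in plain Python. This is a minimal sketch under assumptions: the function names `n_step_returns` and `advantages` are hypothetical, and the rewards, value estimates, and discount factor are made-up numbers rather than results from the paper.

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """Walk the rollout backwards, applying R = r_i + gamma * R at each step.

    bootstrap_value is 0 if the rollout ended in a terminal state,
    otherwise the critic's estimate V(s_t) for the last state reached.
    """
    R = bootstrap_value
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns.reverse()  # restore chronological order
    return returns

def advantages(returns, values):
    """Advantage estimate R - V(s_i) for each step of the rollout."""
    return [R - v for R, v in zip(returns, values)]

# Hypothetical three-step rollout that ended in a terminal state.
rewards = [1.0, 0.0, 2.0]
values = [1.0, 1.0, 1.0]  # made-up critic estimates V(s_i)
returns = n_step_returns(rewards, bootstrap_value=0.0, gamma=0.5)
advs = advantages(returns, values)
print(returns)  # [1.5, 1.0, 2.0]
print(advs)     # [0.5, 0.0, 1.0]
```

In the full algorithm each advantage would weight the gradient of log π(a_i | s_i; θ') for the actor, and its square would drive the critic's update.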
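Benefits:
- In the experiments reported in the original paper, this algorithm trained faster and more stably than the standard reinforcement learning baselines it was compared with, such as Deep Q-Learning.
- It tends to perform better than single-agent methods because the parallel agents supply the diverse, less correlated experience described above.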
- It can be used on both discrete and continuous action spaces.
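The asynchronous scheme behind these benefits can be illustrated with a toy sketch in which several threads update one shared parameter. Everything here is an assumption made for illustration: there is no environment or neural network, each worker simply minimizes (θ - target)² for its own hypothetical target, and a lock guards the shared state for simplicity, whereas the paper applies its updates without locking.

```python
import threading

def run_workers(targets, steps_per_worker=200, lr=0.05):
    """Toy asynchronous training: each thread pulls the global parameter,
    computes a gradient on its own data, and pushes an update back."""
    shared = {"theta": 0.0, "T": 0}  # global parameter and global step counter
    lock = threading.Lock()

    def worker(target):
        for _ in range(steps_per_worker):
            with lock:
                local_theta = shared["theta"]  # sync thread-specific copy
            # Gradient of (theta - target)^2, computed on the local copy.
            # Other threads may change the global value in the meantime.
            grad = 2.0 * (local_theta - target)
            with lock:
                shared["theta"] -= lr * grad  # apply update to global parameter
                shared["T"] += 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in targets]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return shared["theta"], shared["T"]

# Four workers, each with different (made-up) data.
theta, T = run_workers([1.0, 2.0, 3.0, 4.0])
print(T)      # total number of updates applied across all workers
print(theta)  # varies from run to run with thread scheduling
```

Because each worker sees different data and the interleaving of updates is not fixed, the global parameter receives a mix of gradients rather than a long run from a single source, which is the intuition behind the diversity benefit listed above.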