--------------------------------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------------------------------
understand that correlated samples cause problem. and how paralled solve the problem
another solution is replay buffers, fully ultilizing the advantage of off policy in Q-learning.
there‘s still a problem: Q learning is not gradient descent
divide Q function into two parts: the target net and the evolving net.
sacrifice speed to get the convergence.
overestimation of Natural DQN
get trouble in left and right dilemma of avoiding bumping on a tree
原文地址:https://www.cnblogs.com/ecoflex/p/9094123.html
时间: 2024-10-05 09:08:23