Overestimation occurs because the max operator in standard Q-learning and DQN uses the same values both to select and to evaluate an action. For example, suppose you compute Q'(s', a'_1) = 0.8 and Q'(s', a'_2) = 0.75 for the TD target. You choose the Q-value corresponding to action a'_1 because 0.8 > 0.75. The problem is that a'_2 might in fact be the better action, even though its estimated Q-value is slightly lower.

Remember that in DQN the same network is used both to select the next action and to evaluate it. It was shown that using two different networks for these two roles, as in Double Q-learning, resolves this overestimation problem.
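The difference between the two targets can be sketched in a few lines. This is a minimal illustration with NumPy, not a full agent; the function names and the example Q-values are hypothetical, chosen to mirror the 0.8 vs. 0.75 example above.

```python
import numpy as np

def dqn_target(reward, q_target_next, gamma=0.99):
    # Standard DQN: the target network both SELECTS and EVALUATES
    # the next action via the max operator -> overestimation bias.
    return reward + gamma * np.max(q_target_next)

def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99):
    # Double DQN: the online network selects the action...
    best_action = np.argmax(q_online_next)
    # ...and the target network evaluates that action.
    return reward + gamma * q_target_next[best_action]

# Hypothetical Q-values for two actions in the next state s'
q_target_next = np.array([0.8, 0.75])  # target-network estimates
q_online_next = np.array([0.6, 0.9])   # online network prefers a'_2

print(dqn_target(1.0, q_target_next))                       # uses 0.8
print(double_dqn_target(1.0, q_online_next, q_target_next)) # uses 0.75
```

Because selection and evaluation come from two independently trained networks, an action whose value one network happens to overestimate is less likely to also be overvalued by the other, which dampens the upward bias of the max operator.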

