Artem Oppermann
Nov 8, 2018


Hello,

Overestimation occurs because the max operator in standard Q-learning and DQN uses the same values both to select and to evaluate an action. For example, say you compute Q’(s’,a’_1)=0.8 and Q’(s’,a’_2)=0.75 in the TD-target. You take the Q-value corresponding to action a’_1 because 0.8 > 0.75. The problem is that a’_2 might in fact be the better action even though its estimated value is slightly lower: the estimates are noisy, and the max operator tends to latch onto exactly this kind of upward estimation error.
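To make the bias concrete, here is a minimal numpy sketch (not part of the original comment, the values and noise level are made up for illustration): even when two actions have the same true value, the average of the max over noisy estimates comes out too high.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two actions whose true value in s' is identical (0.75),
# but the network's estimates Q'(s', a') carry zero-mean noise.
true_q = np.array([0.75, 0.75])
n_trials = 100_000
noisy_estimates = true_q + rng.normal(0.0, 0.1, size=(n_trials, 2))

# Standard Q-learning / DQN uses the max over these same noisy estimates
# to both pick a' and read off its value for the TD-target.
mean_max_estimate = noisy_estimates.max(axis=1).mean()

print(f"true max value:       {true_q.max():.3f}")        # 0.750
print(f"mean of max estimate: {mean_max_estimate:.3f}")   # noticeably above 0.750
```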

Remember that in DQN you use the same network both to select an action and to evaluate its value. It was shown that decoupling these two roles across two different networks, as in Double Q-learning, resolves this overestimation problem.
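For reference, here is a short PyTorch-style sketch of how the two TD-targets differ. The networks q_net (online) and target_net, the Linear stand-ins, and gamma are illustrative assumptions, not code from the comment above.

```python
import torch
import torch.nn as nn

# Hypothetical networks: both map states -> per-action Q-values, shape [batch, n_actions].
q_net = nn.Linear(4, 2)       # online network (stand-in for a real DQN)
target_net = nn.Linear(4, 2)  # target network

def dqn_target(rewards, next_states, dones, gamma=0.99):
    # Standard DQN: the target network both selects a' (argmax) and evaluates it (max value).
    next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q

def double_dqn_target(rewards, next_states, dones, gamma=0.99):
    # Double DQN: the online network selects a', the target network evaluates that choice.
    best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q

# Tiny usage example with random data (dones is a float tensor of 0/1 flags).
next_states = torch.randn(8, 4)
rewards = torch.zeros(8)
dones = torch.zeros(8)
print(dqn_target(rewards, next_states, dones))
print(double_dqn_target(rewards, next_states, dones))
```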
