Hi,

V(s) is calculated by a neural network with no hidden layers and a linear output function (in the case of my implementation on GitHub). It works the best with no hidden layers for some reason. Thus its nothing more than a linear classifier.

I never bothered to understand fully the mathematical reason why the variance can be reduced this way since I only needed the models to work. But the derivation for this can be found here :

*Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. (2007). Incremental natural actor-critic algorithms. In Neural Information Processing Systems 21.*