This question concerns an implementation of the Topmoumoute natural gradient (TONGA) algorithm, as described on page 5 of Le Roux et al. (2007): http://research.microsoft.com/pubs/64644/tonga.pdf.
I understand that the basic idea is to augment stochastic gradient ascent with the covariance of the stochastic gradient estimates. The natural gradient approach multiplies a stochastic gradient by the inverse of the covariance of the gradient estimates, so each component of the gradient is weighted by the inverse of its variance. In other words, we prefer moving in directions that show less variance across the stochastic gradient estimates:
$ng \propto C^{-1} g$
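For concreteness, here is how I understand that weighting step (a toy numpy sketch; the sizes and noise scales are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 100 noisy gradient samples of a 3-dimensional parameter
# vector, with per-component noise scales 0.1, 1.0, and 5.0.
grads = rng.normal(size=(100, 3)) * np.array([0.1, 1.0, 5.0]) + 1.0

g = grads.mean(axis=0)           # stochastic gradient estimate
C = np.cov(grads, rowvar=False)  # covariance of the gradient estimates
C += 1e-6 * np.eye(3)            # regularize before inverting

ng = np.linalg.solve(C, g)       # natural gradient: ng = C^{-1} g
```

The high-variance third component gets strongly down-weighted relative to the low-variance first one, which matches my understanding of the motivation.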
Since updating and inverting the covariance in an online optimisation setting is costly, the authors describe a ``cheap'' approximate update algorithm on page 5:
$\hat{C}_t = \gamma \hat{C}_{t-1} + g_tg_t^T$, where $\hat{C}_{t-1}$ is the low-rank approximation at time step $t-1$. Writing $\hat{C}_{t} = X_tX_t^T$ with $X_t = [\sqrt{\gamma}\, X_{t-1}\ \ g_t]$, they use an iterative update rule for the Gram matrix $X_t^T X_t = G_t = \begin{pmatrix}\gamma G_{t-1}&\sqrt{\gamma} X^T_{t-1}g_t\\ \sqrt{\gamma} g^T_t X_{t-1}&g_t^Tg_t\end{pmatrix}$
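This is the part I implemented as a block update of the small Gram matrix, along these lines (numpy sketch with illustrative sizes and an arbitrary discount factor):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, d = 0.995, 4              # illustrative discount factor and dimension

X = rng.normal(size=(d, 2))      # current low-rank factor X_{t-1}
G = X.T @ X                      # its Gram matrix G_{t-1}
g = rng.normal(size=(d, 1))      # new gradient g_t

# Since X_t = [sqrt(gamma) X_{t-1}, g_t], the new Gram matrix G_t can be
# assembled from G_{t-1} without recomputing all the inner products.
top = np.hstack([gamma * G, np.sqrt(gamma) * X.T @ g])
bottom = np.hstack([np.sqrt(gamma) * g.T @ X, g.T @ g])
G_new = np.vstack([top, bottom])

# Sanity check against building X_t explicitly.
X_new = np.hstack([np.sqrt(gamma) * X, g])
assert np.allclose(G_new, X_new.T @ X_new)
```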
They then state: ``To keep a low-rank estimate of $\hat{C}_{t} = X_tX_t^T$, we can compute its eigendecomposition and keep only the first k eigenvectors. This can be made low cost using its relation to that of the Gram matrix $G_t = X_t^T X_t$:
$G_t = VDV^T$
$C_t = (X_tVD^{-\frac12})D(X_tVD^{-\frac12})^T$''
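This step I can verify numerically: eigendecomposing the small $n \times n$ Gram matrix yields the nonzero-eigenvalue part of the eigendecomposition of the large $d \times d$ matrix $C = XX^T$ (numpy sketch, sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 50, 5                 # many parameters, few stored gradient columns
X = rng.normal(size=(d, n))

G = X.T @ X                  # small n x n Gram matrix
D, V = np.linalg.eigh(G)     # eigendecomposition G = V diag(D) V^T

# Eigenvectors of C = X X^T for the nonzero eigenvalues:
U = X @ V / np.sqrt(D)       # columns are X V D^{-1/2}

# Check: U diag(D) U^T reproduces C, and the columns of U are orthonormal.
C = X @ X.T
assert np.allclose(U @ np.diag(D) @ U.T, C)
assert np.allclose(U.T @ U, np.eye(n))
```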
Because updating and decomposing $G$ at every step is still too expensive, they then suggest updating $X$ for several steps between eigendecompositions, using $C_{t+b} = X_{t+b}X_{t+b}^T$ with $X_{t+b} = \left[\gamma^{\frac{b}{2}} U_t, \ \gamma^{\frac{b-1}{2}} g_{t+1},\ \dots,\ \gamma^{\frac12} g_{t+b-1}, \ g_{t+b}\right]$
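The exponents in the paper's formula read to me as a simple discounting pattern, which is what I implemented (numpy sketch; the sizes are illustrative, and the exponent pattern $\gamma^{b/2}, \gamma^{(b-1)/2}, \dots, \gamma^0$ is my reading of the formula, not something the paper spells out step by step):

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, d, k, b = 0.995, 20, 3, 4  # illustrative discount, dim, rank, block size

U = rng.normal(size=(d, k))       # low-rank factor kept after the last eigendecomposition
grads = [rng.normal(size=(d, 1)) for _ in range(b)]  # g_{t+1}, ..., g_{t+b}

# Between eigendecompositions, append the discounted gradients:
# gamma^{b/2} U_t, gamma^{(b-1)/2} g_{t+1}, ..., gamma^{1/2} g_{t+b-1}, g_{t+b}
cols = [gamma ** (b / 2) * U]
cols += [gamma ** ((b - j) / 2) * grads[j - 1] for j in range(1, b + 1)]
X = np.hstack(cols)               # X_{t+b}, so that C_{t+b} = X X^T
```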
I can see why you can get $C_t$ from $G_t$ using the eigendecomposition. But I'm unsure about their update rule for $X$: the authors don't explain where $U$ comes from.
My first assumption (based on the notation) was that $U_t$ consists of the first $k$ eigenvectors of $C_t$, but that makes little sense.
My second assumption was that:
$U = (X_tVD^{-\frac12})D^{\frac12}$, because then $UU^T \approx X_tX_t^T = C_t$. But again I'm not sure this is reasonable, because I don't think $[\gamma U, g]$ is then a good approximation of $[\gamma X_{t-1}, g]$.
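For what it's worth, under this second assumption $U$ reduces to $X_t V$ restricted to the top $k$ columns, and $UU^T$ is exactly the rank-$k$ truncation of $C_t$ (numpy sketch of my reasoning; sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, k = 30, 6, 2
X = rng.normal(size=(d, n))

evals, V = np.linalg.eigh(X.T @ X)  # ascending eigenvalues of the Gram matrix
idx = np.argsort(evals)[::-1][:k]   # keep the k largest
D, Vk = evals[idx], V[:, idx]

# Second assumption: U = (X V D^{-1/2}) D^{1/2}, restricted to k columns.
# Algebraically this is just X @ Vk; kept explicit for clarity.
U = (X @ Vk / np.sqrt(D)) * np.sqrt(D)

# Then U U^T equals the rank-k truncation of C = X X^T built from the
# eigenvectors recovered via the Gram matrix.
eigvecs = X @ Vk / np.sqrt(D)
C_rank_k = eigvecs @ np.diag(D) @ eigvecs.T
assert np.allclose(U @ U.T, C_rank_k)
```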
So if anyone has any advice as to what U should be, that would be much appreciated. I'm not getting good results using this method and want to make sure I have implemented it as intended.