gradient descent in online learning
Apr. 8th, 2009 12:51 am"Exponentiated gradient" seems like a really simple idea. But I'm struggling with the notation.
I suspect that the gradient with respect to an expert (feature) is computed by taking the derivative of the loss with respect to it while holding all the other weights fixed. It's a wonderful property of multivariate differentiable functions that we can compute the gradient this way, one coordinate at a time (the gradient being the direction of steepest ascent, together with its slope). I almost want to call it a conditional independence property.
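To make that concrete, here is a minimal worked case, assuming squared loss on a single round (my choice of loss for illustration, not anything from the EG paper): with prediction ŷ = w1 x1 + w2 x2 and loss L = (y - ŷ)², the partial derivative with respect to w1 is ∂L/∂w1 = -2 (y - ŷ) x1, computed exactly as if w2 were a constant. The gradient is then just the vector of these one-at-a-time derivatives, ∇L = (∂L/∂w1, ∂L/∂w2).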
But suppose you're starting at the origin and the true (w1, w2) = (+1, -1), i.e. the true model is y = x1 - x2 + noise. Now suppose the experts x1 and x2 are positively correlated with slope 1, so that x2 ≈ x1. Then your gradient steps will tend to be close to perpendicular to the direction you want to go in. In the extreme case of perfect correlation, the problem becomes ill-conditioned. But I digress.
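A quick check of that perpendicularity claim, again assuming squared loss and taking the extreme case x2 = x1 = x: the loss (y - w1 x - w2 x)² depends on the weights only through the sum w1 + w2, so ∂L/∂w1 = ∂L/∂w2 and every gradient step moves along the direction (1, 1). That is exactly perpendicular to the target direction (+1, -1), which is one way to see the ill-conditioning: gradient steps starting from the origin can never change w1 - w2.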
Back to exponentiated gradient: why are the additive gradient steps replaced by multiplying the weights by exp(-η loss)? Taking logs, we get log(wt+1) = log(wt) - η loss, which is ordinary gradient descent on log(wt). But what is the reason for working in log-space?
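For concreteness, here is a minimal sketch of the two update rules side by side. This is my own toy code, not from any particular paper; "grad" stands for the vector of per-expert losses (or gradient coordinates, depending on the setting), and the normalization in the EG step is the usual projection back onto the simplex:

import numpy as np

def gd_step(w, grad, eta):
    # ordinary additive gradient descent step
    return w - eta * grad

def eg_step(w, grad, eta):
    # exponentiated gradient: multiplicative update,
    # then renormalize so weights stay positive and sum to 1
    w = w * np.exp(-eta * grad)
    return w / w.sum()

# two experts, uniform initial weights, a made-up loss vector
w = np.array([0.5, 0.5])
grad = np.array([0.3, -0.1])
print(gd_step(w, grad, eta=0.1))  # additive step can leave the simplex
print(eg_step(w, grad, eta=0.1))  # stays a probability distribution

Taking logs of the multiplicative line in eg_step recovers exactly the additive update above, up to the normalization constant, which is one concrete sense in which EG is gradient descent in log-space.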