Apr. 8th, 2009

gusl: (Default)
"Exponentiated gradient" seems like a really simple idea. But I'm struggling with the notation.

I suspect that the gradient with respect to an expert (feature) is computed by taking the derivative of the loss with respect to it while holding all the other weights fixed. It's a wonderful property of differentiable multivariate functions that we can assemble the gradient this way (the gradient being defined as the direction of steepest ascent, together with its slope). I almost want to call it a conditional independence property.
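A minimal sketch of that point, on a made-up loss (the function and numbers here are just for illustration): the full gradient is assembled coordinate-by-coordinate, perturbing one weight at a time while the others stay put.

```python
import numpy as np

def loss(w):
    # Toy quadratic loss, chosen only so the analytic gradient is easy to check.
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def numerical_gradient(w, h=1e-6):
    """Build the gradient one partial at a time: perturb weight i, hold the rest fixed."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = h
        grad[i] = (loss(w + e) - loss(w - e)) / (2 * h)  # central difference in coordinate i
    return grad

w = np.array([0.0, 0.0])
print(numerical_gradient(w))  # close to the analytic gradient [-2, 4]
```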

But suppose you're starting at the origin and the true (w1, w2) = (+1,-1), i.e. the true model is y = x1 - x2 + noise. Now suppose the experts x1 and x2 are positively correlated with slope 1. Then your gradient steps will tend to be close to perpendicular to the direction you want to go in. In an extreme case of perfect correlation, this would be an ill-conditioned problem. But I digress.
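The correlated-experts digression is easy to see numerically. A sketch under made-up data (seed, noise levels, and step count are arbitrary): with x2 nearly equal to x1, the loss surface is steep along (1, 1) and nearly flat along (1, −1), so gradient descent barely moves toward the true (+1, −1).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)       # experts almost perfectly correlated, slope 1
X = np.column_stack([x1, x2])
y = x1 - x2 + 0.1 * rng.normal(size=n)    # true model: y = x1 - x2 + noise

w = np.zeros(2)                            # start at the origin
eta = 0.01
for _ in range(1000):
    grad = 2 * X.T @ (X @ w - y) / n       # gradient of mean squared error
    w -= eta * grad

# Almost all the curvature is along (1, 1), nearly perpendicular to the
# target direction (1, -1), so after 1000 steps w is still close to the origin.
print(w)
```

With perfect correlation the flat direction has zero curvature and the problem becomes genuinely ill-conditioned, as the post says.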

Back to exponentiated gradient: why are the additive gradient steps replaced by multiplying w_t by exp(−η · ∇loss)? Taking logs, we get log(w_{t+1}) = log(w_t) − η ∇loss, which looks like ordinary gradient descent on log(w_t). But what is the reason for working in log-space?
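The two updates side by side, as I read them (a sketch: the gradient and step size are made up, and I've included the normalization to the simplex that EG formulations usually carry):

```python
import numpy as np

def gd_step(w, grad, eta):
    """Ordinary additive gradient step."""
    return w - eta * grad

def eg_step(w, grad, eta):
    """Exponentiated-gradient step: multiplicative in w, additive in log w."""
    w_new = w * np.exp(-eta * grad)
    return w_new / w_new.sum()            # renormalize weights to sum to 1

w = np.array([0.5, 0.5])
grad = np.array([1.0, -1.0])              # made-up gradient for illustration
w_eg = eg_step(w, grad, eta=0.1)

# Before normalization, the EG step is exactly additive in log-space:
assert np.allclose(np.log(w * np.exp(-0.1 * grad)), np.log(w) - 0.1 * grad)
print(w_eg)                               # weight shifts toward the expert with negative gradient
```

One consequence of the multiplicative form: weights stay positive, which the additive step does not guarantee.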
gusl: (Default)
Suppose you have a stochastic algorithm. You run it on 4 instances, a bunch of times each.
Your runtime looks something like this (ignore the y-axis):

[runtime plot: repeated runs on each of the 4 instances]
You'd like to make a statement about how much of the variability is due to the stochasticity of the algorithm vs the properties of the instances.

What would you compute? Non-parametric ANOVA?
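One simple starting point is the one-way variance decomposition underlying ANOVA: split the total sum of squares into a between-instance part and a within-instance (stochasticity) part, and report the between fraction. A nonparametric test like Kruskal–Wallis would answer the yes/no question ("do instances differ?") but not the "how much". A sketch on made-up runtimes (the instance means and noise level here are fabricated stand-ins for real data):

```python
import random

# Hypothetical data: 4 instances, 50 runs each (replace with real runtimes).
random.seed(0)
instance_means = [1.0, 3.0, 5.0, 7.0]          # per-instance "difficulty"
runs = [[m + random.gauss(0, 0.5) for _ in range(50)] for m in instance_means]

# One-way decomposition: total SS = between-instance SS + within-instance SS.
all_times = [t for inst in runs for t in inst]
grand = sum(all_times) / len(all_times)
ss_total = sum((t - grand) ** 2 for t in all_times)
ss_between = sum(len(inst) * (sum(inst) / len(inst) - grand) ** 2 for inst in runs)
ss_within = ss_total - ss_between

# Share of variability attributable to the instances rather than the algorithm.
frac_instances = ss_between / ss_total
print(f"instances explain {frac_instances:.0%} of the variance")
```

That fraction is essentially an intraclass correlation; a random-effects model would give the properly bias-corrected version of the same quantity.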
