Entry tags:
L1 regularization
L2 regularization is seen as a way to avoid overfitting when doing regression, nothing more.
L1 regularization tends to give sparse results (a small sketch after this list illustrates the contrast with L2). If the truth is sparse, this is seen as a way to get to the truth (although this is not always consistent, which is why we have Bolasso).
Even if the truth is not sparse, L1 may be seen as an Occam's razor. Is this a valid view?
Even if the truth is not sparse, L1 is a way to select a small number of variables, which can be useful for those of us concerned with scarce computational resources (although it's not clear why you'd choose L1 over PCA or Partial Least Squares)
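To make the sparsity point concrete, here is a minimal sketch using scikit-learn's Lasso (L1) and Ridge (L2) on synthetic data; the alpha values and problem sizes are arbitrary choices for illustration, not anything from the discussion below.

```python
# Minimal sketch: L1 (Lasso) vs L2 (Ridge) on data with a sparse "truth".
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))

# Sparse ground truth: only the first 3 of 20 coefficients are nonzero.
true_coef = np.zeros(p)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# L1 typically drives many coefficients to exactly zero;
# L2 only shrinks them toward zero.
print("nonzero coefficients (L1):", int(np.sum(lasso.coef_ != 0)))
print("nonzero coefficients (L2):", int(np.sum(ridge.coef_ != 0)))
```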
Ooh. Tell me more.
I've been thinking about this too, because Optimality Theory, the favorite model of a lot of linguists, is something like a crude approximation of a logistic regression, with assumptions that radically reduce the number of active variables. The effect of L1 regularization is similar to the effect of OT assumptions, but less radical.
Re: Ooh. Tell me more.
While L1 returns the subset of the original variables it considers to be nonzero (you don't specify how many, though you could tweak the regularization parameter until it returns the desired number), PCA/PLS return a pre-specified number of linear mixtures of the original variables.
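A rough sketch of that contrast, assuming scikit-learn (the data, feature count, and alpha here are made up): Lasso hands back whichever original columns survive at a given penalty, while PCA hands back a pre-specified number of components, each of which is a mixture of all the original variables.

```python
# Sketch: L1 selects a subset of the original variables; PCA returns mixtures.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 4] + rng.normal(scale=0.3, size=150)

# L1: keeps whichever original variables are nonzero at this penalty;
# to hit a specific count you would sweep alpha until it matches.
lasso = Lasso(alpha=0.05).fit(X, y)
print("original variables kept by L1:", np.flatnonzero(lasso.coef_))

# PCA: you pre-specify the number of components; each component is a
# weighted mixture of all ten original variables.
pca = PCA(n_components=3).fit(X)
print("each PCA component mixes", pca.components_.shape[1], "variables")
```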
<< Optimality Theory, the favorite model of a lot of linguists, is something like a crude approximation of a logistic regression, with assumptions that radically reduce the number of active variables. >>
Please tell me more!
Re: Ooh. Tell me more.
Optimality Theory originally grew out of a perceptron-like formalism, with numerical weights on the variables and a categorical output, and the original analogies between perceptrons and OT kind of suggest an L1 exponential prior with discrete (discontinuous) support. The move to OT switches from comparison of numerical sums to a decision-tree-like comparison of candidate outputs, using the variables in order of importance. It essentially breaks down each case into a bunch of candidate decisions, and in each decision it only pays attention to the highest variable that distinguishes between the two candidates. This was motivated by an apparent scarcity in language of the kind of "ganging up" sum effects that you see in perceptrons. But there remain a few phenomena that look like "ganging up", and recently people have been looking beyond the categorical phenomena that are traditional in theoretical linguistics. The probabilistic phenomena seem to require some allowance for these ganging up effects. But they still seem less common than we would expect given a uniform prior, or even the L1 exponential prior.
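To make that contrast concrete, here is a toy sketch (my own reading of the comment, with invented constraint rankings and violation counts): a weighted-sum decision in the perceptron/Harmonic-Grammar style, where lower-ranked constraints can gang up, versus an OT-style decision where the highest-ranked constraint that distinguishes the candidates decides alone.

```python
def harmonic_choice(a, b, weights):
    """Weighted-sum style: sum weighted violations; the lower total wins.
    Lower-ranked constraints can 'gang up' on a higher-ranked one."""
    score_a = sum(w * v for w, v in zip(weights, a))
    score_b = sum(w * v for w, v in zip(weights, b))
    return "A" if score_a < score_b else "B"

def ot_choice(a, b):
    """OT style: constraints are in ranked order; the highest-ranked
    constraint on which the candidates differ decides by itself."""
    for va, vb in zip(a, b):
        if va != vb:
            return "A" if va < vb else "B"
    return "tie"

# Candidate A violates only the top-ranked constraint once; candidate B
# violates each of the three lower-ranked constraints once.
cand_a = [1, 0, 0, 0]
cand_b = [0, 1, 1, 1]
weights = [2.0, 1.0, 1.0, 1.0]  # ranked, but the top weight is not dominant

print(harmonic_choice(cand_a, cand_b, weights))  # "A": the lower constraints gang up against B
print(ot_choice(cand_a, cand_b))                 # "B": only the top constraint matters
```

In effect, OT treats the top constraint as infinitely weighted relative to everything below it; the weighted-sum version is what permits the ganging-up effects described above.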
I suspect that there is an interesting explanation for all of this, but I'm kind of stuck at this point about how to look for it. I wrote a long term paper about this 6 months ago, but then came back to China, so between the shortage of people who have a good understanding of both learning theory and language phenomena, and being away from my peeps in SD who are obligated to read the paper, I've not had any feedback on it. I'm going back to SD in a few weeks, so maybe I'll continue with it then.
Re: Ooh. Tell me more.
First of all, this is *L1* regularization.
Secondly, no, not *near* zero weight. L1 methods throw out the subset of variables whose exclusion least hurts (in terms of prediction error).
Re: Ooh. Tell me more.
I don't understand the question.
I'm not familiar with your terminology. I also don't know this stuff very well.