gusl: (Default)
My experience with applied statistics has been fun in large part because it involved making the Bayesian rubber meet the road of inherent computational complexity, and because it allowed me to define and explore theoretical concepts and tie them to our applied problem... I also had a good time formalizing and unifying ideas, making them more coherent and clearer. And it has been great to work with dedicated people. I'm still learning from my new experience in a mentoring role.

But perhaps the most formative thing is the periodic mindfuck that comes with realizing how wrong I am. It is the frustration of realizing that missing data reflects perfectly innocent biases (leaving me with little recourse, since I don't have a model of our colleagues who decide how to collect data, and what to report). It is questioning maximum-likelihood. Last year, I saw how one can expect to make better predictions by deliberately plugging in a parameter value that one knows is smaller than the truth (in simulated data, with no misspecification). Now I am seeing how a better search algorithm leads to worse prediction. Maybe it is time to penalize the likelihood after all... But it's not clear where the overfitting may be coming from, or what makes one block structure more complex than another.

In supervised classification problems, we can define complexity families (e.g. linear separators are very simple; quadratic ones are less simple, etc.) and define penalties proportional to the complexity measure... but what would you do if there was no geometry to the space, or if you were ignorant of the geometry to begin with? And what makes one complexity measure (or one geometry) superior to another anyway? (It is not helpful to mention that Kolmogorov Complexity provides a universal measure of complexity, although I do enjoy this brand of optimism.)
gusl: (Default)


Someday I would like a way to make these lines transparent, to avoid overplotting.
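In case it is useful, here is a minimal sketch of the idea in Python/matplotlib (a stand-in, since the plotting system behind the original figure isn't specified): semi-transparent lines via an alpha parameter, so that regions of heavy overplotting accumulate ink instead of hiding each other.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)

# 200 overlapping trajectories; with alpha well below 1, dense regions
# show up darker rather than as a solid blob
for _ in range(200):
    y = np.cumsum(rng.normal(size=x.size))
    plt.plot(x, y, color="black", alpha=0.05)

plt.show()
```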
gusl: (Default)
Are likelihood functions ever non-differentiable?

Of course, you can always transform the parameters in such a way that you get a kink in the function... My question is whether this ever occurs naturally.

---

UPDATE: clearly, if the likelihood for each data point is differentiable, then so is the likelihood for the whole data.
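Spelling out the reasoning: for IID data the likelihood factorizes,

```latex
L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta),
\qquad
\ell(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta),
```

and products and sums of differentiable functions are differentiable, so differentiability of each per-point term carries over to the whole-data likelihood (and, when the terms are positive, to the log-likelihood).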
gusl: (Default)
I'm interested in the effective sample size of my Gibbs sample.

Plotting the autocorrelation function, I notice that the even lags have much higher autocorrelation than the odd lags. But if you skip the odd lags (or the even lags), it looks smooth. It's interesting to consider what may have caused this.

I'm also wondering if R's effectiveSize will still work correctly.
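For what it's worth, here is a crude Python sketch of the quantity involved. This is not coda's effectiveSize (which uses a spectral-density estimate), so the numbers will differ; the odd/even pattern can be inspected by comparing autocorr(chain, K)[1::2] against autocorr(chain, K)[2::2].

```python
import numpy as np

def autocorr(chain, max_lag):
    """Sample autocorrelation of a 1-D trace at lags 0..max_lag."""
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) / np.dot(x, x)
                     for k in range(max_lag + 1)])

def ess(chain, max_lag=200):
    """N / (1 + 2 * sum of autocorrelations), truncated at the first
    non-positive lag so that noise in the tail doesn't blow things up."""
    rho = autocorr(chain, max_lag)[1:]
    cut = int(np.argmax(rho <= 0)) if np.any(rho <= 0) else len(rho)
    return len(chain) / (1.0 + 2.0 * rho[:cut].sum())
```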
gusl: (Default)
I am wondering if, instead of running MCMC and hoping that it has "mixed" ("achieved stationarity"), there are approaches based on computing (or approximating) the principal left eigenvector of the transition matrix.

Of course, in continuous spaces, this "matrix" has as many entries as S^2, where S is the space our parameters live in... so our "principal eigenvector" becomes the "principal eigenfunction". Functional analysts, how do you compute this?

If it helps, we might want to choose a sparse proposal (such as the one corresponding to Gibbs sampling, in which all transitions changing more than one parameter have probability density zero).
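In the finite-state case the idea is just an eigenproblem, as in the sketch below; the continuous case would need either a discretization of the state space or an operator eigensolver, which is exactly the part I am unsure about.

```python
import numpy as np

# Toy finite-state chain: P[i, j] = probability of moving from state i to state j
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

# The stationary distribution pi satisfies pi P = pi: it is the principal left
# eigenvector of P, i.e. the right eigenvector of P.T with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi /= pi.sum()
print(pi)   # no need to wait for the chain to mix
```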
gusl: (Default)
This is a paper draft that is probably more worthy of being a blog post, so I decided to blogize it. Here we go:

Abstract:
It is well-known that Independent Component Analysis (ICA) is not identifiable when more than one source is Gaussian (Comon 1993). However, the question of just how identifiable it remains is rarely asked. In this paper, we show a generalization of the identifiability result, namely that the number of degrees of freedom in the observational equivalence class of ICA is k-1, where k is the number of Gaussians in the sources.

However, since the sources are latent variables, k is unknown and needs to be estimated. We frame this as a problem of "hyper-equatorial regression" (i.e. find the hyper-great-circle that best fits the resampled points on the hypersphere), which is solved by doing PCA and discarding components beyond the knee of the eigenspectrum.

Having estimated the Gaussian subspace, we can then project the data onto it, and now we have a fully identifiable ICA.

We conclude by presenting a simple modification of the FastICA algorithm that only returns the identifiable components, by exiting early if non-Gaussian components can't be found. This is efficient because it avoids the bootstrap and clustering steps.


Independent component analysis is a statistical technique for estimating the mixing matrix A in equations of the form x = A s, where x is observed while s and A are not. The vector s is usually called the "sources".

The matrix A can be identified up to scaling and column-permutations as long as the observed distribution is a linear, invertible mixture of independent components, at most one of which is Gaussian.

This note is concerned with the unidentifiability that results from more than one source being Gaussian, in which case, rather than a single mixing matrix, there is a proper subspace of mixing matrices that lead to the observed joint distribution in the large-sample limit.

The observational equivalence class of ICA (modulo scaling and permutation) is the set of mixing matrices that can be obtained by applying an invertible linear transformation to the Gaussian components while keeping the non-Gaussian components fixed. If we order the components so that the Gaussian ones come first, the observational equivalence class of any mixing matrix A can be written as the set {AM | M is like the identity matrix except for a k-by-k block in the top left, which is a general invertible matrix}.

If we have 2 Gaussians among a large number of sources, ICA is far from useless: all but 2 components can be identified, and the two unidentified components correspond to the two Gaussian sources. The observational equivalence class has one degree of freedom, as can be seen in (b) below.

Now we illustrate the behavior of the ICA statistic on an example with 3 sources, by resampling and plotting the directions of the resulting components on the unit sphere:
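The original figures are not reproduced here. As a rough indication of how such a plot can be generated (this uses scikit-learn's FastICA as a stand-in for whatever implementation produced the figures, and a plain bootstrap over rows):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 2000
# 3 sources: two Gaussian (the unidentifiable pair) and one clearly non-Gaussian
S = np.column_stack([rng.normal(size=n),
                     rng.normal(size=n),
                     rng.uniform(-1, 1, size=n)])
A = rng.normal(size=(3, 3))      # true mixing matrix (unknown in practice)
X = S @ A.T                      # observations: x = A s

directions = []
for _ in range(100):
    idx = rng.integers(0, n, size=n)                    # bootstrap resample of the rows
    ica = FastICA(n_components=3, max_iter=1000).fit(X[idx])
    cols = ica.mixing_ / np.linalg.norm(ica.mixing_, axis=0)   # unit-norm columns
    cols = cols * np.where(cols[0] < 0, -1.0, 1.0)      # fix signs so antipodes coincide
    directions.append(cols.T)
directions = np.vstack(directions)   # 3 points on the unit sphere per resample
```

The non-Gaussian component should form a tight cluster, while the two Gaussian components smear out along a great circle.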



As we can see, in Fig (b), the two components that lie on the circle can rotate around an axis. The next figure provides a 3D plot of this.



ToDo: implement proper clustering, modify FastICA, write a proper report

Having run a clustering procedure to eliminate the tight clusters, we use PCA to find the k-plane around which the points on the ring lie. This gives us the Gaussian subspace.
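A minimal sketch of this step (the clustering that removes the tight clusters is assumed to have already happened, and the knee detection below is just a crude stand-in):

```python
import numpy as np

def gaussian_subspace(ring, k=None):
    """PCA on the leftover 'ring' directions (unit vectors, tight clusters removed);
    the leading principal directions span the estimated Gaussian subspace."""
    # no centering: the points are sign-normalized directions on the sphere
    _, svals, Vt = np.linalg.svd(np.asarray(ring), full_matrices=False)
    spectrum = svals ** 2 / len(ring)
    if k is None:
        # crude knee heuristic: largest relative drop in the eigenspectrum
        k = int(np.argmax(spectrum[:-1] / np.maximum(spectrum[1:], 1e-12))) + 1
    return Vt[:k]   # rows: an orthonormal basis for the estimated k-dim subspace
```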

Thanks to Ives Macedo for discussions.
gusl: (Default)
Being fond of non-parametrics, I have an instinctive dislike of Gaussian models (maybe because they are overused by people who don't know statistics).

Q: So why might one be satisfied with the Gaussian model, while knowing that it is very wrong (e.g. if you know the truth should be a skewed, fat-tailed distribution, e.g. a power-law distribution)?

A: Because it will estimate the mean and variance "correctly". For any prior, if you Gaussianize it, and do the usual Bayesian update, the posterior obtained will have the same (μ, σ²) as the (μ, σ²) obtained by using the original prior... and using this you can get confidence bounds on the quantiles of the posterior, e.g. through Chebyshev's inequality. It's not giving you maximum power, but it's simple.

I think this also extends to the multivariate case, i.e. all the means and covariances are preserved when you Gaussianize the prior. And you can still use some analog of Chebyshev's inequality that uses the Mahalanobis distance from the mean.
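For reference, the bounds being alluded to; the second is the multivariate (Mahalanobis) form of Chebyshev's inequality, for a d-dimensional X with mean μ and covariance Σ:

```latex
P\big(|X - \mu| \ge k\sigma\big) \le \frac{1}{k^2},
\qquad
P\big((X - \mu)^{\top} \Sigma^{-1} (X - \mu) \ge t^2\big) \le \frac{d}{t^2}.
```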
gusl: (Default)
Can one estimate the correlation between two variables when no joint data is observed?

Imagine you have samples from a 2D joint distribution... except that you don't have joint observations, because the x and y observations are unmatched. Your data can be plotted using vertical and horizontal lines (red lines on figure below).



My surprising realization was that the marginals give some evidence for how the x- and y-readings are matched (especially strong evidence when the true correlation is close to 1 or -1, i.e. the mutual information is high).
gusl: (Default)
According to the sources I've read, Gibbs sampling doesn't really tell you how to sample... it's just divide-and-conquer: you still need to sample each variable from its full conditional.

In my problem, that would be more trouble than it's worth, since no denominators are canceling. So I'm doing good old Metropolis-Hastings, with a proposal that makes its trajectory resemble Gibbs: it randomly chooses one variable to modify, and then randomly chooses a value for that variable. In other words, the proposal is uniform over the neighbors in the "generalized hypercube"*.
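Here is a minimal sketch of that sampler (log_post is assumed to return the unnormalized log-posterior of a state in {0,...,k-1}^n; since the proposal is symmetric, the Hastings correction drops out):

```python
import numpy as np

def mh_hypercube(log_post, n_vars, k, n_iter, rng=None):
    """Metropolis-Hastings whose proposal is uniform over the neighbors that
    differ from the current state in exactly one coordinate."""
    rng = rng or np.random.default_rng()
    state = rng.integers(0, k, size=n_vars)
    lp = log_post(state)
    samples = []
    for _ in range(n_iter):
        prop = state.copy()
        i = rng.integers(n_vars)                       # which variable to modify
        prop[i] = (state[i] + rng.integers(1, k)) % k  # a uniformly chosen *different* value
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:       # symmetric proposal: ratio of posteriors
            state, lp = prop, lp_prop
        samples.append(state.copy())
    return np.array(samples)
```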

I can easily compute the (unnormalized) posterior of any given model. In traditional MCMC, I think you would weight the samples by how often they appear. But doesn't it make more sense to directly compute the posterior at all sampled models?

Should I be using MCMC at all?
How else am I going to find the set of high-probability models? (Maybe what I want is a mode-oriented stochastic search, as my EA project did.)

Believe it or not, it's my first time actually implementing MCMC.

Also, I'd like to know what proportion of the posterior mass my sampled models account for... but this is probably VeryHard to do if I can't enumerate the models (otherwise we could know whether the chain has mixed).



* - what do you call a non-binary hypercube? I mean: let each node be a string in {0,...,k-1}^n, and let the neighborhood relation be defined by the set of strings that differ in exactly 1 character. When k=2, we get the n-dim hypercube. What's the word when k>2?
gusl: (Default)
What do you call a bootstrap in which you sample without replacement? This way the resamples will tend to look like plausible samples from the population, albeit smaller than the original sample.

Besides, you don't want to sample with replacement if you're:
(a) working with Chinese Restaurant Processes: you could be falsely led to believe that customers are more gregarious than they really are if you count multiple clones of the same customer at the same table as different customers.
or are interested in local statistics such as:
(b) cluster tightness: cloned points form tight clusters indeed!
(c) mode of a distribution: same idea.

Why is the standard to sample with replacement? And if you're sampling with replacement, why is it important for the resample to have the same size as the original sample? Why not smaller? Why not bigger?

Although a smaller resample-size will lead to broader sampling distributions for each resample, a big resample-size will make the resamples arbitrarily similar to the original sample (thanks to large sample results)*. It's unclear what exactly you want to optimize.
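To make the replace/size knobs concrete, a small sketch (what to pass for `size` is exactly the open question above):

```python
import numpy as np

def resample_distribution(data, statistic, n_resamples=2000, size=None,
                          replace=True, rng=None):
    """Empirical distribution of `statistic` over resamples of `data`.

    replace=True,  size=len(data):   the standard bootstrap
    replace=False, size<len(data):   without-replacement resampling, where each
                                     resample is itself a plausible (smaller) sample
    """
    rng = rng or np.random.default_rng()
    data = np.asarray(data)
    size = len(data) if size is None else size
    return np.array([statistic(rng.choice(data, size=size, replace=replace))
                     for _ in range(n_resamples)])

# e.g. resample_distribution(x, np.median, size=len(x) // 2, replace=False)
```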

Tangentially, I've never heard of deterministic resampling: surely it is better to use some sort of combinatorial design than to pick your resamples randomly.


* - furthermore, leading some people to be falsely confident (by mistaking a large resample from the original sample for a large sample from the population).
gusl: (Default)
* Bayes Rule: "P(A,B) = P(A|B) P(B)" means "forall a,b . P(A=a, B=b) = P(A=a|B=b) P(B=b)". This is very standard.

* "variance of the estimator" means "variance of the sampling distribution of the estimator". AFAICT, this is unambiguous, and the only reasonable interpretation is for "estimator" to mean the random variable. To make this even more explicit: the estimator(RV) is the result of applying the estimator(function) to the random data.

* "estimate the parameters" means "estimate the values of the parameters"; more confusingly, "choose the parameters" can mean "choose the values of the parameters". This may just be the econometricians I've been reading.

* "distribution" to mean "family of distributions". Very standard. No one blinks an eye at "the Gaussian distribution". I think "family" is typically only used to describe families for which mean and variance are not sufficient statistics.

* "sample" to mean "data point". One should be careful here: in standard usage, a "sample" is a collection of data points. Sometimes, though, one samples just one point, and metonymically calls it "the sample".

* using "correlated" to mean "dependent". This is incorrect, except in special circumstances, such as multivariate Gaussian models.

* using "sufficient statistics" to mean "summary statistics" (e.g. in the context of mean-field approximations). This is incorrect.

---

UPDATE: I should write a SigBovik paper titled "Introduction to Statistical Pedantics".
gusl: (Default)
Bootstrap sampling is a procedure for estimating (properties of) the sampling distribution of a statistic when you can't sample directly from the population (i.e. the typical setting), but have a limited IID sample S.

If we could sample directly from the population, we would do the following: sample, compute the statistic, repeat. This way, we are sampling directly from the sampling distribution, which means we have a (nearly) consistent estimate.

When we don't have access to the population (i.e. can't sample directly from it), the best we can do is reuse the data points, "recycle" if you will. In bootstrap sampling, you sample from the original sample S with replacement (typically creating a sample of the same size as S).

Problems:
* bootstrap samples have less "diversity" than the original sample (except when they are exactly the same as the original, up to permutation). This can hurt when you end up with resamples that have lower rank, which happens most often when the k << n assumption (# features << size of sample) does not hold. I ran into this when working with ICA, leading me to modify the bootstrap procedure to reject samples that have too many repetitions (too little "diversity").1

Misconceptions:
* treating bootstrap samples as samples from the population. If you're trying to test the hypothesis that the median of a continuous-valued population is greater than some number, you'll use some procedure which gets arbitrarily confident as the sample size goes to infinity. But in the bootstrap setting, you can't get arbitrarily confident... if you use such a procedure, you'll converge on the sample median instead! (I'd love to quote an information-theoretic theorem to the effect of: the population never reveals herself completely in finite samples). Instead, for a large enough number of resamples, you may wish to do a quantile test (for 95% confidence, check whether >95% of the resampled medians are greater than the value appearing in the null hypothesis).
For the same reason, it should be obvious that bootstrapping is a pretty bad procedure for some statistics. I have a feeling that this is especially the case when the true sampling distribution has a large spread (at the given sample-size).
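A sketch of that quantile-style check (the median statistic and the 95% threshold are the ones from the example above; everything else is boilerplate):

```python
import numpy as np

def bootstrap_median_quantile_test(sample, null_value, n_resamples=10_000, rng=None):
    """Declare 'median > null_value' only if more than 95% of resampled medians
    exceed null_value, rather than relying on a consistency-style test that
    would just converge on the sample median."""
    rng = rng or np.random.default_rng()
    sample = np.asarray(sample)
    medians = np.array([np.median(rng.choice(sample, size=len(sample), replace=True))
                        for _ in range(n_resamples)])
    return np.mean(medians > null_value) > 0.95
```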

Query:
Why resort to randomness? I think we should use combinatorics instead. If we want to do optimal bootstrapping, I imagine there should be codebooks out there with good bootstraps. An ideal set of resamples wouldn't leave any big gaps in the space of possible resamples. Maximizing the minimum distance between resamples sounds like it would be a good heuristic. I imagine there's also room for algebraic cleverness.

We can view the original sample as having information about the population, and the resamples as having information about the original sample. Does this shed any light on how to pick resamples optimally?



1 - One team member thought I was violating the sacredness of the bootstrap procedure. I should have named the file "Shoestrap.java".
gusl: (Default)
Over lunch on Friday, Michael Littman told me the following idea: Suppose you have a procedure for making confidence intervals out of an IID sample from a population. If you run this sample/estimate cycle multiple times, the resulting confidence intervals should mostly overlap with each other. This is a coherence constraint, which I'll try to make more precise (later), the intuition being: the higher the confidence levels, the more they should overlap (by some measure).

This also suggests a bootstrap procedure to produce multiple confidence intervals, as a way to approximate the sampling distribution of the "confidence interval" statistic.
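One concrete instantiation (hedged: the t-interval for the mean is just a convenient choice of "confidence interval" statistic, and the overlap measure is left open):

```python
import numpy as np
from scipy import stats

def bootstrap_confidence_intervals(sample, level=0.95, n_resamples=1000, rng=None):
    """For each bootstrap resample, compute a t-interval for the mean; the
    collection of intervals approximates the sampling distribution of the
    'confidence interval' statistic."""
    rng = rng or np.random.default_rng()
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    tcrit = stats.t.ppf(0.5 + level / 2.0, df=n - 1)
    intervals = []
    for _ in range(n_resamples):
        boot = rng.choice(sample, size=n, replace=True)
        m, se = boot.mean(), boot.std(ddof=1) / np.sqrt(n)
        intervals.append((m - tcrit * se, m + tcrit * se))
    return np.array(intervals)   # one (lower, upper) row per resample
```

How often, and by how much, these intervals overlap is then one way to operationalize the coherence constraint.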

Student (a.k.a. William Sealy Gosset) has supposedly dealt with these questions, including some question about infinite regress, but I haven't yet found the reference.
gusl: (Default)
Hypothesis: In exponential family distributions, the moments are sufficient statistics. Prove or refute.
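For reference, the canonical exponential-family form, in which T(x) is by definition the sufficient statistic:

```latex
f(x \mid \theta) = h(x)\, \exp\!\big( \eta(\theta) \cdot T(x) - A(\theta) \big)
```

So the hypothesis is essentially asking whether T(x) can always be taken to be a vector of powers of x (making the sample moments sufficient).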

Your answer could fill what seems to be a Google gap.

Regarding Google's ability to answer such queries: my previous post on the triangle equality for correlation distance (including [livejournal.com profile] en_ki's elegant solution) seems to have fallen off Google's map. I guess that's what happens when people don't link to you.
gusl: (Default)
Suppose you have points on the real line, from two classes, e.g. random samples from the sampling distribution of a certain statistic, under two experimental conditions, C1 and C2.



You want to confidently state that, within a certain p-value, having observed a sample from the sampling distribution under C1 and a sample from the sampling distribution under C2, the two populations really are different wrt the statistic.

So you want to count how often the trend is broken (how often a random red will be bigger than a random blue); this proportion of reversals is the p-value.



In the above diagram, the proportion of white is an estimate of the p-value. This estimate is consistent in the sense that it converges to the true p-value when both n1 and n2 go to infinity.


Naively, you could make a comparison for each of the n1*n2 combinations, but it's more efficient to do the following:

* sort them
* keep only the order information. The above data becomes the string 001011001111011.
* loop over the characters
** keep a count of the 0s
** whenever we see a 1, we print the number of zeros we have seen so far

It's essentially computing a discrete CDF of the distribution of the 0s, where the 1s define the measure of the sample space.
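A sketch of that scan in Python (summing, rather than printing, the zero-counts, so that the result is the proportion described above; here 0 = blue and 1 = red, since the original string's labeling isn't specified):

```python
import numpy as np

def reversal_proportion(red, blue):
    """Proportion of (red, blue) pairs with red above blue, via one sorted scan
    instead of all n1*n2 comparisons."""
    values = np.concatenate([blue, red])
    labels = np.concatenate([np.zeros(len(blue), dtype=int),   # 0 = blue
                             np.ones(len(red), dtype=int)])    # 1 = red
    zeros_seen = 0   # blues passed so far in the ascending sweep
    pairs = 0
    for lab in labels[np.argsort(values, kind="stable")]:
        if lab == 0:
            zeros_seen += 1
        else:
            pairs += zeros_seen   # blues that sort below this red
    return pairs / (len(red) * len(blue))
```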

---

UPDATE (15Jan): this can be used to test whether two samples come from the same distribution, e.g. the Two-Sample Cramer-von Mises test.
gusl: (Default)
Given a dataset with 3 variables A, B, C, you observe a value for the correlation between A and B, and another value for the correlation between B and C.

Question: what are the possible values of the correlation between A and C?

Answer (due to [livejournal.com profile] en_ki): if your dataset has n points, center it, and consider A, B and C as vectors in ℝⁿ. The empirical correlation between two variables is the cosine of the angle between them.

Thus the question becomes easy: what are the possible values of the angle AC, given the angles AB and BC?
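To finish the computation from the cosine picture (a hedged aside, since the original worked solution sits behind the cut): writing θ_XY = arccos ρ_XY, the attainable range is

```latex
\cos(\theta_{AB} + \theta_{BC}) \;\le\; \rho_{AC} \;\le\; \cos(\theta_{AB} - \theta_{BC}),
\qquad\text{i.e.}\qquad
\rho_{AC} \in \Big[ \rho_{AB}\rho_{BC} - \sqrt{(1-\rho_{AB}^2)(1-\rho_{BC}^2)},\;
                    \rho_{AB}\rho_{BC} + \sqrt{(1-\rho_{AB}^2)(1-\rho_{BC}^2)} \Big],
```

with both endpoints attainable once the centered data has enough dimensions.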


gusl: (Default)
Why do we care about finding MLE, MAP parameter-settings? The focus on the maximum reminds me of the mode, a statistic that one doesn't care about most of the time. (Instead, one tends to focus on the mean and median).

Take the graph below:


[Figure: likelihood graph]



If your posterior looks like this, is the red circle really your best guess about theta?

Why don't we work with the full uncertainty about theta? Because it tends to be computationally expensive to do so?

---

Suppose you have a very expensive Bayesian update. The standard answer is to use MCMC to sample from the posterior. But suppose you're not interested in the whole posterior, just a small region of it (or a particular point even, which is a region of measure zero). Are there ways to prune away points that are going to end up outside my region of interest, or to estimate my point?
