multivariate two-sample tests
Jun. 12th, 2009 03:33 pm
A summer school colleague who is a neuroscientist recently asked for a way to test whether distributions in 4D space are the same.
First, I came up with the idea of using Voronoi cells as bins: let sample 1 define the Voronoi cells (bins), and let the points from sample 2 fall into them. If both samples come from the same distribution, each bin should be (asymptotically) equally likely to receive a point, so the counts over the set of bins will follow a uniform distribution. (Computing how unlikely a deviation is isn't that easy: if sample 1 were infinite, the bin counts would each have a binomial distribution with the same parameter, but what should you do in reality?)
To prove my claim of uniformity, we need the fact that, if we're sampling n points IID from a joint distribution, then there exists a constant k such that for any point p, the area (volume, in higher dimensions) of the Voronoi cell around p converges in probability to k / (n · density at p) as n → ∞. (I think that k = 1.) [Did I state this too strongly?]
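Here is a minimal sketch of the Voronoi-bin idea in plain Python. A point lies in the Voronoi cell of its nearest sample 1 point, so no explicit tessellation is needed. The function names, the chi-square-style statistic, and especially the permutation step are my own assumptions for illustration: the permutation test is a swapped-in, generic way to get a p-value when sample 1 is finite, not something derived from the asymptotic argument above.

```python
import math
import random


def voronoi_bin_counts(sample1, sample2):
    """For each sample1 point, count the sample2 points in its Voronoi cell.

    Cell membership is decided by brute-force nearest-neighbour search.
    """
    counts = [0] * len(sample1)
    for q in sample2:
        nearest = min(range(len(sample1)),
                      key=lambda i: math.dist(sample1[i], q))
        counts[nearest] += 1
    return counts


def voronoi_statistic(sample1, sample2):
    """Chi-square-style deviation of the bin counts from equal expected counts.

    Assumption: equal expected counts, i.e. the uniformity claim is taken
    at face value.
    """
    counts = voronoi_bin_counts(sample1, sample2)
    expected = len(sample2) / len(sample1)
    return sum((c - expected) ** 2 for c in counts) / expected


def permutation_p_value(sample1, sample2, n_perm=200, seed=0):
    """Permutation p-value: reshuffle the pooled points and recompute.

    Under the null hypothesis the labels are exchangeable, so the observed
    statistic is compared against its values under random relabelling.
    """
    rng = random.Random(seed)
    observed = voronoi_statistic(sample1, sample2)
    pooled = list(sample1) + list(sample2)
    n1 = len(sample1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if voronoi_statistic(pooled[:n1], pooled[n1:]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

With two lists of 4-tuples `a` and `b`, `permutation_p_value(a, b)` gives a p-value for the direction in which `a` defines the bins. The brute-force search is quadratic in the sample sizes, which is fine for a sketch but would want a k-d tree for larger data.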
Then I did a brief literature review, which I quite enjoyed doing because all these ideas are so intuitive.
In some of these tests, you can swap the roles of the two samples and run the test again. In such cases, the p-value should ideally be defined by a combination of the two scores (it seems a priori plausible that one direction rejects while the other direction fails to).
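One simple way to combine the two directions, sketched below, is a Bonferroni correction: take the smaller of the two p-values and double it. This is my own choice for illustration (it is conservative but stays valid however the two directions depend on each other), not a combination taken from the literature I reviewed. The name `symmetric_p_value` and the `test` argument are hypothetical; `test` stands for any function that takes the two samples in order and returns a p-value.

```python
def symmetric_p_value(test, sample1, sample2):
    """Run an asymmetric two-sample test in both directions and combine.

    Bonferroni combination: twice the smaller p-value, capped at 1.
    """
    p_forward = test(sample1, sample2)   # sample1 plays the "reference" role
    p_backward = test(sample2, sample1)  # roles swapped
    return min(1.0, 2.0 * min(p_forward, p_backward))
```

If one direction gives 0.01 and the other 0.4, the combined value is 0.02, so a rejection in a single direction still survives, at the price of the factor of two.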