algorithm for frequentist order statistics
Jan. 2nd, 2009 04:13 pmSuppose you have points on the real line, from two classes, e.g. random samples from the sampling distribution of a certain statistic, under two experimental conditions, C1 and C2.

You want to confidently state that, within a certain p-value, having observed a sample from the sampling distribution under C1 and a sample from the sampling distribution under C2, the two populations really are different wrt the statistic.
So you want to count how often the trend is broken (how often a random red will be bigger than a random blue). This is the p-value.

In the above diagram, the proportion of white is an estimate of the p-value. This estimate is consistent in the sense that it converges to the true p-value when both n1 and n2 go to infinity.
Naively, you could make a comparison for each of the n1*n2 combinations, but it's more efficient to do the following:
* sort them
* keep only the order information. The above data becomes the string 001011001111011.
* loop over the characters
** keep a count of the 0s
** whenever we see a 1, we print the number of zeros we have seen so far
it's essentially computing a discrete CDF of the distribution of the 0s, where the 1s define the measure of the sample space.
---
UPDATE (15Jan): this can be used to test whether two samples come from the same distribution, e.g. the Two-Sample Cramer-von Mises test.
You want to confidently state that, within a certain p-value, having observed a sample from the sampling distribution under C1 and a sample from the sampling distribution under C2, the two populations really are different wrt the statistic.
So you want to count how often the trend is broken (how often a random red will be bigger than a random blue). This is the p-value.
In the above diagram, the proportion of white is an estimate of the p-value. This estimate is consistent in the sense that it converges to the true p-value when both n1 and n2 go to infinity.
Naively, you could make a comparison for each of the n1*n2 combinations, but it's more efficient to do the following:
* sort them
* keep only the order information. The above data becomes the string 001011001111011.
* loop over the characters
** keep a count of the 0s
** whenever we see a 1, we print the number of zeros we have seen so far
it's essentially computing a discrete CDF of the distribution of the 0s, where the 1s define the measure of the sample space.
---
UPDATE (15Jan): this can be used to test whether two samples come from the same distribution, e.g. the Two-Sample Cramer-von Mises test.