Jan. 2nd, 2009

gusl: (Default)
Suppose you have points on the real line, from two classes, e.g. random samples from the sampling distribution of a certain statistic, under two experimental conditions, C1 and C2.



You want to confidently state that, within a certain p-value, having observed a sample from the sampling distribution under C1 and a sample from the sampling distribution under C2, the two populations really are different wrt the statistic.

So you want to count how often the trend is broken (how often a random red will be bigger than a random blue). This is the p-value.



In the above diagram, the proportion of white is an estimate of the p-value. This estimate is consistent in the sense that it converges to the true p-value when both n1 and n2 go to infinity.


Naively, you could make a comparison for each of the n1*n2 combinations, but it's more efficient to do the following:

* sort them
* keep only the order information. The above data becomes the string 001011001111011.
* loop over the characters
** keep a count of the 0s
** whenever we see a 1, we print the number of zeros we have seen so far

it's essentially computing a discrete CDF of the distribution of the 0s, where the 1s define the measure of the sample space.

---

UPDATE (15Jan): this can be used to test whether two samples come from the same distribution, e.g. the Two-Sample Cramer-von Mises test.

February 2020

S M T W T F S
      1
2345678
9101112131415
16171819202122
23242526272829

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags