bootstrap sampling
Jan. 25th, 2009 12:30 pm
Bootstrap sampling is a procedure for estimating (properties of) the sampling distribution of a statistic when you can't sample directly from the population (i.e. the typical setting), but have a limited IID sample S.
If we could sample directly from the population, we would do the following: sample, compute the statistic, repeat. This way, we are sampling directly from the sampling distribution, which means we have a (nearly) consistent estimate.
When we don't have access to the population (i.e. can't sample directly from it), the best we can do is reuse the data points, "recycle" if you will. In bootstrap sampling, you sample from the original sample S with replacement (typically creating a sample of the same size as S).
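For concreteness, here is a minimal sketch of that resampling loop in Java. The data, the choice of the mean as the statistic, and all the names are illustrative, not anything canonical:

```java
import java.util.Random;

// Minimal bootstrap sketch: resample S with replacement, compute the statistic, repeat.
public class BootstrapSketch {
    public static void main(String[] args) {
        double[] s = {2.3, 1.7, 3.1, 2.8, 2.0, 3.5, 1.9, 2.6};  // the original sample S
        int numResamples = 10_000;
        Random rng = new Random(42);

        double[] resampledMeans = new double[numResamples];
        for (int b = 0; b < numResamples; b++) {
            double sum = 0.0;
            for (int i = 0; i < s.length; i++) {
                sum += s[rng.nextInt(s.length)];  // draw with replacement
            }
            resampledMeans[b] = sum / s.length;
        }
        // resampledMeans now approximates the sampling distribution of the mean.
    }
}
```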
Problems:
* bootstrap samples have less "diversity" than the original sample (except when they are exactly the same as the original, up to permutation). This can hurt when you end up with resamples whose data matrix has lower rank, which happens most often when the k << n assumption (# features << size of sample) does not hold. I ran into this when working with ICA, leading me to modify the bootstrap procedure to reject samples that have too many repetitions (too little "diversity"); see the sketch below.1
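A sketch of that modification, assuming the rejection rule is simply "too few distinct points"; the threshold and the names here are mine, not the original code:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Sketch of a bootstrap that rejects low-"diversity" resamples. The rejection
// criterion (a minimum number of distinct points) is an illustrative stand-in.
public class DiversityBootstrap {
    private final Random rng = new Random();

    /** Draw n indices with replacement, retrying until at least minDistinct are distinct. */
    public int[] resampleIndices(int n, int minDistinct) {
        while (true) {
            int[] idx = new int[n];
            Set<Integer> distinct = new HashSet<>();
            for (int i = 0; i < n; i++) {
                idx[i] = rng.nextInt(n);
                distinct.add(idx[i]);
            }
            if (distinct.size() >= minDistinct) {
                return idx;  // enough distinct points; keep this resample
            }
            // otherwise discard the resample and draw again
        }
    }
}
```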
Misconceptions:
* treating bootstrap samples as samples from the population. If you're trying to test the hypothesis that the median of a continuous-valued population is greater than some number, you'd normally use a procedure that gets arbitrarily confident as the sample size goes to infinity. But in the bootstrap setting you can't get arbitrarily confident: apply such a procedure to the resamples and you'll converge on the sample median instead! (I'd love to quote an information-theoretic theorem to the effect that the population never reveals herself completely in finite samples.) Instead, for a large enough number of resamples, you may wish to do a quantile test: for 95% confidence, check whether more than 95% of the resampled medians are greater than the value appearing in the null hypothesis (sketched below).
For the same reason, it should be obvious that bootstrapping is a pretty bad procedure for some statistics. I have a feeling that this is especially the case when the true sampling distribution has a large spread at the given sample size.
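The quantile test mentioned above might look like this in code; the data, the null value, and the resample count are all illustrative:

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of the quantile-style test on resampled medians: resample many times,
// compute each median, and check whether more than 95% of them exceed the
// value in the null hypothesis.
public class BootstrapMedianTest {
    public static void main(String[] args) {
        double[] s = {4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.8, 5.2, 4.7};
        double nullValue = 4.5;      // H0: the population median is <= 4.5
        int numResamples = 10_000;
        Random rng = new Random(0);

        int countGreater = 0;
        for (int b = 0; b < numResamples; b++) {
            double[] r = new double[s.length];
            for (int i = 0; i < s.length; i++) {
                r[i] = s[rng.nextInt(s.length)];  // resample with replacement
            }
            Arrays.sort(r);
            int n = r.length;
            double median = (n % 2 == 1) ? r[n / 2] : (r[n / 2 - 1] + r[n / 2]) / 2.0;
            if (median > nullValue) countGreater++;
        }
        double fraction = (double) countGreater / numResamples;
        System.out.println("fraction of resampled medians above " + nullValue + ": " + fraction);
        System.out.println("reject H0 at the 95% level: " + (fraction > 0.95));
    }
}
```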
Query:
Why resort to randomness? I think we should use combinatorics instead. If we want to do optimal bootstrapping, I imagine there should be codebooks out there with good bootstraps. An ideal set of resamples wouldn't leave any big gaps in the space of possible resamples. Maximizing the minimum distance between resamples sounds like it would be a good heuristic. I imagine there's also room for algebraic cleverness.
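As a toy version of the maximin idea: represent each resample as a count vector over the n original points, generate a random pool of candidates, and greedily keep the ones that maximize the minimum L1 distance to those already kept. This is purely a sketch of the heuristic, not an established procedure; every name and choice here is mine:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Greedy maximin sketch: spread the chosen resamples out in "resample space",
// where a resample is a count vector over the n original points.
public class MaximinResamples {

    /** L1 distance between two count vectors. */
    static int l1(int[] a, int[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) d += Math.abs(a[i] - b[i]);
        return d;
    }

    public static List<int[]> choose(int n, int numCandidates, int numKeep, long seed) {
        Random rng = new Random(seed);
        List<int[]> candidates = new ArrayList<>();
        for (int c = 0; c < numCandidates; c++) {
            int[] counts = new int[n];               // how many times each original point appears
            for (int i = 0; i < n; i++) counts[rng.nextInt(n)]++;
            candidates.add(counts);
        }

        List<int[]> chosen = new ArrayList<>();
        chosen.add(candidates.get(0));               // arbitrary starting resample
        while (chosen.size() < numKeep) {
            int[] best = null;
            int bestMinDist = -1;
            for (int[] cand : candidates) {
                int minDist = Integer.MAX_VALUE;
                for (int[] kept : chosen) minDist = Math.min(minDist, l1(cand, kept));
                if (minDist > bestMinDist) { bestMinDist = minDist; best = cand; }
            }
            chosen.add(best);                        // greedily maximize the minimum distance
        }
        return chosen;
    }
}
```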
We can view the original sample as having information about the population, and the resamples as having information about the original sample. Does this shed any light on how to pick resamples optimally?
1 - One team member thought I was violating the sacredness of the bootstrap procedure. I should have named the file "Shoestrap.java".