ICA project - bootstrap sampling
Jun. 22nd, 2007 02:05 am
Recently, I've been surprised to learn that statistics is sometimes done empirically. When you can't figure out the distribution of a statistic analytically (as in the case of the coefficient estimates given by ICA), you may want to estimate it this way:
* take "bootstrap samples"
* compute the statistic for each sample
* look at the resulting distribution
Here's how Peter put it: what we'd like to do is sample more points from the population. Since we can't do that, the next best thing is to sample with replacement from the original sample, i.e. "bootstrap sampling".
My code does this resampling by drawing random samples, with replacement, of the same size as the original sample (a sketch follows below). We could probably do better with nice combinatorial designs instead, but how would I find such designs?
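In code, the resampling step is just drawing rows with replacement until we have as many rows as the original sample. Here is a minimal sketch of the idea; the class and method names are mine for illustration, not the actual Tetrad code:

import java.util.Random;

// Sketch only: draw n rows with replacement from an n-row data matrix.
public class BootstrapSketch {

    private static final Random rng = new Random();

    // Returns one bootstrap sample of the same size as 'data'.
    static double[][] bootstrapSample(double[][] data) {
        int n = data.length;
        double[][] sample = new double[n][];
        for (int i = 0; i < n; i++) {
            sample[i] = data[rng.nextInt(n)];  // pick a row index uniformly at random
        }
        return sample;
    }
}

Repeating this 50 times and re-running the ICA estimation on each sample gives the empirical distribution of each coefficient.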
This is a summary of my progress on this project today:
* do the lower-triangular permutation before pruning. DONE.
* do bootstrap sampling, instead of partition sampling: DONE. We are
now making 50 bootstrap samples.
This is problematic when we don't have enough different points,
because ICA will give us a weight matrix that is not square.
Therefore, I have defined the 'diversity' of a bootstrap sample to be
the number of distinct points. Here is a particularly non-diverse
bootstrap sample:
currentDataSet = X1 X2 X3 X4 X5
0.5284 0.4467 0.4434 0.2567 0.6529
1.0076 -0.5040 1.7242 0.3347 1.5690
0.2867 -0.2046 0.6751 0.2294 -0.1085
1.0076 -0.5040 1.7242 0.3347 1.5690
0.2867 -0.2046 0.6751 0.2294 -0.1085
0.2867 -0.2046 0.6751 0.2294 -0.1085
0.5284 0.4467 0.4434 0.2567 0.6529
-0.8486 -1.1689 -0.4554 -0.8158 -0.3105
-0.9118 0.9590 -1.4452 0.0303 -1.0804
0.5284 0.4467 0.4434 0.2567 0.6529
-0.9118 0.9590 -1.4452 0.0303 -1.0804
0.5284 0.4467 0.4434 0.2567 0.6529
diversity = 5
java.lang.Exception: W is not square!
at edu.cmu.tetrad.search.Shimizu2006Search.lingamDiscovery_Bhat(Shimizu2006Search.java:137)
at edu.cmu.tetrad.search.Shimizu2006Search.pruneEdgesBySampling(Shimizu2006Search.java:593)
at edu.cmu.tetrad.search.Shimizu2006Search.pruneEdges(Shimizu2006Search.java:565)
at edu.cmu.tetrad.search.Shimizu2006Search.lingamDiscovery_DAG(Shimizu2006Search.java:554)
W = 4 x 5 matrix
1.176637 0.338342 -0.926348 -1.769087 1.570344
2.682972 1.148474 -1.742103 -3.004723 0.195186
-9.113527 -0.863811 1.156382 7.309246 3.902053
-3.202869 0.726239 1.405774 4.534239 0.518673
From now on, if the diversity is not greater than the number of
variables, we throw out the bootstrap sample and make another one in
its place.
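Here is a small sketch of the diversity check just described (again, my own helper for illustration, not the actual Tetrad code):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch only: 'diversity' = number of distinct rows in a bootstrap sample.
public class DiversityCheck {

    static int diversity(double[][] sample) {
        Set<String> distinct = new HashSet<String>();
        for (double[] row : sample) {
            distinct.add(Arrays.toString(row));  // rows are equal iff their printed forms match
        }
        return distinct.size();
    }

    // Reject the sample (and draw a new one) unless diversity exceeds the number of variables.
    static boolean isDiverseEnough(double[][] sample, int numVariables) {
        return diversity(sample) > numVariables;
    }
}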
* do t-test, instead of the ad-hoc 'pruneFactor' method:
The question we want to ask is: "is 0 a plausible value to get from
the distribution just sampled?"
But it seems to me that the question the t-test answers is: "is it
plausible that this value is the mean of the population from which we
have sampled?"
http://en.wikipedia.org/wiki/Student's_t-test#Use
So for now, I'm just testing whether 0 falls within the two-sided 90%
confidence interval of a Gaussian distribution (a sketch follows at the
end of this item). I still need to correct my stdev for the effect of
the small sample size, although 50 isn't that small
[http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation].
If the null hypothesis (that the coefficient is zero) is not rejected,
the edge is pruned (in practice, we are claiming to reject the non-zero
hypothesis). This means that for small sample sizes, where the
uncertainty in the coefficients is large, everything gets pruned and we
end up with an empty graph.
Do you want to propose an alternative way?
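For reference, here is roughly what the pruning test looks like. This is my own sketch, and I'm treating the standard deviation of the 50 bootstrap estimates as the standard error of the coefficient (the usual bootstrap convention), which may not match my code exactly:

// Sketch only: prune the edge if 0 lies inside the two-sided (1 - alpha)
// Gaussian confidence interval around the mean bootstrap coefficient.
public class ZeroInIntervalTest {

    // zAlpha is the two-sided critical value, e.g. about 1.645 for alpha = 0.10.
    static boolean pruneEdge(double[] bootstrapEstimates, double zAlpha) {
        int b = bootstrapEstimates.length;

        double mean = 0.0;
        for (double x : bootstrapEstimates) mean += x;
        mean /= b;

        double ss = 0.0;
        for (double x : bootstrapEstimates) ss += (x - mean) * (x - mean);
        double sd = Math.sqrt(ss / (b - 1));  // sample stdev of the bootstrap distribution

        double lower = mean - zAlpha * sd;
        double upper = mean + zAlpha * sd;
        return lower <= 0.0 && 0.0 <= upper;  // 0 is plausible, so prune the edge
    }
}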
* Only compute coefficients in PC when all parents are determined. DONE.
We now only have coefficients for those nodes that have a known set of parents.
However, we compare coefficients for all edges that exist in the true
graph and have the correct orientation.
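As a sketch of that rule (GraphView and its methods are hypothetical stand-ins for illustration, not the Tetrad API):

import java.util.List;

// Sketch only: estimate coefficients for a node only when its parent set is fully determined.
interface GraphView {
    List<String> nodes();
    boolean parentsDetermined(String node);  // true iff every edge into 'node' is oriented
    List<String> parents(String node);
}

class PcCoefficientRule {
    static void estimateWhereDetermined(GraphView g) {
        for (String node : g.nodes()) {
            if (!g.parentsDetermined(node)) {
                continue;  // parents unknown: leave this node's coefficients undefined
            }
            List<String> parents = g.parents(node);
            // ... regress 'node' on 'parents' here to obtain its coefficients ...
        }
    }
}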
We can start comparing results of Shimizu (ICA) vs. PC, except that Shimizu always returns an empty graph for small sample sizes.
To give Shimizu the benefit of the doubt, I should change its
parameters (e.g. 'alpha') to an optimal setting.
Changing 'alpha' from 0.10 to 0.20 seems to improve it a lot, but it's still erring on the sparse side. I guess I should make errors of
omission and errors of commission about equally likely, as is the
case with PC. alpha=0.30 seems better than 0.20.
My vague impression is that a two-sided 50% confidence interval will
minimize the total number of errors (i.e. errors of omission + errors
of commission).
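For reference (my own back-of-the-envelope, assuming the Gaussian test above): an edge survives pruning only if |mean| > z * stdev, where z is the two-sided critical value, roughly 1.645 for alpha = 0.10, 1.282 for 0.20, 1.036 for 0.30, and 0.674 for 0.50 (the 50% interval). So moving from alpha = 0.10 to 0.50 cuts the pruning threshold by more than half, which is consistent with larger alpha giving denser graphs.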
* take "bootstrap samples"
* compute the statistic for each sample
* look at the resulting distribution
Here's how Peter put it: what we'd like to do is sample more points from the population. Since we can't do that, the next best thing is to sample with replacement from the original sample, i.e. "bootstrap sampling".
My code is doing this resampling by taking random samples of the same size as the original sample. But we should be able to do better if we had nice combinatorial designs instead. But how do I find such designs?
This is a summary of my progress on this project today:
* do the lower-triangular permutation before pruning. DONE.
* do bootstrap sampling, instead of partition sampling: DONE. We are
now making 50 bootstrap samples.
This is problematic when we don't have enough different points,
because ICA will give us a weight matrix that is not square.
Therefore, I have defined the 'diversity' of a bootstrap sample to be
the number of distinct points. Here is a particularly non-diverse
bootstrap sample:
currentDataSet = X1 X2 X3 X4 X5
0.5284 0.4467 0.4434 0.2567 0.6529
1.0076 -0.5040 1.7242 0.3347 1.5690
0.2867 -0.2046 0.6751 0.2294 -0.1085
1.0076 -0.5040 1.7242 0.3347 1.5690
0.2867 -0.2046 0.6751 0.2294 -0.1085
0.2867 -0.2046 0.6751 0.2294 -0.1085
0.5284 0.4467 0.4434 0.2567 0.6529
-0.8486 -1.1689 -0.4554 -0.8158 -0.3105
-0.9118 0.9590 -1.4452 0.0303 -1.0804
0.5284 0.4467 0.4434 0.2567 0.6529
-0.9118 0.9590 -1.4452 0.0303 -1.0804
0.5284 0.4467 0.4434 0.2567 0.6529
diversity = 5
java.lang.Exception: W is not square!
at edu.cmu.tetrad.search.Shimizu2006Search.lingamDiscovery_Bhat(Shimizu2006Search.java:137)
at edu.cmu.tetrad.search.Shimizu2006Search.pruneEdgesBySampling(Shimizu2006Search.java:593)
at edu.cmu.tetrad.search.Shimizu2006Search.pruneEdges(Shimizu2006Search.java:565)
at edu.cmu.tetrad.search.Shimizu2006Search.lingamDiscovery_DAG(Shimizu2006Search.java:554)W
= 4 x 5 matrix
1.176637 0.338342 -0.926348 -1.769087 1.570344
2.682972 1.148474 -1.742103 -3.004723 0.195186
-9.113527 -0.863811 1.156382 7.309246 3.902053
-3.202869 0.726239 1.405774 4.534239 0.518673
From now on, if the diversity is not greater than the number of
variables, we throw out the bootstrap sample and make another one in
its place.
* do t-test, instead of the ad-hoc 'pruneFactor' method:
The question we want to ask is: "is 0 a plausible value to get from
the distribution just sampled?"
But it seems to me that the question the t-test answers is: "is it
plausible that this value is the mean of the population from which we
have sampled?"
http://en.wikipedia.org/wiki/Student's_t-test#Use
So for now, I'm just testing whether 0 falls within the two-sided 90%
confidence interval of a Gaussian distribution (though I still need
to correct my stdev for the effect of the small sample size (although
50 isn't that small),
[http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation]
).
If the zero hypothesis is not rejected, the edge is pruned. (in
practice, we are claiming to reject the non-zero hypothesis). This
means that for small sample sizes, since we have a large uncertainty
in the coefficients, everything will be pruned and we get an empty
graph.
Do you want to propose an alternative way?
* Only compute coefficients in PC when all parents are determined. DONE.
We now only have coefficients for those nodes that have a known set of parents.
However, we compare coefficients for all edges that exist in the true
graph and have the correct orientation.
We can start comparing results of Shimizu(ICA) vs. PC, except that Shimizu always returns an empty graph for small sample sizes.
To give Shimizu the benefit of the doubt, I should change their
parameters (e.g. 'alpha') to an optimal setting.
Changing 'alpha' from 0.10 to 0.20 seems to improve it a lot, but it's still erring on the sparse side. I guess I should make errors of
ommission and errors of commission about equally likely, as is the
case with PC. alpha=0.30 seems better than 0.20.
My vague impression is that a two-sided 50% confidence interval will
minimize the total number of errors (i.e. errors of omission + errors
of commission).