probability theory
Jan. 19th, 2007 06:14 pm

The Chi-squared statistic gives us a test of statistical independence between n variables with "unordered" values (i.e. there are no relations between the possible values, e.g. {apple, orange, tomato} rather than {0, 0.5, 1}).
The Chi-squared statistic is the sum, over each possible value (cell) of the joint distribution, of the squared error divided by the expected count. The error is the difference between the observed count and the count expected under the independence assumption, i.e. the product of the marginals (scaled by the sample size).
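As a minimal sketch in Python with numpy (the contingency table here is just made-up counts for illustration):

import numpy as np

# Observed contingency table (made-up counts): rows = values of X, columns = values of Y
observed = np.array([[12.,  5.,  8.],
                     [ 7., 15.,  3.]])

n = observed.sum()
px = observed.sum(axis=1) / n   # marginal distribution of X
py = observed.sum(axis=0) / n   # marginal distribution of Y

# Expected counts under independence: n * P(x) * P(y) for every cell
expected = n * np.outer(px, py)

# Pearson's Chi-squared: squared error per cell, divided by the expected count
chi2 = ((observed - expected) ** 2 / expected).sum()

(scipy.stats.chi2_contingency does essentially the same computation and also returns the p-value against the appropriate degrees of freedom.)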
How can you do better if you have structure over the values of these variables (either some of them, or all of them)? Remember: a correlation coefficient of 0 does not imply independence (e.g. if X is symmetric about 0 and Y = X^2, their correlation is 0 even though Y is completely determined by X).
In order to do better, I think it's necessary to use some sort of continuity assumption.
---
I'm imagining an analog of the Chi-squared in 2 dimensions (a rough sketch in code follows the list), in which you:
* estimate the joint density using kernels (i.e. a fuzzy blob at each data point, possibly sharper in higher-density regions, which would need some bootstrapping)
* make a grid (i.e. a 2D histogram)
* perform a Chi-squared, using the "weight" of each little rectangle as the joint frequencies, and the weights of the rows and columns as the marginal frequencies.
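Here is a minimal sketch of that procedure, assuming a plain fixed-bandwidth Gaussian kernel (no adaptive sharpening) via scipy's gaussian_kde; the function name and the choice of 20 bins are just for illustration:

import numpy as np
from scipy.stats import gaussian_kde

def kernel_chi2(x, y, bins=20):
    # Kernel density estimate of the joint distribution (fixed bandwidth)
    kde = gaussian_kde(np.vstack([x, y]))

    # Evaluate it on a regular grid -- this plays the role of the 2D histogram
    gx = np.linspace(x.min(), x.max(), bins)
    gy = np.linspace(y.min(), y.max(), bins)
    xx, yy = np.meshgrid(gx, gy, indexing="ij")
    weights = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(bins, bins)
    weights /= weights.sum()            # cell weights = joint frequencies

    # Row and column weights = marginal frequencies
    px = weights.sum(axis=1)
    py = weights.sum(axis=0)
    expected = np.outer(px, py)

    # Chi-squared over the grid cells, scaled by the sample size
    n = len(x)
    return n * ((weights - expected) ** 2 / expected).sum()

Note that the smoothed cells are no longer independent counts, so the usual Chi-squared null distribution wouldn't apply directly, and the "sharper kernels in dense regions" idea would additionally need an adaptive bandwidth on top of this.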
In the limit, as the 2D histogram gets very fine, I think this will converge to the "optimal" statistic. But the degrees of freedom ((rows - 1)(columns - 1) for the standard test) get pretty big that way. I don't know if that's a problem.
It seems I should be able to do better by somehow getting rid of the arbitrariness of where the histogram makes its cutoffs. Maybe the solution is to make the weight of a rectangle depend on the weights of the neighboring rectangles in a fuzzy, continuous way. But don't the kernels already do this? And to really be continuous, it seems like I should get rid of the rectangles altogether.
Another idea is to modify the Chi-squared statistic so that it takes neighbouring points into account. But again, it seems like using kernels already does the same thing.
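For the "weight borrowed from neighboring rectangles" variant, here is a sketch under the assumption that the smoothing is just a Gaussian blur of an ordinary 2D histogram (scipy.ndimage.gaussian_filter; the bin count and sigma are arbitrary choices):

import numpy as np
from scipy.ndimage import gaussian_filter

def smoothed_hist_chi2(x, y, bins=20, sigma=1.0):
    # Ordinary 2D histogram of the data
    counts, _, _ = np.histogram2d(x, y, bins=bins)

    # Blur the counts so each rectangle borrows weight from its neighbors
    weights = gaussian_filter(counts, sigma=sigma)
    weights /= weights.sum()

    px = weights.sum(axis=1)
    py = weights.sum(axis=0)
    expected = np.outer(px, py)

    # Skip empty cells to avoid dividing by zero
    ok = expected > 0
    n = len(x)
    return n * ((weights[ok] - expected[ok]) ** 2 / expected[ok]).sum()

The two sketches are closely related: blurring the histogram after binning is roughly the same as binning after a kernel density estimate, which is the point above about the kernels already doing this.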