[personal profile] gusl
Given a dataset with 3 variables A, B, C, you observe a value for the correlation between A and B, and another value for the correlation between B and C.

Question: what are the possible values of the correlation between A and C?

Answer (due to en_ki): if your dataset has n points, center each variable (subtract its mean), and consider A, B and C as vectors in ℝ^n. The empirical correlation between two variables is the cosine of the angle between the corresponding vectors.
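A quick numerical check of the cosine claim, as a minimal sketch. The data in A and B and the helper names (center, cosine, pearson) are made up for illustration; the point is only that the textbook correlation and the cosine of the angle between the centered vectors agree.

```python
import math

def center(xs):
    """Subtract the mean from every entry."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def pearson(xs, ys):
    """Textbook correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

# Hypothetical data, chosen only for illustration.
A = [1.0, 2.0, 4.0, 7.0]
B = [2.0, 1.0, 5.0, 6.0]

r = pearson(A, B)
c = cosine(center(A), center(B))
print(r, c)  # the two numbers should agree up to rounding
```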

Thus the question becomes easy: what are the possible values of the angle AC, given the angles AB and BC?
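To spell the answer out: angles between vectors obey a triangle inequality, so the angle AC lies between |AB - BC| and AB + BC, and R(A,C) lies between cos(AB + BC) and cos(AB - BC). Equivalently, R(A,C) is within sqrt((1 - R(A,B)^2)(1 - R(B,C)^2)) of the product R(A,B) R(B,C). Here is a small sketch of that computation; the function name ac_bounds and the example inputs are my own choices. The endpoints are reached when the three centered vectors are coplanar, and I believe every value in between is reachable once n >= 4, since centered vectors then have at least 3 dimensions to move in.

```python
import math

def ac_bounds(r_ab, r_bc):
    """Lower and upper bound on R(A,C), given R(A,B) and R(B,C)."""
    t_ab = math.acos(r_ab)  # angle between A and B
    t_bc = math.acos(r_bc)  # angle between B and C
    lo = math.cos(t_ab + t_bc)  # A and C on opposite sides of B
    hi = math.cos(t_ab - t_bc)  # A and C on the same side of B
    return lo, hi

lo, hi = ac_bounds(0.6, 0.8)
print(lo, hi)  # roughly 0.0 and 0.96
```

Note that cos(AB + BC) is still the right lower bound when the two angles add up to more than 180 degrees: with R(A,B) = R(B,C) = -1, both bounds come out as +1, as they should.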


---

Original post, for the record:

Is there a sort of triangle inequality for correlations? I'd like to get a lower bound on R(A,C) given R(A,B) and R(B,C). Imagine the latter two as being close to 0.95: is it possible for R(A,C) to be smaller than 0.5? I think not, but don't have a proof.
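For the 0.95 example, the worst case can be built by hand, as a sketch: take B as a centered unit vector and rotate it by the same angle in opposite directions to get A and C. The basis vectors e1, e2 and the helper names below are my own construction, not part of the original question. The resulting R(A,C) is 2 * 0.95^2 - 1 = 0.805, comfortably above 0.5.

```python
import math

def pearson(xs, ys):
    """Correlation of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Orthonormal basis of the plane of zero-mean vectors in R^3.
e1 = [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0]
e2 = [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)]

def combo(p, q):
    """p * e1 + q * e2"""
    return [p * x + q * y for x, y in zip(e1, e2)]

t = math.acos(0.95)                   # angle whose cosine is 0.95
B = e1
A = combo(math.cos(t), math.sin(t))   # B rotated by +t
C = combo(math.cos(t), -math.sin(t))  # B rotated by -t

r_ab, r_bc, r_ac = pearson(A, B), pearson(B, C), pearson(A, C)
print(r_ab, r_bc, r_ac)  # roughly 0.95, 0.95, 0.805
```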

Let me explore.


The simple estimator for the covariance from a sample of size n (the population covariance is its limit as n goes to infinity):

Cov(A,B) = 1/n (SUM_i (A_i - mu_A) (B_i - mu_B))


Correlations don't change if we rescale each variable to have variance 1, and then the correlation equals the covariance:

R(A,B) = 1/n (SUM_i (A_i - mu_A) (B_i - mu_B))

R(B,C) = 1/n (SUM_i (B_i - mu_B) (C_i - mu_C))

R(A,C) = 1/n (SUM_i (A_i - mu_A) (C_i - mu_C))


Correlations don't change if we recenter so that the means are 0:

R(A,B) = 1/n (SUM_i (A_i B_i))

R(B,C) = 1/n (SUM_i (B_i C_i))

R(A,C) = 1/n (SUM_i (A_i C_i))
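The two normalizations can be checked on a toy example. This is a sketch with made-up data; standardize and mean_product are hypothetical helper names. After shifting to mean 0 and scaling to variance 1 (with the 1/n convention used for Cov above), the correlation is just the average of the products.

```python
import math

def standardize(xs):
    """Shift to mean 0 and scale to variance 1 (1/n convention)."""
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return [(x - m) / sd for x in xs]

def mean_product(xs, ys):
    """1/n * SUM_i (x_i y_i)"""
    return sum(x * y for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data, chosen only for illustration.
A = [1.0, 2.0, 4.0, 7.0]
B = [2.0, 1.0, 5.0, 6.0]

a, b = standardize(A), standardize(B)
r = mean_product(a, b)
print(r)                   # R(A,B) for the toy data
print(mean_product(a, a))  # variance of the standardized A, should be 1
```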



So how could I prove any sort of triangle inequality here?


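Let n = 1: after centering, every variable is 0, so all 3 sums are 0 (strictly speaking the variances are 0 too, so the correlations are undefined).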

Let n = 2, then
R(A,B) = 1/2 (a1b1 + a2b2)
R(B,C) = 1/2 (b1c1 + b2c2)
R(A,C) = 1/2 (a1c1 + a2c2)

But since the means are 0:
a2 = -a1
b2 = -b1
c2 = -c1

So:
R(A,B) = 1/2 (a1b1 + a1b1) = a1b1
R(B,C) = 1/2 (b1c1 + b1c1) = b1c1
R(A,C) = 1/2 (a1c1 + a1c1) = a1c1

Since the variance is 1, a1^2 = 1, so a1, b1 and c1 are each +1 or -1. Hence < R(A,B) , R(B,C) , R(A,C) > is in {1, -1}^3, i.e. each possible assignment is a corner of the cube. The zero-mean and unit-variance constraints rule out some of the corners.

e.g.:
R(A,B) = +1
R(B,C) = +1
-----------------------------------------
a1=1, b1=1, c1=1 \/ a1=-1, b1=-1, c1=-1
----------------------------------------
R(A,C) = +1
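The n = 2 case is small enough to enumerate completely. A sketch (variable names are mine): loop over all sign choices for a1, b1, c1 and collect the correlation triples that actually occur. Only 4 of the 8 corners of the cube show up, and in each of them R(A,C) = R(A,B) * R(B,C), which follows from b1^2 = 1.

```python
from itertools import product

rows = []
for a1, b1, c1 in product([1, -1], repeat=3):
    # Zero mean forces the second coordinate to be minus the first.
    A, B, C = (a1, -a1), (b1, -b1), (c1, -c1)
    r_ab = (A[0] * B[0] + A[1] * B[1]) / 2
    r_bc = (B[0] * C[0] + B[1] * C[1]) / 2
    r_ac = (A[0] * C[0] + A[1] * C[1]) / 2
    rows.append((r_ab, r_bc, r_ac))

reachable = sorted(set(rows))
for triple in reachable:
    print(triple)
```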


But this is hardly a triangle inequality.


Let n = 3:

R(A,B) = 1/3 (a1b1 + a2b2 + a3b3)
R(B,C) = 1/3 (b1c1 + b2c2 + b3c3)
R(A,C) = 1/3 (a1c1 + a2c2 + a3c3)

Since the means are zero:
a3 = -a2-a1
b3 = -b2-b1
c3 = -c2-c1

So:
R(A,B) = 1/3 (a1b1 + a2b2 + (-a2-a1)(-b2-b1)) = 1/3 (a1b1 + a2b2 + a2b2 + a1b1 + a1b2 + a2b1) = 1/3 (2 a1b1 + 2 a2b2 + a1b2 + a2b1)

Similarly:
R(B,C) = 1/3 (2 b1c1 + 2 b2c2 + b1c2 + b2c1)
R(A,C) = 1/3 (2 a1c1 + 2 a2c2 + a1c2 + a2c1)
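The expansion above is easy to get wrong by a sign, so here is a numerical sanity check, as a sketch. It compares the direct three-term sum with the reduced form on random zero-mean triples; the helper names are mine. The identity is purely algebraic, so the triples are not rescaled to unit variance here.

```python
import random

random.seed(0)

def centered_triple():
    """Random (x1, x2, x3) with x3 = -x1 - x2, so the mean is zero."""
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    return x1, x2, -x1 - x2

def direct(x, y):
    """1/3 (x1y1 + x2y2 + x3y3)"""
    return sum(p * q for p, q in zip(x, y)) / 3

def reduced(x, y):
    """1/3 (2 x1y1 + 2 x2y2 + x1y2 + x2y1), using only the first two coordinates."""
    return (2 * x[0] * y[0] + 2 * x[1] * y[1] + x[0] * y[1] + x[1] * y[0]) / 3

max_gap = 0.0
for _ in range(1000):
    a, b = centered_triple(), centered_triple()
    max_gap = max(max_gap, abs(direct(a, b) - reduced(a, b)))

print(max_gap)  # should be at rounding-error level
```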

Can we prove any interesting bounds algebraically from the above?

Here's a constraint on each variable, since the variance is 1. With the 1/n convention above, 1/3 (a1^2 + a2^2 + a3^2) = 1, i.e. a1^2 + a2^2 + a3^2 = 3, so it follows that:
a1^2 + a2^2 + (-a2-a1)^2 = 3
a1^2 + a2^2 + a1^2 + a2^2 + 2a1a2 = 3
2 a1^2 + 2 a2^2 + 2a1a2 = 3

Coming up with a hypothesis, proving it for n=3, and the induction step are left as an exercise for the reader (and the writer too). Possibly a challenging one.
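One candidate hypothesis for n = 3, offered as a guess to test rather than a result: zero-mean vectors in 3 dimensions all lie in a single plane, so the angle AC should be exactly the sum or the difference of the angles AB and BC. That would mean R(A,C) can only take the two values R(A,B) R(B,C) +/- sqrt((1 - R(A,B)^2)(1 - R(B,C)^2)). The sketch below checks this on random data; it is a numerical sanity check with helper names of my own, not a proof.

```python
import math
import random

random.seed(1)

def centered_triple():
    """Random zero-mean vector in R^3."""
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    return [x1, x2, -x1 - x2]

def corr(x, y):
    """Cosine of the angle between two already-centered vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    return dot / math.sqrt(sum(p * p for p in x) * sum(q * q for q in y))

worst = 0.0
for _ in range(1000):
    A, B, C = centered_triple(), centered_triple(), centered_triple()
    r_ab, r_bc, r_ac = corr(A, B), corr(B, C), corr(A, C)
    # max(0.0, ...) guards against tiny negative values from rounding
    root = math.sqrt(max(0.0, (1 - r_ab ** 2) * (1 - r_bc ** 2)))
    lo, hi = r_ab * r_bc - root, r_ab * r_bc + root
    # distance from R(A,C) to the nearer of the two candidate values
    worst = max(worst, min(abs(r_ac - lo), abs(r_ac - hi)))

print(worst)  # should be at rounding-error level if the hypothesis holds
```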