Can one estimate the correlation between two variables when no joint data is observed?

Imagine you have samples from a 2D joint distribution... except that you don't have joint observations, because the x and y observations are unmatched. Your data can be plotted using vertical and horizontal lines (red lines on figure below).

[Figure: the unmatched data; each x reading drawn as a vertical red line, each y reading as a horizontal red line]

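Such a plot takes only a few lines of R (a minimal sketch with freshly simulated data, not the original readings):

x <- rnorm(3)
y <- rnorm(3)
plot(NULL, xlim=range(x)+c(-1,1), ylim=range(y)+c(-1,1), xlab="x", ylab="y")
abline(v=x, col="red") ## one vertical line per x reading
abline(h=y, col="red") ## one horizontal line per y reading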
My surprising realization was that the marginals give some evidence about how the x and y readings are matched (especially strong evidence when the true correlation is close to 1 or -1, i.e. when the mutual information is high).


A highly-regarded collaborator of mine didn't buy this at all, and I couldn't convince him even after several emails. So I wrote some R code for the n=3 case and named it "magic-trick.R". The program generates a large number of coin tosses (+1 or -1) and then "magically" guesses the sequence of tosses using just the empirical marginals.

Y <- toss*X + c*N(0,1)        (signal + noise, if you will)
Y <- randomPermutation(Y)


When c is 0, the trick works 100% of the time.
When c is 1, it works only ~53% of the time.
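For noise levels in between, here is a minimal self-contained sketch (accuracy() is my shorthand, compressing the three-part protocol below into one function; exact percentages vary from run to run):

accuracy <- function(c, nExpts=10000){
  correct <- 0
  for (i in 1:nExpts){
    toss <- sign(rnorm(1)) ## the coin: +1 or -1
    x <- rnorm(3)
    y <- sample(toss*x + c*rnorm(3)) ## generate y, then scramble it
    xs <- sort(x); ys <- sort(y)
    guess <- ifelse(abs((xs[2]-xs[1]) - (ys[2]-ys[1])) <
                    abs((xs[2]-xs[1]) - (ys[3]-ys[2])), +1, -1)
    correct <- correct + (guess == toss)
  }
  correct/nExpts
}
sapply(c(0, 0.25, 0.5, 1, 2), accuracy) ## accuracy decays from 1 toward 0.5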

If you are skeptical, perhaps my first step should be to convince you that in the figure above, a true correlation of 0.9 is much more likely than one of -0.9, e.g. by pointing out that the rectangles on the diagonal are much less eccentric under the correct matching. Looking at the figure below, we see that the true correlation could be +1, but definitely not -1:

[Figure: the readings matched up, with rectangles drawn between matched pairs along the diagonal]

Therefore, the unmatched data gives at least some information about the correlation.
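One way to make "at least some information" precise: the likelihood of the unmatched data under correlation rho is the joint density averaged over all n! candidate matchings. Here is a minimal sketch (permutations(), dbvn() and the example readings are illustrative choices of mine; I use unit-variance marginals for simplicity):

permutations <- function(n){ ## all n! orderings of 1:n, as a list
  if (n == 1) return(list(1))
  out <- list()
  for (p in permutations(n-1))
    for (k in 1:n)
      out[[length(out)+1]] <- append(p, n, after=k-1)
  out
}

dbvn <- function(x, y, rho) ## bivariate normal density, unit variances, correlation rho
  exp(-(x^2 - 2*rho*x*y + y^2) / (2*(1-rho^2))) / (2*pi*sqrt(1-rho^2))

unmatchedLik <- function(x, y, rho) ## average the joint density over all matchings
  mean(sapply(permutations(length(x)), function(s) prod(dbvn(x, y[s], rho))))

x <- c(-1.2, 0.1, 1.3) ## made-up readings, roughly comonotone like the figure
y <- c(-1.0, 0.2, 1.1)
unmatchedLik(x, y, 0.9) / unmatchedLik(x, y, -0.9) ## much greater than 1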

As you get more data, the true matching becomes harder to recover exactly, but (hopefully) the correlation keeps getting easier to estimate (though perhaps at a very slow rate), ignoring computational issues.


Queries:
* Is there an identifiability / consistency result here? I.e., is there a consistent estimator of the correlation, under reasonable assumptions about the joint pdf?
* More ambitiously, which families of joint densities can be estimated from unmatched data?

I suspect that the correlation is identifiable for 2D Gaussians... which would imply that multivariate normals can be estimated from unmatched data (since, given the marginal means and variances, the only remaining parameters are the pairwise correlations).
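A quick numerical probe of this suspicion, reusing unmatchedLik() from the sketch above (n is kept tiny because the sum over matchings has n! terms, and at this sample size the located peak is noisy):

set.seed(1)
rho.true <- 0.7
n <- 6
x <- rnorm(n)
y <- rho.true*x + sqrt(1 - rho.true^2)*rnorm(n) ## unit-variance marginals
y <- sample(y) ## forget the matching
grid <- seq(-0.9, 0.9, by=0.1)
loglik <- sapply(grid, function(r) log(unmatchedLik(x, y, r)))
grid[which.max(loglik)] ## noisy, but usually on the correct side of 0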

I suspect that non-identifiabilities arise when there exist x1 ≠ x2 for which the conditional distributions p(Y|x1) and p(Y|x2) coincide, i.e. when the map x ↦ p(Y|x) is not injective (for instance, if Y = X² + noise, then p(Y|x) = p(Y|-x)).

However, as long as the set of x's sharing any given conditional p(Y|x) has measure zero, some continuity assumptions might still give us full identifiability.




The following R code forgets the truth between the generation and inference steps, as a guarantee that it's not cheating:


##################################################################
### magic-trick.R - inferring correlation from unmatched data ###
### by Gustavo Lacerda www.optimizelife.com ###
##################################################################

## PART 1 - generating data

nExpts <- 10000 ##number of experiments / number of coin tosses

randomPermutationMatrix <- function(n){
  availableIndices <- 1:n
  mat <- matrix(rep(0,n^2), nrow=n) ## n-by-n matrix, initialized with all zeros
  for (i in 1:n){
    r <- ceiling((n-i+1)*runif(1)) ## random integer between 1 and n-i+1
    j <- availableIndices[r]
    availableIndices <- availableIndices[-r] ## now that we used this column, it's not available
    mat[i,j] <- 1 ## row i maps to column j
  }
  mat
}

scramble <- function(v){
  m <- randomPermutationMatrix(length(v))
  as.vector(m %*% v) ## multiplying by a permutation matrix reorders v (equivalent to sample(v), but explicit)
}


n <- 3 ## with 3 points, we can get some information about the correlation; 2 isn't enough (reversing a 2-point matching leaves its single gap unchanged)
coeff <- c()
x <- list()
y <- list()

noiseCoeff <- 1 ## the 'c' from the formula

for (i in 1:nExpts){
  coeff[i] <- sign(rnorm(1)) ## the coin toss: +1 or -1
  x[[i]] <- rnorm(n)
  y[[i]] <- coeff[i] * x[[i]] + noiseCoeff * rnorm(n)
  ##plot(x[[i]],y[[i]],main="'o' indicates the true matching")
  y[[i]] <- scramble(y[[i]]) ## destroy the matching
}

save(x,y, file="empirical-marginals.data")
save(coeff, file="coeff-truth.data")



## PART 2 - MAGIC TRICK
## reads the file 'empirical-marginals.data'; DOES NOT read 'coeff-truth.data'.
## This part runs a simple inference procedure and writes 'coeff-predicted.data', containing a vector of nExpts predictions (each +1 or -1).

rm(list=ls()) ##forget everything
load("empirical-marginals.data") ##remember one thing
nExpts <- length(x)

prediction <- c()
for (i in 1:nExpts){
  xsorted <- sort(x[[i]], decreasing=FALSE)
  ysorted <- sort(y[[i]], decreasing=FALSE)
  xgap1 <- xsorted[2] - xsorted[1]
  ygap1 <- ysorted[2] - ysorted[1]
  ##xgap2 <- xsorted[3] - xsorted[2]
  ygap2 <- ysorted[3] - ysorted[2]

  ## If the toss was +1, the sorted y's follow the sorted x's in the same order,
  ## so xgap1 should resemble ygap1; if it was -1, the order is reversed, so
  ## xgap1 should resemble ygap2. Guess accordingly.
  prediction[i] <- ifelse(abs(xgap1-ygap1) < abs(xgap1-ygap2), +1, -1)
}
save(prediction, file="coeff-predicted.data")



## PART 3 - evaluating: did the magic work?

rm(list=ls()) ##forget everything
load("coeff-truth.data")
load("coeff-predicted.data")
nExpts <- length(coeff)
correct <- 0
for (i in 1:nExpts){
  if (prediction[i] == coeff[i])
    correct <- correct + 1
}

cat("The prediction was correct ", correct, " times out of ", nExpts, ".")


############# MAGIC TRICK IS OVER. THANKS! #############

(no subject)

Date: 2010-01-25 04:56 am (UTC)
From: [identity profile] http://users.livejournal.com/cunctator_/
Sounds cool and potentially applicable. Makes me think of partial identification and matching, currently the hottest topics in econometrics. You may skim http://faculty.wcas.northwestern.edu/%7Eett094/PIE.pdf if you have time.

(no subject)

Date: 2010-01-25 05:01 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
Thanks. Are you in econometrics?

(no subject)

Date: 2010-01-25 05:14 am (UTC)
From: [identity profile] http://users.livejournal.com/cunctator_/
Not yet :) In fact, there is a reference to precisely the inequality you are looking for on page 9. Did you come to this with some application in mind, or just out of general curiosity?

(no subject)

Date: 2010-01-25 05:19 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
general curiosity... though I think it was loosely inspired by another project (which I can't talk about here).

(no subject)

Date: 2010-01-25 05:29 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
That's great! Sec 2.3 describes exactly this setting.

... though I don't immediately understand the result they quote. Is K the pdf?


They reference:

Nelsen (1999): "An Introduction to Copulas", which hopefully has proofs and decent notation. :-)

and if you can read French:
Fréchet (1951) - Les tableaux dont les marges sont données
Edited Date: 2010-01-25 05:38 am (UTC)

(no subject)

Date: 2010-01-25 05:56 am (UTC)
From: [identity profile] http://users.livejournal.com/cunctator_/
It makes sense for a CDF. Here is Nelsen's exposition of the result: http://books.google.com/books?id=B3ONT5rBv0wC&lpg=PP1&ots=4watgYUVu-&dq=An%20Introduction%20to%20Copulas.&hl=en&pg=PA30#v=onepage&q=&f=false
So I hope you'll call me when you win an NSF grant or something :)

(no subject)

Date: 2010-01-27 09:36 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
I'm wondering whether, for any arrangement, the maximum-likelihood matching under this model tends to be obtained by sorting the two samples, namely either:
(a) matching the i-th X with the i-th Y, for every i; or
(b) matching the i-th X with the (n-i)-th Y, for every i.

(no subject)

Date: 2010-01-27 10:05 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
Very related question:

Q: assuming a positive correlation (i.e. toss = +1), on the bipartite diagram representing the matching, does the matching without crossings have the highest likelihood?

A: Yes. http://www.optimizelife.com/wiki/Matchings
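Here is a brute-force check (reusing permutations() and dbvn() from the post body, with a standard bivariate normal standing in for the model; the result is really the rearrangement inequality in disguise):

set.seed(2)
rho <- 0.8
x <- sort(rnorm(4))
y <- sort(rho*x + sqrt(1-rho^2)*rnorm(4)) ## identity matching = the non-crossing one
ps <- permutations(4)
lik <- sapply(ps, function(s) prod(dbvn(x, y[s], rho)))
all(ps[[which.max(lik)]] == 1:4) ## TRUE: the non-crossing matching wins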
Edited Date: 2010-06-28 05:11 am (UTC)
