Can one estimate the correlation between two variables when no joint data is observed?

Imagine you have samples from a 2D joint distribution... except that you don't have joint observations, because the x and y observations are unmatched. Your data can be plotted using vertical and horizontal lines (red lines on figure below).

[Figure: the unmatched data; each x reading drawn as a vertical red line, each y reading as a horizontal red line]

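Such a plot takes only a few lines of R (a minimal sketch with freshly simulated data, not the original readings):

x <- rnorm(3)
y <- rnorm(3)
plot(NULL, xlim=range(x)+c(-1,1), ylim=range(y)+c(-1,1), xlab="x", ylab="y")
abline(v=x, col="red") ## one vertical line per x reading
abline(h=y, col="red") ## one horizontal line per y reading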
My surprising realization was that the marginals give some evidence about how the x and y readings are matched (especially strong evidence when the true correlation is close to 1 or -1, i.e. when the mutual information is high).


A highly-regarded collaborator of mine didn't buy this at all, and I couldn't convince him even after several emails. So I wrote some R code for the n=3 case and named it "magic-trick.R". The program generates a large number of coin tosses (+1 or -1) and then "magically" guesses the sequence of tosses using just the empirical marginals.

Y <- toss*X + c*N(0,1)        (signal + noise, if you will)
Y <- randomPermutation(Y)


When c is 0, the trick works 100% of the time.
When c is 1, it works only ~53% of the time.
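For noise levels in between, here is a minimal self-contained sketch (accuracy() is my shorthand, compressing the three-part protocol below into one function; exact percentages vary from run to run):

accuracy <- function(c, nExpts=10000){
  correct <- 0
  for (i in 1:nExpts){
    toss <- sign(rnorm(1)) ## the coin: +1 or -1
    x <- rnorm(3)
    y <- sample(toss*x + c*rnorm(3)) ## generate y, then scramble it
    xs <- sort(x); ys <- sort(y)
    guess <- ifelse(abs((xs[2]-xs[1]) - (ys[2]-ys[1])) <
                    abs((xs[2]-xs[1]) - (ys[3]-ys[2])), +1, -1)
    correct <- correct + (guess == toss)
  }
  correct/nExpts
}
sapply(c(0, 0.25, 0.5, 1, 2), accuracy) ## accuracy decays from 1 toward 0.5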

If you are skeptical, perhaps my first step should be to convince you that in the figure above, a true correlation of 0.9 is much more likely than one of -0.9, e.g. by pointing out that the rectangles on the diagonal are much less eccentric under the correct matching. Looking at the figure below, we see that the true correlation could be +1, but definitely not -1:

[Figure: the readings matched up, with rectangles drawn between matched pairs along the diagonal]

Therefore, the unmatched data gives at least some information about the correlation.
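One way to make "at least some information" precise: the likelihood of the unmatched data under correlation rho is the joint density averaged over all n! candidate matchings. Here is a minimal sketch (permutations(), dbvn() and the example readings are illustrative choices of mine; I use unit-variance marginals for simplicity):

permutations <- function(n){ ## all n! orderings of 1:n, as a list
  if (n == 1) return(list(1))
  out <- list()
  for (p in permutations(n-1))
    for (k in 1:n)
      out[[length(out)+1]] <- append(p, n, after=k-1)
  out
}

dbvn <- function(x, y, rho) ## bivariate normal density, unit variances, correlation rho
  exp(-(x^2 - 2*rho*x*y + y^2) / (2*(1-rho^2))) / (2*pi*sqrt(1-rho^2))

unmatchedLik <- function(x, y, rho) ## average the joint density over all matchings
  mean(sapply(permutations(length(x)), function(s) prod(dbvn(x, y[s], rho))))

x <- c(-1.2, 0.1, 1.3) ## made-up readings, roughly comonotone like the figure
y <- c(-1.0, 0.2, 1.1)
unmatchedLik(x, y, 0.9) / unmatchedLik(x, y, -0.9) ## much greater than 1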

As you get more data, the true matching becomes harder to recover exactly, but (hopefully) the correlation keeps getting easier to estimate (though perhaps at a very slow rate), ignoring computational issues.


Queries:
* Is there an identifiability / consistency result here? I.e., is there a consistent estimator of the correlation, under reasonable assumptions about the joint pdf?
* More ambitiously, which families of joint densities can be estimated from unmatched data?

I suspect that the correlation is identifiable for 2D Gaussians... which would imply that multivariate normals can be estimated from unmatched data (since, given the marginal means and variances, the only remaining parameters are the pairwise correlations).
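A quick numerical probe of this suspicion, reusing unmatchedLik() from the sketch above (n is kept tiny because the sum over matchings has n! terms, and at this sample size the located peak is noisy):

set.seed(1)
rho.true <- 0.7
n <- 6
x <- rnorm(n)
y <- rho.true*x + sqrt(1 - rho.true^2)*rnorm(n) ## unit-variance marginals
y <- sample(y) ## forget the matching
grid <- seq(-0.9, 0.9, by=0.1)
loglik <- sapply(grid, function(r) log(unmatchedLik(x, y, r)))
grid[which.max(loglik)] ## noisy, but usually on the correct side of 0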

I suspect that non-identifiabilities arise when there exist x1 ≠ x2 for which the conditional distributions p(Y|x1) and p(Y|x2) coincide, i.e. when the map x ↦ p(Y|x) is not injective (for instance, if Y = X² + noise, then p(Y|x) = p(Y|-x)).

However, as long as the set of x's sharing any given conditional p(Y|x) has measure zero, some continuity assumptions might still give us full identifiability.




The following R code forgets the truth between the generation and inference steps, as a guarantee that it's not cheating:


##################################################################
### magic-trick.R - inferring correlation from unmatched data ###
### by Gustavo Lacerda www.optimizelife.com ###
##################################################################

## PART 1 - generating data

nExpts <- 10000 ##number of experiments / number of coin tosses

randomPermutationMatrix <- function(n){
  availableIndices <- 1:n
  mat <- matrix(rep(0,n^2), nrow=n) ## n-by-n matrix, initialized with all zeros
  for (i in 1:n){
    r <- ceiling((n-i+1)*runif(1)) ## random integer between 1 and n-i+1
    j <- availableIndices[r]
    availableIndices <- availableIndices[-r] ## now that we used this column, it's not available
    mat[i,j] <- 1 ## row i maps to column j
  }
  mat
}

scramble <- function(v){
  m <- randomPermutationMatrix(length(v))
  as.vector(m %*% v) ## multiplying by a permutation matrix reorders v (equivalent to sample(v), but explicit)
}


n <- 3 ## with 3 points, we can get some information about the correlation; 2 isn't enough (reversing a 2-point matching leaves its single gap unchanged)
coeff <- c()
x <- list()
y <- list()

noiseCoeff <- 1 ## the 'c' from the formula

for (i in 1:nExpts){
  coeff[i] <- sign(rnorm(1)) ## the coin toss: +1 or -1
  x[[i]] <- rnorm(n)
  y[[i]] <- coeff[i] * x[[i]] + noiseCoeff * rnorm(n)
  ##plot(x[[i]],y[[i]],main="'o' indicates the true matching")
  y[[i]] <- scramble(y[[i]]) ## destroy the matching
}

save(x,y, file="empirical-marginals.data")
save(coeff, file="coeff-truth.data")



## PART 2 - MAGIC TRICK
## reads the file 'empirical-marginals.data'; DOES NOT read 'coeff-truth.data'.
## This part runs a simple inference procedure and writes 'coeff-predicted.data', containing a vector of nExpts predictions (each +1 or -1).

rm(list=ls()) ##forget everything
load("empirical-marginals.data") ##remember one thing
nExpts <- length(x)

prediction <- c()
for (i in 1:nExpts){
  xsorted <- sort(x[[i]], decreasing=FALSE)
  ysorted <- sort(y[[i]], decreasing=FALSE)
  xgap1 <- xsorted[2] - xsorted[1]
  ygap1 <- ysorted[2] - ysorted[1]
  ##xgap2 <- xsorted[3] - xsorted[2]
  ygap2 <- ysorted[3] - ysorted[2]

  ## If the toss was +1, the sorted y's follow the sorted x's in the same order,
  ## so xgap1 should resemble ygap1; if it was -1, the order is reversed, so
  ## xgap1 should resemble ygap2. Guess accordingly.
  prediction[i] <- ifelse(abs(xgap1-ygap1) < abs(xgap1-ygap2), +1, -1)
}
save(prediction, file="coeff-predicted.data")



## PART 3 - evaluating: did the magic work?

rm(list=ls()) ##forget everything
load("coeff-truth.data")
load("coeff-predicted.data")
nExpts <- length(coeff)
correct <- 0
for (i in 1:nExpts){
  if (prediction[i] == coeff[i])
    correct <- correct + 1
}

cat("The prediction was correct ", correct, " times out of ", nExpts, ".")


############# MAGIC TRICK IS OVER. THANKS! #############

(no subject)

Date: 2010-01-25 04:56 am (UTC)
From: [identity profile] http://users.livejournal.com/cunctator_/
Sounds cool and potentially applicable. Makes me think of partial identification and matching, currently the hottest topics in econometrics. You may skim http://faculty.wcas.northwestern.edu/%7Eett094/PIE.pdf if you have time.

(no subject)

Date: 2010-01-25 05:01 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
Thanks. Are you in econometrics?

(no subject)

Date: 2010-01-25 05:14 am (UTC)
From: [identity profile] http://users.livejournal.com/cunctator_/
Not yet :) In fact, there is a reference to precisely the inequality you are looking for on page 9. Did you come to this with some application in mind, or just out of general curiosity?

(no subject)

Date: 2010-01-25 05:19 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
general curiosity... though I think it was loosely inspired by another project (which I can't talk about here).

(no subject)

Date: 2010-01-25 05:29 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
That's great! Sec 2.3 describes exactly this setting.

... though I don't immediately understand the result they quote. Is K the pdf?


They reference:

Nelsen (1999): "An Introduction to Copulas", which hopefully has proofs and decent notation. :-)

and if you can read French:
Fréchet (1951) - Les tableaux dont les marges sont données
Edited Date: 2010-01-25 05:38 am (UTC)

(no subject)

Date: 2010-01-25 05:56 am (UTC)
From: [identity profile] http://users.livejournal.com/cunctator_/
It makes sense for a CDF. Here is Nelsen's exposition of the result: http://books.google.com/books?id=B3ONT5rBv0wC&lpg=PP1&ots=4watgYUVu-&dq=An%20Introduction%20to%20Copulas.&hl=en&pg=PA30#v=onepage&q=&f=false
So I hope you'll call me when you win an NSF grant or something :)

(no subject)

Date: 2010-01-27 09:36 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
I'm wondering whether, for any arrangement, the maximum-likelihood matching under this model tends to be obtained by sorting the two samples, namely either:
(a) matching the i-th X with the i-th Y, for every i; or
(b) matching the i-th X with the (n-i)-th Y, for every i.

(no subject)

Date: 2010-01-27 10:05 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
Very related question:

Q: assuming a positive correlation (i.e. toss = +1), on the bipartite diagram representing the matching, does the matching without crossings have the highest likelihood?

A: Yes. http://www.optimizelife.com/wiki/Matchings
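Here is a brute-force check (reusing permutations() and dbvn() from the post body, with a standard bivariate normal standing in for the model; the result is really the rearrangement inequality in disguise):

set.seed(2)
rho <- 0.8
x <- sort(rnorm(4))
y <- sort(rho*x + sqrt(1-rho^2)*rnorm(4)) ## identity matching = the non-crossing one
ps <- permutations(4)
lik <- sapply(ps, function(s) prod(dbvn(x, y[s], rho)))
all(ps[[which.max(lik)]] == 1:4) ## TRUE: the non-crossing matching wins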
Edited Date: 2010-06-28 05:11 am (UTC)
