human population genetics
Nov. 28th, 2007 05:31 pm
This graph (link via patrissimo) suggests that, using just two features (Factors 1 and 2, extracted from the data), one can linearly separate Spanish, Italian and Portuguese genomes with a high degree of confidence. This data is just asking for an SVM. I find the clean separation surprising: I expected the historical mixing rate to be too high to allow it.
(no subject)
Date: 2007-11-29 05:39 pm (UTC)
Here's a highly simplified version of how it works. Suppose that we've got a cell whose DNA has 10 bits in it, and a 90% correct copying rate per bit. (I'm turning base pairs into 0/1 bits as a further simplification.) Now, suppose you get the following three sequences, and you know that there is one parent and two children. Can you figure out which one is most likely to be the parent, and which two are the sibling children?
a) 0000000000
b) 1000000000
c) 0100000000
It's easy to use the binomial distribution to work out that a) is most likely to be the parent, and b) and c) are children: going from a) to b) or c) requires one copying error each, but if b) or c) were the parent, getting from one to the other would require two copying errors, which is much less likely.
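To make that comparison concrete, here is a minimal sketch in Python. It assumes the 90% rate applies to each bit independently, that both children are copied directly from the parent, and that all three sequences are equally plausible as the parent beforehand. The function names are made up for this illustration.

```python
def hamming(x, y):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(x, y))


def copy_likelihood(parent, child, p_correct=0.9):
    """Probability of getting exactly `child` when copying `parent`,
    assuming each bit is copied correctly with probability p_correct,
    independently of the other bits (an assumption of this sketch)."""
    d = hamming(parent, child)
    n = len(parent)
    return (1 - p_correct) ** d * p_correct ** (n - d)


def parent_likelihoods(seqs, p_correct=0.9):
    """For each sequence, the likelihood that it is the parent and the
    remaining sequences are its directly copied children."""
    result = {}
    for name, parent in seqs.items():
        like = 1.0
        for other, child in seqs.items():
            if other != name:
                like *= copy_likelihood(parent, child, p_correct)
        result[name] = like
    return result


seqs = {"a": "0000000000", "b": "1000000000", "c": "0100000000"}
likes = parent_likelihoods(seqs)
best = max(likes, key=likes.get)

for name in sorted(likes):
    print(name, likes[name])
print("most likely parent:", best)
print("a vs b likelihood ratio:", likes["a"] / likes["b"])
```

Under these assumptions, swapping the parent from a) to b) or c) trades one correctly copied bit (probability 0.9) for one extra error (probability 0.1), so a) comes out about 9 times more likely than either alternative.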
If you need to do this for real, you have to worry about gene frequencies, sexual reproduction, not having a complete set of the ancestors, and similar concerns, which turn it into interesting machine learning/algorithms research. But the basic idea is that by looking at SNPs, you can work out ancestries very effectively.