gusl | human population genetics


This graph (link via patrissimo) suggests that, using these two features (Factors 1 and 2, extracted from the data), one can linearly separate Spanish, Italian and Portuguese genomes with a high degree of confidence. This data is just asking for an SVM. I find this surprising. I expected the historical mixing rate to be too high to allow this.


From: neelk

If you read further down, you'll see them talking about SNPs, which stands for "single nucleotide polymorphisms". A SNP is a one-nucleotide mutation, which arises from imperfections in the DNA copying process. These mutations give a really good way of figuring out an actual family tree, so it would be surprising-ish if they couldn't get results this good.

Here's a highly simplified version of how it works. Suppose that we've got a cell whose DNA has 10 bits in it, and each bit is copied correctly 90% of the time. (I'm turning base pairs into 0/1 bits for a further simplification.) Now, suppose you get the following three sequences, and you know that there is one parent and two children. Can you figure out which one is most likely to be the parent, and which two are the sibling children?

a) 0000000000
b) 1000000000
c) 0100000000

It's easy to use the binomial distribution to work out that a) is most likely to be the parent, and b) and c) are children. With a) as the parent, b) and c) each require just one copying error. With c) as the parent, a) requires one error but b) requires two (and likewise with b) as the parent), which is much less likely.
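For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes each bit is copied independently with 90% accuracy, that both children are copied directly from the parent, and that each sequence is equally likely to be the parent beforehand. The function names are mine, purely for illustration.

```python
# Toy model: three 10-bit "genomes", one of which is the parent of the other two.
SEQS = {"a": "0000000000", "b": "1000000000", "c": "0100000000"}
P_CORRECT = 0.9  # assumed probability that a single bit is copied faithfully


def hamming(x, y):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(bit_x != bit_y for bit_x, bit_y in zip(x, y))


def copy_likelihood(parent, child, p=P_CORRECT):
    """Probability of getting exactly `child` when copying `parent` bit by bit."""
    d = hamming(parent, child)
    return (1 - p) ** d * p ** (len(parent) - d)


def parent_likelihoods(seqs, p=P_CORRECT):
    """For each candidate parent, the probability of producing the other two as children."""
    scores = {}
    for name, parent in seqs.items():
        score = 1.0
        for other, child in seqs.items():
            if other != name:
                score *= copy_likelihood(parent, child, p)
        scores[name] = score
    return scores


scores = parent_likelihoods(SEQS)
best = max(scores, key=scores.get)
for name in sorted(scores):
    print(name, scores[name])
print("most likely parent:", best)
```

Under these assumptions the likelihood with a) as the parent is 9 times the likelihood with b) or c) as the parent, since choosing b) or c) trades one correctly copied bit (0.9) for one extra error (0.1).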

If you need to do this for real, you have to worry about gene frequencies, sexual reproduction, not having a complete set of ancestors, and similar concerns, which turn it into interesting machine learning/algorithms research. But the basic idea is that by looking at SNPs, you can work out ancestries very effectively.