gusl | most of an introduction

Comments are welcome! Especially if some of the biology is wrong.

\title{Chapter 1 -- Introduction: The Biological Problem}

\begin{document}

\section{The basics of molecular biology}

One basic fact of molecular biology is that stretches of DNA, known as genes, encode molecules, known as proteins, through a process of transcription and translation. This flow of information has become known as the ``central dogma of molecular biology'' (F.H.C. Crick 1958; ``On Protein Synthesis'', Symp. Soc. Exp. Biol. XII, 139--163). Proteins, in turn, work together to form protein complexes, each of which accomplishes a cellular function, such as cellular transport or growth (reproduction), producing phenotypes (roughly, ``visible traits''). When two proteins work together, we say that their respective genes belong to the same ``module''.

For a variety of reasons, biologists are interested in knowing which genes work together in modules. Much, if not most, of molecular biology follows a principle of reverse engineering: the easiest way to figure out how a machine works is to try to break it.

Following this tradition are so-called gene-knockout experiments: one produces mutant cells by disabling single genes with simple mutations, and each mutant cell is then placed in a position to found its own cell culture (or die). A mutated gene will, in most cases, either produce a broken protein or no protein at all (otherwise, the gene cannot be said to be ``broken''). As a result, the protein complexes and pathways that depend on that protein lose functionality.

\section{Measurements and phenotypical profiles}

A gene-knockout experiment can be single-knockout, which means that a single gene is mutated, or multiple-knockout, which means that several genes are mutated in the same cell. In either case, the measured phenotypes include things like growth rates (i.e.\ cell reproduction rates) and measurements of how broken the cell's protein sorting is. Furthermore, these measurements are typically made under many experimental conditions, including exposure to a variety of challenging conditions involving temperature(?), salinity(?) and various nasty chemicals.

The collection of these measurements for a given gene is a numeric vector known as the ``gene profile''. If we look at two genes that encode proteins in the same protein complex, they will tend to have similar profiles: knocking out either gene breaks the same protein complex, leaving the cell with similar remaining functionality and similar vulnerabilities.
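
As a minimal sketch of what ``similar profiles'' could mean in practice, the following Python snippet compares profiles using Pearson correlation. The choice of Pearson correlation, the gene names and the measurement values are all assumptions made for illustration; the text above does not commit to a particular similarity measure.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = sqrt(sum(a * a for a in dx)) * sqrt(sum(b * b for b in dy))
    if norm == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / norm

# Hypothetical profiles: one entry per (phenotype, condition) measurement.
profiles = {
    "geneA": [0.9, 0.2, 0.4, 0.1],
    "geneB": [0.8, 0.3, 0.5, 0.2],  # made up to resemble geneA
    "geneC": [0.1, 0.9, 0.2, 0.8],  # made up to differ from both
}

sim_ab = pearson(profiles["geneA"], profiles["geneB"])
sim_ac = pearson(profiles["geneA"], profiles["geneC"])
```

With these made-up numbers, geneA and geneB come out positively correlated and geneA and geneC negatively correlated, which is the pattern one would hope to see for genes inside and outside a shared module.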

\section{Analyzing profiles, clustering}

Pairwise similarities between gene profiles have been used as inputs to hierarchical clustering. However, rather than partitioning the set of genes (into a block structure), such algorithms produce dendrograms (i.e.\ binary trees), which still leaves open the question of where the module boundaries lie.
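
To illustrate why the output of hierarchical clustering is a tree rather than a partition, here is a small sketch of agglomerative clustering in plain Python. Average linkage is an assumption chosen for simplicity, and the gene names and similarity values are hypothetical.

```python
def average_linkage(similarity, genes):
    """Agglomerative clustering; returns a binary tree as nested tuples."""
    # Each cluster is a pair (tree, members); a leaf tree is a gene name.
    clusters = [(g, [g]) for g in genes]

    def sim(a, b):
        # Pair similarities are stored under sorted (gene, gene) keys.
        return similarity[tuple(sorted((a, b)))]

    def linkage(c1, c2):
        # Average similarity over all cross-cluster gene pairs.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(sim(a, b) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the most similar pair of clusters and merge it.
        candidates = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = max(candidates, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical pairwise similarities between four genes.
similarity = {
    ("geneA", "geneB"): 0.9,
    ("geneA", "geneC"): 0.1,
    ("geneA", "geneD"): 0.2,
    ("geneB", "geneC"): 0.2,
    ("geneB", "geneD"): 0.1,
    ("geneC", "geneD"): 0.8,
}

tree = average_linkage(similarity, ["geneA", "geneB", "geneC", "geneD"])
```

The result is a nested pair of pairs. Turning it into modules still requires deciding where to cut the tree, and the algorithm itself does not make that decision.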

There are approaches based on thresholding the similarity values, but besides not being very principled, they have practical drawbacks, such as not using all the data and the difficulty of choosing a threshold (Brumm 2008).

This thesis uses probabilistic models for edge rankings, inspired by Brumm's ``Finding Functional Groups of genes using pairwise relational data: methods and applications'' (Brumm 2008, PhD thesis). It redefines these models to be more consistent with standard Bayesian methodology, defines search spaces, implements search strategies, and evaluates them on real as well as simulated data.

\section{Data structures}

Besides the gene profile data (edge ranking data), we also use gene network data. Edge ranking data contains more information than network data: thresholding a ranking at any rank yields a network, whereas a network by itself does not determine a ranking of its edges.
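
The relationship between the two data structures can be sketched as follows. The representation of a ranking as a Python list of gene pairs (most similar pair first), the function name and the gene names are all assumptions made for illustration.

```python
def network_at_rank(ranked_edges, k):
    """Keep the k top-ranked edges and return them as an adjacency map."""
    if not 0 <= k <= len(ranked_edges):
        raise ValueError("k must be between 0 and the number of edges")
    adjacency = {}
    for u, v in ranked_edges[:k]:
        # Edges are undirected, so record both directions.
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

# Hypothetical edge ranking, most similar pair first.
ranked_edges = [
    ("geneA", "geneB"),
    ("geneC", "geneD"),
    ("geneA", "geneD"),
    ("geneB", "geneC"),
]

sparse = network_at_rank(ranked_edges, 2)  # keep the top 2 edges
dense = network_at_rank(ranked_edges, 3)   # keep the top 3 edges
```

Each choice of rank gives a different network, and each network forgets the order in which its edges appeared in the ranking.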

Cite:
Discovery and Expansion of Gene Modules by Seeking Isolated Groups in a Random Graph Process http://www.plosone.org/article/info:doi/10.1371/journal.pone.0003358

\end{document}


Gustavo Lacerda
