Basic Model for Genetic Linkage Analysis. Lecture #3. Prepared by Dan Geiger.
Using the Maximum Likelihood Approach
The probability of the pedigree data, Pr(data | θ), is a function of the known and unknown recombination fractions, denoted collectively by θ. How can we construct this likelihood function? The maximum likelihood approach is to seek the value of θ that maximizes the likelihood function Pr(data | θ). This value is the ML estimate.
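To make the ML idea concrete before the pedigree likelihood is built, here is a minimal sketch on a toy problem. It assumes a much simpler data model than the one in these slides: k recombinants counted in n phase-known meioses, so the likelihood is θ^k (1-θ)^(n-k). The function names and the grid search are illustrative choices, not part of the lecture.

```python
def likelihood(theta, k, n):
    """Toy likelihood (an assumption): k recombinants in n phase-known meioses."""
    return theta ** k * (1 - theta) ** (n - k)


def ml_estimate(k, n, steps=500):
    """Grid search for the theta in [0, 0.5] that maximizes the toy likelihood."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: likelihood(t, k, n))


theta_hat = ml_estimate(k=2, n=20)  # the maximizer lies near k/n = 0.1
```

The search is restricted to [0, 0.5] because a recombination fraction cannot exceed 1/2; when k/n is larger than that, the estimate sits on the boundary.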
Constructing the Likelihood Function
First, we need to determine the variables that describe the problem. There are many possible choices. Some variables we can observe and some we cannot.
Lijm = maternal allele at locus i of person j. The values of this variable are the possible alleles li at locus i.
Lijf = paternal allele at locus i of person j. The values of this variable are the possible alleles li at locus i (same as for Lijm).
Xij = unordered allele pair at locus i of person j. The values are pairs of ith-locus alleles (li, l'i).
As a starting point, we assume that the data consist of an assignment to a subset of the variables {Xij}. In other words, some (or all) persons are genotyped at some (or all) loci.
What are the relationships among the variables for a specific individual?
[Figure: L11m and L11f each point to X11.]
L11m = maternal allele at locus 1 of person 1.
L11f = paternal allele at locus 1 of person 1.
X11 = unordered allele pair at locus 1 of person 1 (the data).
P(L11m = a) is the frequency of allele a. We use lowercase letters for states, writing P(l11m) for short.
P(x11 | l11m, l11f) = 0 or 1, depending on whether x11 is consistent with the pair (l11m, l11f).
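The two local tables for a single person can be sketched as follows. The allele names and frequencies are hypothetical, and the helper names are made up for illustration; only the 0-or-1 consistency rule and the use of allele frequencies as priors come from the slide.

```python
FREQS = {"A": 0.7, "a": 0.3}  # hypothetical allele frequencies at locus 1


def p_genotype(x, lm, lf):
    """P(x | lm, lf): 1 if the unordered pair x matches the ordered alleles, else 0."""
    return 1.0 if sorted(x) == sorted((lm, lf)) else 0.0


def p_founder_genotype(x, freqs=FREQS):
    """P(x) for a founder: sum over both hidden alleles, weighted by allele frequency."""
    return sum(
        freqs[lm] * freqs[lf] * p_genotype(x, lm, lf)
        for lm in freqs
        for lf in freqs
    )
```

For example, the heterozygous genotype ("A", "a") receives probability 2 × 0.7 × 0.3 under these frequencies, because both orderings of the hidden alleles are consistent with it.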
What are the relationships among the variables across individuals?
[Figure: mother (L11m, L11f, X11), father (L12m, L12f, X12) and offspring (L13m, L13f, X13).]
P(l13m | l11m, l11f) = 1/2 if l13m = l11m or l13m = l11f (and 1 if l13m = l11m = l11f)
P(l13m | l11m, l11f) = 0 otherwise
First attempt: correct but not efficient, as we shall see.
Probabilistic model for two loci
[Figure: the model for locus 1 (L11m, L11f, L12m, L12f, X11, X12, L13m, L13f, X13) next to the model for locus 2 (L21m, L21f, L22m, L22f, X21, X22, L23m, L23f, X23).]
L23m depends on whether L13m got its value from L11m or L11f, on whether a recombination occurred, and on the values of L21m and L21f. This is quite complex.
Adding a selector variable
[Figure: L11m, L11f and S13m point to L13m; L11m and L11f point to X11.]
S13m = selector of the maternal allele at locus 1 of person 3, with P(s13m) = 1/2.
L13m = maternal allele at locus 1 of person 3 (the offspring).
Selector variables Sijm are 0 or 1 depending on whose allele (the mother's maternal or paternal one) is transmitted to offspring j at locus i.
P(l13m | l11m, l11f, S13m = 0) = 1 if l13m = l11m
P(l13m | l11m, l11f, S13m = 1) = 1 if l13m = l11f
P(l13m | l11m, l11f, s13m) = 0 otherwise
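A small sketch of the selector table, with hypothetical function names. It also checks that summing out the selector (with P(s) = 1/2) gives back the selector-free transmission rule of the first attempt.

```python
def p_transmit(child, lm, lf, s):
    """P(l13m | l11m, l11f, s13m): the selector picks the transmitted allele."""
    source = lm if s == 0 else lf
    return 1.0 if child == source else 0.0


def p_transmit_marginal(child, lm, lf):
    """Sum out the selector, using P(s13m) = 1/2 for each value."""
    return sum(0.5 * p_transmit(child, lm, lf, s) for s in (0, 1))
```

With a heterozygous mother the marginal is 1/2 for each of her alleles; with a homozygous mother it is 1 for her single allele and 0 for any other.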
Probabilistic model for two loci
[Figure: the model for locus 1 and the model for locus 2, each extended with selector variables: S13m and S13f at locus 1, S23m and S23f at locus 2.]
Probabilistic Model for Recombination
[Figure: the two-locus model, with the selectors linked across loci (S13m to S23m, S13f to S23f).]
θ2 is the recombination fraction between loci 2 and 1.
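The slide's own probability table did not survive extraction. The sketch below assumes the usual reading of a recombination fraction: between adjacent loci, a selector switches value with probability θ and keeps its value with probability 1 - θ. The function name is hypothetical.

```python
def p_selector(s_next, s_prev, theta):
    """P(s23m | s13m, theta), assuming a switch (recombination) has probability theta."""
    return 1.0 - theta if s_next == s_prev else theta


# The 2 x 2 table for one meiosis, rows indexed by s_prev and columns by s_next.
theta = 0.1
table = [[p_selector(nxt, prev, theta) for nxt in (0, 1)] for prev in (0, 1)]
```

At θ = 1/2 every entry equals 1/2, so the two loci are inherited independently, which is the unlinked case.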
Constructing the likelihood function I
[Figure: L11m, L11f, X11, S13m, L13m. X11 is the observed variable; all other variables are not observed (hidden).]
Joint probability:
P(l11m, l11f, x11, s13m, l13m) = P(l11m) P(l11f) P(x11 | l11m, l11f) P(s13m) P(l13m | s13m, l11m, l11f)
Probability of the data (sum over all states of all hidden variables):
Prob(data) = P(x11) = Σ_{l11m} Σ_{l11f} Σ_{s13m} Σ_{l13m} P(l11m, l11f, x11, s13m, l13m)
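The summation over hidden variables can be carried out by brute force for this five-variable network. The sketch below assumes two alleles with hypothetical frequencies; the names `joint` and `prob_data` are illustrative.

```python
from itertools import product

FREQS = {"A": 0.7, "a": 0.3}  # hypothetical allele frequencies


def joint(lm, lf, x, s, child):
    """P(l11m) P(l11f) P(x11 | l11m, l11f) P(s13m) P(l13m | s13m, l11m, l11f)."""
    p_x = 1.0 if sorted(x) == sorted((lm, lf)) else 0.0
    p_child = 1.0 if child == (lm if s == 0 else lf) else 0.0
    return FREQS[lm] * FREQS[lf] * p_x * 0.5 * p_child


def prob_data(x):
    """P(x11): sum the joint over all states of the four hidden variables."""
    alleles = list(FREQS)
    return sum(
        joint(lm, lf, x, s, child)
        for lm, lf, s, child in product(alleles, alleles, (0, 1), alleles)
    )
```

Because the unobserved offspring allele and its selector sum out to 1, `prob_data` reduces to the genotype frequency of person 1, which is a useful sanity check on the factorization.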
Constructing the likelihood function II
The joint probability is the product over all local probability tables:
P(l11m, l11f, x11, l12m, l12f, x12, l13m, l13f, x13, l21m, l21f, x21, l22m, l22f, x22, l23m, l23f, x23, s13m, s13f, s23m, s23f | θ2) = P(l11m) P(l11f) P(x11 | l11m, l11f) … P(s13m) P(s13f) P(s23m | s13m, θ2) P(s23f | s13f, θ2)
Probability of the data (sum over all states of all hidden variables):
Prob(data | θ2) = P(x11, x12, x13, x21, x22, x23) = Σ_{l11m, l11f, …, s23f} [P(l11m) P(l11f) P(x11 | l11m, l11f) … P(s13m) P(s13f) P(s23m | s13m, θ2) P(s23f | s13f, θ2)]
The result is a function of the recombination fraction θ2. The ML estimate is the value of θ2 that maximizes this function.
The Disease Locus I
[Figure: L11m, L11f, X11, S13m, L13m, with the phenotype Y11 attached to X11.]
Phenotype variables Yij are 0 or 1 depending on whether a phenotypic trait associated with locus i of person j is observed, e.g., sick versus healthy. For example, a model of a perfectly recessive disease yields the penetrance probabilities:
P(y11 = sick | X11 = (a,a)) = 1
P(y11 = sick | X11 = (A,a)) = 0
P(y11 = sick | X11 = (A,A)) = 0
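The penetrance table for the perfectly recessive model can be written as a small function. The name `penetrance_recessive` and the string labels "sick" and "healthy" are illustrative choices.

```python
def penetrance_recessive(y, x, disease_allele="a"):
    """P(y | x) for a perfectly recessive disease: sick only with two disease alleles."""
    p_sick = 1.0 if all(allele == disease_allele for allele in x) else 0.0
    return p_sick if y == "sick" else 1.0 - p_sick
```

A less clear-cut disease model would replace the 0 and 1 entries with intermediate penetrance values; the structure of the table stays the same.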
The Disease Locus II
Note that in this model we assume the phenotype/disease depends only on the alleles of one locus. Also, we did not model levels of sickness.
Introducing a tentative disease locus
[Figure: the two-locus model with a marker locus (locus 1) and a disease locus (locus 2); phenotype variables Y21, Y22, Y23 are attached to the disease locus. Assume sick means xij = (a,a).]
The recombination fraction is unknown. Finding it can help determine whether a gene causing the disease lies in the vicinity of the marker locus.
Locus-by-Locus Summation Order
[Figure: the variables of locus i: Li1m, Li1f, Xi1, Li2m, Li2f, Xi2, Li3m, Li3f, Xi3, Si3m, Si3f.]
Sum over the locus i variables before summing over the locus i+1 variables. Within a locus, sum over the allele variables (Lijt, orange in the figure) before summing over the selector variables (Sijt). This order yields a Hidden Markov Model (HMM).
Hidden Markov Models in General
[Figure: a chain of hidden states S1, S2, S3, …, Si-1, Si, Si+1, each emitting an observation R1, R2, R3, …, Ri-1, Ri, Ri+1.]
The figure depicts the factorization:
P(s1, …, sm, r1, …, rm) = P(s1) P(r1 | s1) Π_{i=2..m} P(si | si-1) P(ri | si)
Application in communication: the message sent is (s1, …, sm) but we receive (r1, …, rm). What is the most likely message sent?
Application in speech recognition: the word said is (s1, …, sm) but we recorded (r1, …, rm). What is the most likely word said?
Application in genetic linkage analysis: to be discussed now.
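For linkage analysis the quantity of interest is the probability of the observations, summed over all hidden state sequences. A minimal sketch of the standard forward pass that computes it is given below; the argument layout (lists of lists indexed by state) is an assumption made for this example. Finding the most likely hidden sequence, as in the communication and speech examples, would replace the inner sum with a maximum.

```python
def forward_likelihood(prior, transitions, emissions):
    """P(r1, ..., rm) for an HMM, summing over all hidden state sequences.

    prior[s]             = P(S1 = s)
    transitions[i][p][s] = P(S_{i+2} = s | S_{i+1} = p), one matrix per adjacent pair
    emissions[i][s]      = P(r_{i+1} | S_{i+1} = s), evaluated at the observed value
    """
    n_states = len(prior)
    alpha = [prior[s] * emissions[0][s] for s in range(n_states)]
    for i in range(1, len(emissions)):
        step = transitions[i - 1]
        alpha = [
            sum(alpha[p] * step[p][s] for p in range(n_states)) * emissions[i][s]
            for s in range(n_states)
        ]
    return sum(alpha)


# Two states, two positions, and a transition that switches with probability 0.1.
T = [[0.9, 0.1], [0.1, 0.9]]
same = forward_likelihood([0.5, 0.5], [T], [[1.0, 0.0], [1.0, 0.0]])
switched = forward_likelihood([0.5, 0.5], [T], [[1.0, 0.0], [0.0, 1.0]])
```

The cost grows linearly with the number of positions and quadratically with the number of states, instead of exponentially as in a brute-force sum over sequences.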
Hidden Markov Model in Our Case
[Figure: a chain of hidden states S1, S2, S3, …, Si-1, Si, Si+1, each emitting the data of its locus: X1, X2, X3, …, Xi, Xi+1, with Yi-1 in place of Xi-1 for the disease locus.]
The compound variable Si = (Si,1,m, …, Si,2n,f) is called the inheritance vector. It has 2^(2n) states, where n is the number of persons who have parents in the pedigree (non-founders). The compound variable Xi = (Xi,1,m, …, Xi,2n,f) is the data regarding locus i. Similarly, for the disease locus we use Yi.
To specify the HMM we need to write down the transition matrices from Si-1 to Si and the matrices P(xi | Si). Note that these quantities have already been defined implicitly.
The transition matrix (the Kronecker product)
Recall that the selector of a single meiosis changes value between adjacent loci with probability θ, the recombination fraction. Note that θ depends on i, but this dependence is omitted.
In our example, where we have one non-founder (n = 1), the transition probability table has size 4 × 4 = 2^(2n) × 2^(2n), encoding the four options of recombination/non-recombination for the two parental meioses.
For n non-founders, the transition matrix is the n-fold Kronecker product of this 4 × 4 table.
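The construction can be sketched in a few lines. The sketch assumes the single-meiosis table has 1 - θ on the diagonal and θ off it, and it builds the full matrix as a 2n-fold Kronecker product of that 2 × 2 table, which is the same as the n-fold product of the 4 × 4 per-person table. The helper names are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def meiosis_matrix(theta):
    """Assumed 2 x 2 table for one meiosis: stay with 1 - theta, switch with theta."""
    return [[1.0 - theta, theta], [theta, 1.0 - theta]]


def transition_matrix(theta, n_nonfounders):
    """Transition matrix between inheritance vectors at adjacent loci."""
    result = [[1.0]]
    for _ in range(2 * n_nonfounders):  # two meioses per non-founder
        result = kron(result, meiosis_matrix(theta))
    return result


T = transition_matrix(0.1, n_nonfounders=1)  # 4 x 4 table for the trio example
```

In the n = 1 table, the entry for "no recombination in either meiosis" is (1 - θ)^2, the two entries with one recombination are θ(1 - θ), and the entry with two recombinations is θ^2.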
Probability of data in one locus given an inheritance vector
[Figure: the model for locus 2, with L21m, L21f, L22m, L22f, X21, X22, S23m, S23f, L23m, L23f, X23.]
P(x21, x22, x23 | s23m, s23f) =
Σ_{l21m, l21f, l22m, l22f} P(l21m) P(l21f) P(l22m) P(l22f) P(x21 | l21m, l21f) P(x22 | l22m, l22f) Σ_{l23m, l23f} P(x23 | l23m, l23f) P(l23m | l21m, l21f, s23m) P(l23f | l22m, l22f, s23f)
The last five terms are always zero or one; that is, they are indicator functions.
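A sketch of this computation for the trio is shown below, assuming two alleles with hypothetical frequencies and fully genotyped persons. Because the transmission terms are indicators, the inner sum over the offspring alleles collapses to the single pair selected by the inheritance vector. The function names are illustrative.

```python
from itertools import product

FREQS = {"A": 0.7, "a": 0.3}  # hypothetical allele frequencies at this locus


def consistent(x, lm, lf):
    """Indicator P(x | lm, lf): does the unordered pair x match the ordered alleles?"""
    return sorted(x) == sorted((lm, lf))


def p_locus_data(x_mother, x_father, x_child, s_m, s_f):
    """P(x21, x22, x23 | s23m, s23f) by summing over the founders' ordered alleles."""
    alleles = list(FREQS)
    total = 0.0
    for l1m, l1f, l2m, l2f in product(alleles, repeat=4):
        if not (consistent(x_mother, l1m, l1f) and consistent(x_father, l2m, l2f)):
            continue
        # The selectors determine the child's alleles, so the inner sum has one term.
        l3m = l1m if s_m == 0 else l1f
        l3f = l2m if s_f == 0 else l2f
        if consistent(x_child, l3m, l3f):
            total += FREQS[l1m] * FREQS[l1f] * FREQS[l2m] * FREQS[l2f]
    return total


p = p_locus_data(("A", "a"), ("A", "A"), ("A", "a"), s_m=0, s_f=0)
```

Evaluating this function for each of the 2^(2n) inheritance vectors gives one column of the matrix P(xi | Si) needed by the HMM.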