Explore sequence motifs, their representations, k-mer counting, and statistical modeling in bioinformatics, with motivating examples. Understand generative and discriminative models, and the trade-offs of sequence motif models among simplicity, interpretability, and generality. Learn about motif representations such as the position weight matrix and sequence logos, and about methods such as the PWM matching score, k-mers, and gapped k-mers.
Lecture 6. Sequence Motif Models and Counting
The Chinese University of Hong Kong
CSCI3220 Algorithms for Bioinformatics | Kevin Yip, CSE, CUHK | Fall 2019
Lecture outline
• Sequence motifs
  • Biological motivations
  • Representations
  • k-mer counting
• Introduction to statistical modeling
  • Motivating examples
  • Generative and discriminative models
  • Classification and regression
  • Example: Naïve Bayes classifier
Part 1: Sequence Motifs
Sequence motifs
• Many biological activities are facilitated by particular sequence patterns
  • The restriction enzyme EcoRI recognizes the DNA pattern GAATTC and cuts the DNA as follows, leaving single-stranded overhangs:
      5'-G        AATTC-3'
      3'-CTTAA        G-5'
  • The human protein GATA3 binds DNA at regions that exhibit the pattern AGTAAGA, where the G at position 6 can also be A, and the A at position 7 can also be G or C
Sequence motifs
• In general, small recurrent patterns on biological sequences with particular functions are called sequence motifs
• We need models to represent the motifs, usually built from some examples. Goals:
  • These models do not miss true occurrences (i.e., have a low false negative rate), and do not include false occurrences (i.e., have a low false positive rate)
  • These models should take uncertainty into account
  • These models should be as simple as possible, for the sake of:
    • Computability
    • Interpretability
    • Generality
Motif representations
• Suppose we have the following sequences known to be bound by a protein:
  • CACAAAC
  • CACAAAT
  • CGCAAAC
  • CACAAAC
• Consensus sequence: CACAAAC
  • Problem: information loss
• Degenerate sequence in IUPAC (International Union of Pure and Applied Chemistry) code (see http://www.bio-soft.net/sms/iupac.html): CRCAAAY
  • R = A or G; Y = C or T
Example source: http://conferences.computer.org/bioinformatics/CSB2003/NOTES/Liu_Color.pdf
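As a concrete illustration, here is a minimal Python sketch that derives both representations from the example sequences above (the IUPAC lookup table and the function name are written out here for illustration; they are not from the lecture):

```python
from collections import Counter

# IUPAC degenerate codes for each non-empty subset of {A, C, G, T}
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def consensus_and_degenerate(seqs):
    """Column-wise consensus (most frequent base) and IUPAC degenerate code."""
    consensus, degenerate = [], []
    for column in zip(*seqs):                      # iterate over positions
        counts = Counter(column)
        consensus.append(counts.most_common(1)[0][0])
        degenerate.append(IUPAC[frozenset(column)])
    return "".join(consensus), "".join(degenerate)

seqs = ["CACAAAC", "CACAAAT", "CGCAAAC", "CACAAAC"]
print(consensus_and_degenerate(seqs))  # ('CACAAAC', 'CRCAAAY')
```

The consensus keeps only the most frequent base per column and thus loses information; the degenerate sequence keeps the full set of observed bases per column.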
Motif representations
• Suppose we have the following aligned TFBS sequences:
  • CACAAAAC
  • CACAAA_T
  • CGCAAAAC
  • CACAAA_C
• Regular expression (see http://en.wikipedia.org/wiki/Regular_expression for syntax)
  • E.g., C[AG]CA{3,4}[CT]
  • Are there other possible regular expressions?
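The regular expression can be checked directly with Python's re module; a small sketch (the extra non-matching test sequence is made up for illustration):

```python
import re

# C, then A or G, then C, then 3-4 A's, then C or T
motif = re.compile(r"C[AG]CA{3,4}[CT]")

for seq in ["CACAAAAC", "CACAAAT", "CGCAAAAC", "CACAAAC", "CACAAAAAC"]:
    print(seq, bool(motif.fullmatch(seq)))
# The four training sequences match; CACAAAAAC (five A's) does not
```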
Motif representations
• Position weight matrix (PWM): a table giving the frequency (or probability) of each nucleotide at each position, built from aligned sites such as the ten sequences below
• Pseudo-counts: add a small number to each count, to alleviate problems due to small sample size
• Aligned sites:
  ATGGCATG  AGGGTGCG  ATCGCATG  TTGCCACG  ATGGTATT
  ATTGCACG  AGGGCGTT  ATGACATG  ATGGCATG  ACTGGATG
Example source: http://conferences.computer.org/bioinformatics/CSB2003/NOTES/Liu_Color.pdf
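The following is a minimal sketch of how such a matrix can be estimated from the aligned sites with pseudo-counts (the pseudo-count value of 1 and the function name are illustrative choices, not from the lecture):

```python
def build_pwm(seqs, pseudocount=1.0):
    """Estimate per-position nucleotide probabilities with pseudo-counts."""
    n = len(seqs)
    pwm = []
    for column in zip(*seqs):                   # one column per motif position
        counts = {b: pseudocount for b in "ACGT"}
        for base in column:
            counts[base] += 1
        total = n + 4 * pseudocount             # 4 = alphabet size
        pwm.append({b: counts[b] / total for b in "ACGT"})
    return pwm

sites = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]
pwm = build_pwm(sites)
print({b: round(p, 2) for b, p in pwm[0].items()})
# position 1: {'A': 0.71, 'C': 0.07, 'G': 0.07, 'T': 0.14} - A dominates
```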
Motif representations
• Sequence logo
  • Nucleotide with the highest probability on top
  • Total height of the nucleotides at the i-th position is the information content h_i = 2 - (H_i + e_n), where H_i = -Σ_x p_i,x log2 p_i,x is the entropy of the position and e_n = 3/(2n ln 2) is a small-sample correction
    • p_i,x: probability of character x at position i
    • n: number of sequences
  • Height of nucleotide x = p_i,x × h_i
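A sketch of the height computation under the formula above (the example column and the value of n are made up for illustration):

```python
import math

def logo_heights(pwm, n):
    """Per-position letter heights (in bits), following the formula above."""
    e_n = 3 / (2 * n * math.log(2))           # small-sample correction
    heights = []
    for col in pwm:                            # col: {base: probability}
        H = -sum(p * math.log2(p) for p in col.values() if p > 0)
        h = max(0.0, 2 - (H + e_n))            # total column height h_i
        heights.append({b: p * h for b, p in col.items()})
    return heights

# One hypothetical column where A dominates, estimated from n = 10 sequences
col = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
print({b: round(v, 2) for b, v in logo_heights([col], 10)[0].items()})
# A gets by far the tallest letter in this column
```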
Using a motif
• Consensus sequence: predict “Yes” if a sequence matches the consensus sequence; “No” otherwise
• Regular expression: predict “Yes” if a sequence can be generated by the regular expression; “No” otherwise
• Position weight matrix: compute a matching score for a sequence, and consider a sequence more likely to belong to the class if it has a higher score
PWM matching score
• Suppose the PWM of the binding sites of a protein is as follows (matrix not reproduced in this text version)
• For the sequence ATGGGGTG, the likelihood is 0.9 × 0.7 × 0.7 × 0.8 × 0.1 × 0.2 × 0.7 × 0.8 = 0.00395136
• Compute the odds against the background probabilities of the four nucleotides: 0.00395136 / (p_A × p_G^5 × p_T^2)
• Usually take log2 of the odds as the final score
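A hedged sketch of the log2-odds score: since the slide's full matrix is not reproduced here, the PWM below only fills in the entries the example sequence needs (taken from the product above), and a uniform background is assumed for illustration:

```python
import math

def log_odds_score(seq, pwm, background):
    """log2( Pr(seq | PWM) / Pr(seq | background) )"""
    score = 0.0
    for i, base in enumerate(seq):
        score += math.log2(pwm[i][base] / background[base])
    return score

# Hypothetical partial PWM: only the probabilities needed for ATGGGGTG
pwm = [{"A": 0.9}, {"T": 0.7}, {"G": 0.7}, {"G": 0.8},
       {"G": 0.1}, {"G": 0.2}, {"T": 0.7}, {"G": 0.8}]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform

print(round(log_odds_score("ATGGGGTG", pwm, background), 2))
# = log2(0.00395136 / 0.25**8) ≈ 8.02
```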
k-mers
• Another way to represent sequence motifs: k-mers
• Training examples:
  • ACCGCT
  • TACCGG
  • TTACCA
  • AACCTG
• One vague way to summarize: “This motif is AC- and CC-rich”
k-mers
• Considerations:
  • Value of k
    • Too small: capturing only local patterns
    • Too large: too restrictive; too many possible k-mers (computationally difficult)
  • Allowing wildcards or not
    • g-gapped k-mer: among the g+k positions, only k of them are considered and the remaining g positions are ignored (here “gapped” means unspecified positions in the pattern, i.e., wildcards; it does not mean indels)
  • Representation and final use of the k-mers
Problem to study here
• Using g-gapped k-mer counts as features, compute the similarity of two sequences as their inner product
• Example (k=2, g=1)
  • Full set of g-gapped k-mers (* is the wildcard character, which can match any nucleotide):
    • *AA, *AC, *AG, ..., *TT
    • A*A, A*C, A*G, ..., T*T
    • AA*, AC*, AG*, ..., TT*
  • Number of possible g-gapped k-mers = C(k+g, k) × 4^k = C(3, 2) × 4^2 = 48
Problem to study here
• Example (k=2, g=1) (cont’d)
  • Sequence s1 = ACCGCT
  • Sequence s2 = TACCGG
  • Similarity between s1 and s2: sim(s1, s2) = 0×0 + 0×1 + 0×0 + 0×0 + 0×0 + 1×1 + 1×1 + ... = 8 (see Excel file, also try to verify by yourself)
• These similarity values can help separate sequences that belong to a class from those that do not
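One way to verify the value 8 is to build the full 48-dimensional count vectors by brute force and take their inner product; a minimal sketch (the function name is illustrative):

```python
from itertools import combinations, product

def gapped_kmer_vector(seq, k=2, g=1):
    """Dense count vector over all C(k+g,k) * 4^k g-gapped k-mers."""
    L = k + g
    # enumerate every pattern: choose k positions to keep, fill with bases
    vector = {}
    for keep in combinations(range(L), k):
        for letters in product("ACGT", repeat=k):
            pat = ["*"] * L
            for pos, ch in zip(keep, letters):
                pat[pos] = ch
            vector["".join(pat)] = 0
    # slide a window of length k+g and increment the matching patterns
    for i in range(len(seq) - L + 1):
        window = seq[i:i + L]
        for keep in combinations(range(L), k):
            pat = ["*"] * L
            for pos in keep:
                pat[pos] = window[pos]
            vector["".join(pat)] += 1
    return vector

v1 = gapped_kmer_vector("ACCGCT")
v2 = gapped_kmer_vector("TACCGG")
print(len(v1), sum(v1[p] * v2[p] for p in v1))  # 48 8
```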
Time complexity analysis
• For two sequences each with n characters, using the brute-force way of calculation:
  • Filling each table takes (n-g-k+1) × C(g+k, k) = 3n-6 additions when k=2 and g=1
    • Linear w.r.t. sequence length
    • C(g+k, k) can be large when g is large
  • Computing the inner product takes C(k+g, k) × 4^k = 48 multiplications, followed by 47 additions
    • Exponential w.r.t. k
Speeding up the calculations
• Ideas:
  • The exponential time complexity can be avoided only if sim(s1, s2) can be computed without filling in the two whole tables
  • When k is large, the tables contain many zeroes that can be ignored
Using the ideas
• Example (k=2, g=1) (cont’d)
  • Sequence s1 = ACCGCT
    • New representation: {*CC:1, *CG:1, *CT:1, *GC:1, A*C:1, C*C:1, C*G:1, G*T:1, AC*:1, CC*:1, CG*:1, GC*:1}
  • Sequence s2 = TACCGG
    • New representation: {*AC:1, *CC:1, *CG:1, *GG:1, A*C:1, C*G:2, T*C:1, AC*:1, CC*:1, CG*:1, TA*:1}
  • Looking for common g-gapped k-mers and multiplying the corresponding counts: sim(s1, s2) = 1 (due to *CC) + 1 (*CG) + 1 (A*C) + 2 (C*G) + 1 (AC*) + 1 (CC*) + 1 (CG*) = 8
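The sparse representation can be produced with a hash table so that only the g-gapped k-mers that actually occur are stored; a sketch using Python's Counter:

```python
from collections import Counter
from itertools import combinations

def sparse_gapped_kmers(seq, k=2, g=1):
    """Sparse counts of only the g-gapped k-mers that occur in seq."""
    L = k + g
    counts = Counter()
    for i in range(len(seq) - L + 1):
        window = seq[i:i + L]
        for keep in combinations(range(L), k):
            counts["".join(window[p] if p in keep else "*"
                           for p in range(L))] += 1
    return counts

v1 = sparse_gapped_kmers("ACCGCT")
v2 = sparse_gapped_kmers("TACCGG")
# inner product over the common keys only
print(sum(v1[p] * v2[p] for p in v1.keys() & v2.keys()))  # 8
```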
Time complexity analysis
• Suppose the new representations can be produced with the help of hash tables; the final calculation then involves a linear scan of the two lists, each with at most (n-g-k+1) × C(g+k, k) entries
  • (6-1-2+1) × C(3, 2) = 12 entries when n=6, k=2, g=1
• Can be slow when g and k are large
  • For example, with k=6, g=8, C(g+k, k) = 3003
Speeding up further
• Another idea: some g-gapped k-mers are related, and their corresponding calculations can be grouped
• For example, s1[3-5] = CGC and s2[4-6] = CGG
  • g-gapped k-mers involved:
    • s1[3-5]: {*GC, C*C, CG*}
    • s2[4-6]: {*GG, C*G, CG*}
  • Similarity between s1 and s2 due to these sub-sequences: 1 (due to CG*)
Speeding up further
• Given two length-(k+g) sub-sequences from s1 and s2 (e.g., CGC and CGG), how much do they contribute to sim(s1, s2)?
• Important observation: the answer depends only on their number of mismatches
  • In this case, there is one mismatch between CGC and CGG, and the corresponding contribution to the similarity between s1 and s2 is 1
  • In the same way, between s1[2-4] = CCG and s2[4-6] = CGG, since they have one mismatch, the contribution is also 1
Computing the contribution
• For any two length-(k+g) sub-sequences s1[i1-j1] and s2[i2-j2] with m mismatches:
  • There are in total C(k+g, k) ways to generate g-gapped k-mers from each of them, by choosing k non-gapped positions
  • For a particular choice of the k positions, if they do not involve any of the mismatch positions, their contribution to sim(s1, s2) is 1
  • Otherwise, their contribution is 0
  • Therefore, their total contribution to sim(s1, s2) is the number of ways to choose the k positions such that none of them is a mismatch position
  • The total number of ways is C(k+g-m, k) if k+g-m ≥ k (i.e., g ≥ m); 0 otherwise
Computing the contribution
• A bigger example: suppose k=2, g=2
  • s1[2-5] = CCGC
  • s2[3-6] = CCGG
• Previous way of calculating their contribution to sim(s1, s2):
  • g-gapped k-mers of s1[2-5]: {**GC, *C*C, *CG*, C**C, C*G*, CC**}
  • g-gapped k-mers of s2[3-6]: {**GG, *C*G, *CG*, C**G, C*G*, CC**}
  • Contribution (number of common g-gapped k-mers): 3
• New way of calculating their contribution to sim(s1, s2):
  • Number of mismatches between s1[2-5] and s2[3-6]: 1
  • Contribution: C(k+g-m, k) = C(3, 2) = 3
Complete algorithm
• Extract all (k+g)-mers from s1 and s2
• For each pair of (k+g)-mers taken from s1 and s2 respectively, compute their contribution to sim(s1, s2)
• Sum all these contributions to get the final value of sim(s1, s2)
Complete example
• Back to k=2, g=1
  • Sequence s1 = ACCGCT
  • Sequence s2 = TACCGG
• Extract all 3-mers
  • s1: {ACC, CCG, CGC, GCT}
  • s2: {TAC, ACC, CCG, CGG}
• For each pair of 3-mers, compute their contribution to sim(s1, s2)
• Number of mismatches (rows: s1’s 3-mers; columns: s2’s 3-mers):
        TAC  ACC  CCG  CGG
  ACC    2    0    2    3
  CCG    3    2    0    1
  CGC    2    2    2    1
  GCT    3    2    2    3
• Contributions to sim(s1, s2), i.e., C(3-m, 2) if m ≤ 1 and 0 otherwise:
        TAC  ACC  CCG  CGG
  ACC    0    3    0    0
  CCG    0    0    3    1
  CGC    0    0    0    1
  GCT    0    0    0    0
• Therefore, sim(s1, s2) = 3 + 3 + 1 + 1 = 8
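The whole mismatch-based algorithm fits in a few lines; this sketch reproduces sim(s1, s2) = 8 for the example above (math.comb computes the binomial coefficient C(a, b)):

```python
from math import comb

def gapped_kmer_similarity(s1, s2, k=2, g=1):
    """Sum C(k+g-m, k) over all pairs of (k+g)-mers from s1 and s2
    with m <= g mismatches (other pairs contribute nothing)."""
    L = k + g
    w1 = [s1[i:i + L] for i in range(len(s1) - L + 1)]
    w2 = [s2[i:i + L] for i in range(len(s2) - L + 1)]
    sim = 0
    for a in w1:
        for b in w2:
            m = sum(x != y for x, y in zip(a, b))   # number of mismatches
            if m <= g:
                sim += comb(k + g - m, k)           # contribution of this pair
    return sim

print(gapped_kmer_similarity("ACCGCT", "TACCGG"))   # 8, matching the tables
```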
Time complexity analysis
• For each length-n sequence, there are n-k-g+1 sub-sequences of length k+g
• Therefore, there are (n-k-g+1)^2 pairs of (k+g)-mers from the two sequences
• For each pair, the number of mismatches can be computed by scanning the two (k+g)-mers once
  • Can speed up using bitwise XOR operations
• The total amount of time required is O((k+g)(n-k-g+1)^2)
  • Depends mainly on n, and much less on k and g
Speeding up even further?
• It is possible to avoid considering all (k+g)-mer pairs from the two sequences, and examine only those with at most g mismatches (pairs with more mismatches contribute nothing)
• Won’t go into the details here (see further readings)
Image credit: Ghandi et al., PLOS Computational Biology 10(7):e1003711, (2014)
Part 2: Introduction to Statistical Modeling
Statistical modeling
• We have studied many biological concepts in this course
  • Genes, exons, introns, ...
• We want to provide a description of a concept by means of some observable features
• Sometimes it can be (more or less) an exact rule:
  • The enzyme EcoRI cuts the DNA if and only if it sees the sequence GAATTC
• In most cases it is not exact:
  • If a sequence (1) starts with ATG, (2) ends with TAA, TAG or TGA, and (3) has a length of about 1,500 that is a multiple of 3, it could be the protein-coding sequence of a yeast gene
  • If the BRCA1 or BRCA2 gene is mutated, one may develop breast cancer
The examples
• Reasons for the descriptions to be inexact:
  • Incomplete information
    • What mutations on BRCA1/BRCA2? Any mutations on other genes?
  • Exceptions
    • “If one has fever, he/she has the flu” – not everyone with the flu has fever, and not everyone with fever has it because of the flu
  • Intrinsic randomness
Features known, concept unsure
• In many cases, we are interested in the situation where the features are observed but whether a concept is true is unknown
  • We know the sequence of a DNA region, but we do not know whether it corresponds to a protein-coding sequence
  • We know whether the BRCA1 and BRCA2 genes of a subject are mutated (and in which ways), but we do not know whether the subject has developed/will develop breast cancer
  • We know a subject has fever, but we do not know whether he/she has a flu infection or not
Statistical models
• Statistical models provide a principled way to specify the inexact descriptions
• For the flu example, using some symbols:
  • X: a set of features
    • In this example, a single binary feature with X=1 if a subject has fever and X=0 if not
  • Y: the target concept
    • In this example, a binary concept with Y=1 if a subject has flu and Y=0 if not
• A model is a function that predicts values of Y based on observed values of X and parameters θ
Parameters
• Some details of a statistical model are provided by its parameters, θ
• Suppose whether a person with flu has fever can be modeled as a Bernoulli (i.e., coin-flipping) event with probability q1
  • That is, for each person with flu, the probability for him/her to have fever is q1 and the probability not to have fever is 1-q1
  • Different people are assumed to be statistically independent
• Similarly, suppose whether a person without flu has fever can be modeled as a Bernoulli event with probability q2
• Finally, the probability for a person to have flu is p
• Then the whole set of parameters is θ = {p, q1, q2}
Basic probabilities
• Pr(X)Pr(Y|X) = Pr(X and Y)
  • If there is a 20% chance of rain tomorrow, and whenever it rains there is a 60% chance that the temperature will drop, then there is a 0.2 × 0.6 = 0.12 chance that tomorrow it will both rain and have a temperature drop
  • Capital letters mean the statement is true for all values of X and Y
  • Can also write Pr(X=x)Pr(Y=y|X=x) = Pr(X=x and Y=y) for particular values of X and Y
• Law of total probability: Pr(X) = Σ_y Pr(X and Y=y) (the summation should consider all possible values of Y)
  • If there is
    • A 0.12 chance that it will both rain and have a temperature drop tomorrow, and
    • A 0.08 chance that it will both rain and not have a temperature drop tomorrow
  • Then there is a 0.12 + 0.08 = 0.2 chance that it will rain tomorrow
• Bayes’ rule: Pr(X|Y) = Pr(Y|X)Pr(X)/Pr(Y) when Pr(Y) ≠ 0
  • Because Pr(X|Y)Pr(Y) = Pr(Y|X)Pr(X) = Pr(X and Y)
  • Similarly, Pr(X|Y,Z) = Pr(Y|X,Z)Pr(X|Z)/Pr(Y|Z) when Pr(Y|Z) ≠ 0
A complete numeric example
• Assume the following parameters (X: has fever or not; Y: has flu or not):
  • 70% of people with flu have fever: Pr(X=1|Y=1) = 0.7
  • 10% of people without flu have fever: Pr(X=1|Y=0) = 0.1
  • 20% of people have flu: Pr(Y=1) = 0.2
• We have a simple model to predict Y from X:
  • Probability that someone has fever: Pr(X=1) = Pr(X=1,Y=1) + Pr(X=1,Y=0) = Pr(X=1|Y=1)Pr(Y=1) + Pr(X=1|Y=0)Pr(Y=0) = (0.7)(0.2) + (0.1)(1-0.2) = 0.22
  • Probability that someone has flu, given that he/she has fever: Pr(Y=1|X=1) = Pr(X=1|Y=1)Pr(Y=1)/Pr(X=1) = (0.7)(0.2) / 0.22 ≈ 0.64
  • Probability that someone does not have flu, given that he/she has fever: Pr(Y=0|X=1) = 1 - Pr(Y=1|X=1) ≈ 0.36
  • Probability that someone has flu, given that he/she does not have fever: Pr(Y=1|X=0) = Pr(X=0|Y=1)Pr(Y=1) / Pr(X=0) = [1 - Pr(X=1|Y=1)]Pr(Y=1) / [1 - Pr(X=1)] = (1 - 0.7)(0.2) / (1 - 0.22) ≈ 0.08
  • Probability that someone does not have flu, given that he/she does not have fever: Pr(Y=0|X=0) = 1 - Pr(Y=1|X=0) ≈ 0.92
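These numbers are easy to verify mechanically; a small sketch of the same calculation:

```python
# Checking the flu/fever numbers above with Bayes' rule
p_fever_given_flu = 0.7      # Pr(X=1 | Y=1)
p_fever_given_no_flu = 0.1   # Pr(X=1 | Y=0)
p_flu = 0.2                  # Pr(Y=1)

# Law of total probability
p_fever = p_fever_given_flu * p_flu + p_fever_given_no_flu * (1 - p_flu)

# Bayes' rule
p_flu_given_fever = p_fever_given_flu * p_flu / p_fever
p_flu_given_no_fever = (1 - p_fever_given_flu) * p_flu / (1 - p_fever)

print(round(p_fever, 2))               # 0.22
print(round(p_flu_given_fever, 2))     # 0.64
print(round(p_flu_given_no_fever, 2))  # 0.08
```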
Statistical estimation
• Questions we can ask:
  • Given a model, what is the likelihood of the observation?
    • Pr(X|Y,θ) – on the previous page, θ was omitted for simplicity
    • If a person has flu, how likely is he/she to have fever?
  • Given an observation, what is the probability that a concept is true?
    • Pr(Y|X,θ)
    • If a person has fever, what is the probability that he/she has flu?
  • Given some observations, what is the likelihood of a parameter value?
    • Pr(θ|X), or Pr(θ|X,Y) if whether the concept is true is also known
    • Suppose we have observed that among 100 people with flu, 70 have fever. What is the likelihood that q1 is equal to 0.7?
Statistical estimation
• Questions we can ask (cont’d):
  • Maximum likelihood estimation: Given a model with unknown parameter values, what parameter values can maximize the data likelihood?
    • θ* = argmax_θ Pr(X|θ), or θ* = argmax_θ Pr(X,Y|θ) when the concept values are also observed
  • Prediction of concept: Given a model and an observation, what is the concept most likely to be true?
    • y* = argmax_y Pr(Y=y|X,θ)
Generative vs. discriminative modeling
• If a model predicts Y by providing information about Pr(X,Y), it is called a generative model
  • Because we can use the model to generate data
  • Example: Naïve Bayes
• If a model predicts Y by providing information about Pr(Y|X) directly, without providing information about Pr(X,Y), it is called a discriminative model
  • Example: Logistic regression
Classification vs. regression
• If there is a finite number of discrete, mutually exclusive concepts, and we want to find out which one is true for an observation, it is a classification problem and the model is called a classifier
  • Given that the BRCA1 gene of a subject has a deleted exon 2, we want to predict whether the subject will develop breast cancer in his/her lifetime
    • Y=1: the subject will develop breast cancer
    • Y=0: the subject will not develop breast cancer
• If Y takes on continuous values, it is a regression problem and the model is called an estimator
  • Given that the BRCA1 gene of a subject has a deleted exon 2, we want to estimate the lifespan of the subject
    • Y: lifespan of the subject
Bayes classifiers
• In the example of flu (Y) and fever (X), we have seen that if we know Pr(X|Y) and Pr(Y), we can determine Pr(Y|X) by using Bayes’ rule: Pr(Y|X) = Pr(X|Y)Pr(Y)/Pr(X)
• We use capital letters to represent variables (single-valued or vector), and small letters to represent values
• When we do not specify the value, it means the statement is true for all values. For example, all of the following are true according to Bayes’ rule:
  • Pr(Y=1|X=1) = Pr(X=1|Y=1) Pr(Y=1) / Pr(X=1)
  • Pr(Y=1|X=0) = Pr(X=0|Y=1) Pr(Y=1) / Pr(X=0)
  • Pr(Y=0|X=1) = Pr(X=1|Y=0) Pr(Y=0) / Pr(X=1)
  • Pr(Y=0|X=0) = Pr(X=0|Y=0) Pr(Y=0) / Pr(X=0)
Terminology
• Pr(Y) is called the prior probability
  • E.g., Pr(Y=1) is the probability of having flu, without considering any evidence such as fever
  • Can be considered the prior guess that the concept is true before seeing any evidence
• Pr(X|Y) is called the likelihood
  • E.g., Pr(X=1|Y=1) is the probability of having fever if we know one has flu
• Pr(Y|X) is called the posterior probability
  • E.g., Pr(Y=1|X=1) is the probability of having flu, after knowing that one has fever
Generalizations
• In general, the above is true even if:
  • X involves a set of features X = {X(1), X(2), ..., X(m)} instead of a single feature
    • Example: predict whether one has flu after knowing whether he/she has fever, headache and runny nose
  • X can take on more than 2 values, or even continuous values
    • In the latter case, Pr(X) is the probability density of X
    • Examples:
      • Predict whether a person has flu after knowing the number of times he/she has coughed today
      • Predict whether a person has flu after knowing his/her body temperature
Parameter estimation
• Let’s consider the discrete case first
• Suppose we want to estimate the parameters of our flu model by learning from a set of known examples, (X1, Y1), (X2, Y2), ..., (Xn, Yn) – the training set
• How many parameters are there in the model?
  • We need to know the prior probabilities, Pr(Y)
    • Two parameters: Pr(Y=1), Pr(Y=0)
    • Since Pr(Y=1) = 1 - Pr(Y=0), only one independent parameter
  • We need to know the likelihoods, Pr(X|Y)
    • Suppose we have m binary features: fever, headache, runny nose, ...
    • 2^(m+1) parameters for all X and Y value combinations
    • 2(2^m - 1) independent parameters, since for each value y of Y, the sum of all Pr(X=x|Y=y) is one
  • Total: 2(2^m - 1) + 1 independent parameters
• How large should n be in order to estimate these parameters accurately?
  • Very large, given the exponential number of parameters
List of all the parameters
• Let Y be having flu (Y=1) or not (Y=0)
• Let X(1) be having fever (X(1)=1) or not (X(1)=0)
• Let X(2) be having headache (X(2)=1) or not (X(2)=0)
• Let X(3) be having runny nose (X(3)=1) or not (X(3)=0)
• Then the complete list of parameters for a generative model is (parameters that are not independent are shown in gray on the original slide):
  • Pr(Y=0), Pr(Y=1)
  • Pr(X(1)=0, X(2)=0, X(3)=0 | Y=0), Pr(X(1)=0, X(2)=0, X(3)=1 | Y=0), Pr(X(1)=0, X(2)=1, X(3)=0 | Y=0), Pr(X(1)=0, X(2)=1, X(3)=1 | Y=0), Pr(X(1)=1, X(2)=0, X(3)=0 | Y=0), Pr(X(1)=1, X(2)=0, X(3)=1 | Y=0), Pr(X(1)=1, X(2)=1, X(3)=0 | Y=0), Pr(X(1)=1, X(2)=1, X(3)=1 | Y=0)
  • Pr(X(1)=0, X(2)=0, X(3)=0 | Y=1), Pr(X(1)=0, X(2)=0, X(3)=1 | Y=1), Pr(X(1)=0, X(2)=1, X(3)=0 | Y=1), Pr(X(1)=0, X(2)=1, X(3)=1 | Y=1), Pr(X(1)=1, X(2)=0, X(3)=0 | Y=1), Pr(X(1)=1, X(2)=0, X(3)=1 | Y=1), Pr(X(1)=1, X(2)=1, X(3)=0 | Y=1), Pr(X(1)=1, X(2)=1, X(3)=1 | Y=1)
Why is having many parameters a problem?
• Statistically, we will need a lot of data to estimate the values of the parameters accurately
  • Imagine that we need to estimate the 15 independent parameters on the last page with data about only 20 people
• Computationally, estimating the values of an exponential number of parameters could take a long time
Conditional independence
• One way to reduce the number of parameters is to assume conditional independence: if X(1) and X(2) are two features, then
  • Pr(X(1), X(2) | Y) = Pr(X(1) | Y, X(2)) Pr(X(2) | Y)   [standard probability]
                      = Pr(X(1) | Y) Pr(X(2) | Y)         [conditional independence assumption]
  • E.g., the probability for a flu patient to have fever is independent of whether he/she has a runny nose
• Important: this does not imply unconditional independence, i.e., Pr(X(1)) and Pr(X(2)) are not assumed independent, and thus we cannot say Pr(X(1), X(2)) = Pr(X(1)) Pr(X(2))
  • Without knowing whether a person has flu, having fever and having a runny nose are definitely correlated
Conditional independence and Naïve Bayes
• Number of parameters after making the conditional independence assumption:
  • 2 prior probabilities Pr(Y=0) and Pr(Y=1)
    • Only 1 independent parameter, as Pr(Y=1) = 1 - Pr(Y=0)
  • 4m likelihoods Pr(X(j)=x|Y=y) for all possible values of j, x and y
    • Only 2m independent parameters, as Pr(X(j)=1|Y=y) = 1 - Pr(X(j)=0|Y=y) for all possible values of j and y
  • Total: 2m+1 independent parameters, which is much smaller than 2(2^m - 1) + 1!
• The resulting model is usually called a Naïve Bayes model
Estimating the parameters
• Now, suppose we have the known examples (X1, Y1), (X2, Y2), ..., (Xn, Yn) in the training set
• The prior probabilities can be estimated in this way:
  • Pr(Y=y) = [Σ_i 𝕀(Y_i = y)] / n, where 𝕀 is the indicator function, with 𝕀(true) = 1 and 𝕀(false) = 0
  • That is, the fraction of examples with class label y
• Similarly, for any particular feature X(j), its likelihoods can be estimated in this way:
  • Pr(X(j)=x | Y=y) = [Σ_i 𝕀(X_i(j) = x and Y_i = y)] / [Σ_i 𝕀(Y_i = y)]
  • That is, the fraction of class-y examples having value x at feature X(j)
• To avoid zeros, we can add pseudo-counts:
  • Pr(X(j)=x | Y=y) = [Σ_i 𝕀(X_i(j) = x and Y_i = y) + c] / [Σ_i 𝕀(Y_i = y) + 2c], where c has a small value (the 2 in the denominator is the number of possible values of the binary feature X(j))
Example
• Suppose we have the training data as shown on the right (the table on the original slide is not reproduced here; from the estimates below it has n=8 examples, two binary features X(1) and X(2), and 3 examples with Y=1)
• How many parameters does the Naïve Bayes model have?
  • With m=2 features: 2m+1 = 5 independent parameters
• Estimated parameter values using the formulas on the last page:
  • Pr(Y=1) = 3/8
  • Pr(X(1)=1|Y=1) = 2/3
  • Pr(X(1)=1|Y=0) = 2/5
  • Pr(X(2)=1|Y=1) = 1/3
  • Pr(X(2)=1|Y=0) = 1/5
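A sketch of Naïve Bayes estimation and prediction. Since the slide's training table is not reproduced in this text version, the table below is a hypothetical one constructed to be consistent with the estimates above (8 examples, 3 with Y=1, and the stated feature fractions); the function names are illustrative:

```python
# Hypothetical training table consistent with the estimates above
data = [  # rows are (X(1), X(2), Y)
    (1, 1, 1), (1, 0, 1), (0, 0, 1),
    (1, 0, 0), (1, 1, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0),
]

def train_naive_bayes(data, m=2):
    """Estimate Pr(Y=y) and Pr(X(j)=1 | Y=y) by counting (no pseudo-counts)."""
    n = len(data)
    n_y = {y: sum(1 for row in data if row[-1] == y) for y in (0, 1)}
    prior = {y: n_y[y] / n for y in (0, 1)}
    likelihood = {(j, y): sum(1 for row in data
                              if row[j] == 1 and row[-1] == y) / n_y[y]
                  for j in range(m) for y in (0, 1)}
    return prior, likelihood

def predict(x, prior, likelihood):
    """Pick the class y maximizing Pr(Y=y) * prod_j Pr(X(j)=x_j | Y=y)."""
    def joint(y):
        p = prior[y]
        for j, xj in enumerate(x):
            pj = likelihood[(j, y)]
            p *= pj if xj == 1 else 1 - pj
        return p
    return max((0, 1), key=joint)

prior, likelihood = train_naive_bayes(data)
print(prior[1], likelihood[(0, 1)], likelihood[(1, 0)])  # 0.375 0.666... 0.2
print(predict((1, 1), prior, likelihood))                # 1
```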
Meaning of the estimations
• The formulas for estimating the parameters are intuitive
• In fact they are also the maximum likelihood estimators – the values that maximize the likelihood if we assume the data were generated by independent Bernoulli trials
• Let q = Pr(X(j)=1|Y=1) be the probability for a flu patient to have fever
• The likelihood can be expressed as L(q) = q^n1 (1-q)^n0, where n1 and n0 are the numbers of flu patients in the training set with and without fever
  • That is, if a flu patient has fever, we include a q in the product; if a flu patient does not have fever, we include a 1-q in the product
• Finding the value of q that maximizes the likelihood is equivalent to finding the q that maximizes its logarithm, since the logarithm is an increasing function (a > b if and only if ln a > ln b)
• This value can be found by differentiating the log likelihood and equating it to zero:
  • d/dq [n1 ln q + n0 ln(1-q)] = n1/q - n0/(1-q) = 0, which gives q = n1/(n1+n0)
• The formula for estimating the prior probabilities Pr(Y) can be similarly derived
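A quick numeric check of the derivation: the log likelihood should indeed peak at q = n1/(n1+n0). The grid search below uses the earlier example of 70 fever cases among 100 flu patients (the grid resolution is an arbitrary choice):

```python
import math

n1, n0 = 70, 30   # flu patients with and without fever

def log_likelihood(q):
    # log of q^n1 * (1-q)^n0
    return n1 * math.log(q) + n0 * math.log(1 - q)

# search q over a fine grid in (0, 1)
qs = [i / 1000 for i in range(1, 1000)]
best_q = max(qs, key=log_likelihood)
print(best_q, n1 / (n1 + n0))  # 0.7 0.7 - the closed form matches
```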