Lecture 7. Basic statistical modeling
The Chinese University of Hong Kong
CSCI3220 Algorithms for Bioinformatics
Lecture outline
• Introduction to statistical modeling
  • Motivating examples
  • Generative and discriminative models
  • Classification and regression
• Bayes and Naïve Bayes classifiers
• Logistic regression
CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014
Part 1 Introduction to Statistical Modeling
Statistical modeling
• We have studied many biological concepts in this course
  • Genes, exons, introns, protein binding sites, ...
• We want to describe a concept by means of some observable features
• Sometimes the description can be (more or less) an exact rule:
  • The enzyme EcoRI cuts the DNA if and only if it sees the sequence GAATTC
• In most cases it is not exact:
  • If a sequence (1) starts with ATG, (2) ends with TAA, TAG or TGA, and (3) has a length of about 1,500 that is a multiple of 3, it could be the protein coding sequence of a yeast gene
  • If the BRCA1 or BRCA2 gene is mutated, one may develop breast cancer
The examples
• Reasons for the descriptions to be inexact:
  • Incomplete information
    • Which mutations on BRCA1/BRCA2? Any mutations on other genes?
  • Exceptions
    • "If one has fever, he/she has flu": not everyone with flu has fever, and not every fever is due to flu
  • Intrinsic randomness
Features known, concept unsure
• In many cases, we are interested in the situation where the features are observed but whether a concept is true is unknown
  • We know the sequence of a DNA region, but we do not know whether it corresponds to a protein coding sequence
  • We know whether the BRCA1 and BRCA2 genes of a subject are mutated (and in which ways), but we do not know whether the subject has developed/will develop breast cancer
  • We know a subject has fever, but we do not know whether he/she has flu
Statistical models
• Statistical models provide a principled way to specify the inexact descriptions
• For the flu example, using some symbols:
  • X: a set of features
    • In this example, a single binary feature with X=1 if a subject has fever and X=0 if not
  • Y: the target concept
    • In this example, a binary concept with Y=1 if a subject has flu and Y=0 if not
• A model is a function that predicts values of Y based on observed values of X and parameters θ
• We have learned one type of statistical model before: HMM
Parameters
• Some details of a statistical model are provided by its parameters, θ (the HMM lecture used a different symbol)
• Suppose whether a person with flu has fever can be modeled as a Bernoulli (i.e., coin-flipping) event with probability q1
  • That is, for each person with flu, the probability for him/her to have fever is q1 and the probability not to have fever is 1-q1
  • Different people are assumed to be statistically independent
• Similarly, suppose whether a person without flu has fever can be modeled as a Bernoulli event with probability q2
• Finally, the probability for a person to have flu is p
• Then the whole set of parameters is θ = {p, q1, q2}
A complete numeric example
• Assume the following parameters (X: has fever or not; Y: has flu or not):
  • 70% of people with flu have fever: Pr(X=1|Y=1) = 0.7
  • 10% of people without flu have fever: Pr(X=1|Y=0) = 0.1
  • 20% of people have flu: Pr(Y=1) = 0.2
• We have a simple model to predict Y from θ and X:
  • Probability that someone has fever: Pr(X=1) = Pr(X=1,Y=1) + Pr(X=1,Y=0) = Pr(X=1|Y=1)Pr(Y=1) + Pr(X=1|Y=0)Pr(Y=0) = (0.7)(0.2) + (0.1)(1-0.2) = 0.22
  • Probability that someone has flu, given that he/she has fever: Pr(Y=1|X=1) = Pr(X=1|Y=1)Pr(Y=1) / Pr(X=1) = (0.7)(0.2) / 0.22 = 0.64
  • Probability that someone does not have flu, given that he/she has fever: Pr(Y=0|X=1) = 1 - Pr(Y=1|X=1) = 0.36
  • Probability that someone has flu, given that he/she does not have fever: Pr(Y=1|X=0) = Pr(X=0|Y=1)Pr(Y=1) / Pr(X=0) = [1 - Pr(X=1|Y=1)]Pr(Y=1) / [1 - Pr(X=1)] = (1 - 0.7)(0.2) / (1 - 0.22) = 0.08
  • Probability that someone does not have flu, given that he/she does not have fever: Pr(Y=0|X=0) = 1 - Pr(Y=1|X=0) = 0.92
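The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the calculation; the variable and function names are illustrative and not part of the lecture.

```python
# Parameters of the flu/fever example (values taken from the slide).
p_fever_given_flu = 0.7     # Pr(X=1|Y=1)
p_fever_given_no_flu = 0.1  # Pr(X=1|Y=0)
p_flu = 0.2                 # Pr(Y=1)

def posterior_flu(has_fever: bool) -> float:
    """Pr(Y=1|X) computed with Bayes' rule."""
    like_flu = p_fever_given_flu if has_fever else 1 - p_fever_given_flu
    like_no_flu = p_fever_given_no_flu if has_fever else 1 - p_fever_given_no_flu
    joint_flu = like_flu * p_flu               # Pr(X, Y=1)
    joint_no_flu = like_no_flu * (1 - p_flu)   # Pr(X, Y=0)
    return joint_flu / (joint_flu + joint_no_flu)

# Marginal probability of fever, Pr(X=1).
p_fever = p_fever_given_flu * p_flu + p_fever_given_no_flu * (1 - p_flu)

print(round(p_fever, 2))               # 0.22
print(round(posterior_flu(True), 2))   # 0.64
print(round(posterior_flu(False), 2))  # 0.08
```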
Statistical estimation
• Questions we can ask:
  • Given a model, what is the likelihood of the observation?
    • Pr(X|Y,θ) (on the previous page, θ was omitted for simplicity)
    • If a person has flu, how likely is he/she to have fever?
  • Given an observation, what is the probability that a concept is true?
    • Pr(Y|X,θ)
    • If a person has fever, what is the probability that he/she has flu?
  • Given some observations, what is the likelihood of a parameter value?
    • Pr(θ|X), or Pr(θ|X,Y) if whether the concept is true is also known
    • Suppose we have observed that among 100 people with flu, 70 have fever. What is the likelihood that q1 is equal to 0.7?
Statistical estimation
• Questions we can ask (cont'd):
  • Maximum likelihood estimation: Given a model with unknown parameter values, what parameter values maximize the data likelihood?
    • $\theta^* = \arg\max_{\theta} \Pr(X|\theta)$ or $\theta^* = \arg\max_{\theta} \Pr(X|Y,\theta)$
  • Prediction of concept: Given a model and an observation, what is the concept most likely to be true?
    • $y^* = \arg\max_{y} \Pr(Y=y|X,\theta)$
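As an illustration of maximum likelihood estimation, the sketch below evaluates the likelihood of the earlier observation (70 of 100 flu patients have fever) for a grid of candidate values of q1 and picks the best one. The binomial form of the likelihood and the grid search are assumptions made for this illustration; the function name is hypothetical.

```python
from math import comb

def binomial_likelihood(q: float, k: int, n: int) -> float:
    """Probability of seeing k fevers among n independent flu patients
    when each patient has fever with probability q."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)

n, k = 100, 70

# Candidate values 0.01, 0.02, ..., 0.99 for q1.
candidates = [i / 100 for i in range(1, 100)]
best_q = max(candidates, key=lambda q: binomial_likelihood(q, k, n))

print(best_q)  # 0.7
```

The maximizer coincides with the observed fraction k/n, which is the closed-form estimate used for the Naïve Bayes likelihoods later in the lecture.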
Generative vs. discriminative modeling
• If a model predicts Y by providing information about Pr(X,Y), it is called a generative model
  • Because we can use the model to generate data
  • Examples: HMM, Naïve Bayes
• If a model predicts Y by providing information about Pr(Y|X) directly without providing information about Pr(X,Y), it is called a discriminative model
  • Example: Logistic regression
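To make "generate data" concrete, here is a minimal sketch that samples subjects from the flu/fever model, reusing the parameter values of the numeric example (p = 0.2, q1 = 0.7, q2 = 0.1). The function name, the seed and the sample size are arbitrary choices for the illustration.

```python
import random

def sample_subject(rng, p=0.2, q1=0.7, q2=0.1):
    """Draw one (X, Y) pair: first Y from the prior, then X from Pr(X|Y)."""
    y = 1 if rng.random() < p else 0
    x = 1 if rng.random() < (q1 if y == 1 else q2) else 0
    return x, y

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_subject(rng) for _ in range(100_000)]

# The fraction of sampled subjects with fever should be close to Pr(X=1) = 0.22.
fever_fraction = sum(x for x, _ in samples) / len(samples)
print(fever_fraction)
```

A discriminative model such as logistic regression specifies only Pr(Y|X), so it gives no recipe for sampling X in this way.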
Classification vs. regression
• If there is a finite number of discrete, mutually exclusive concepts, and we want to find out which one is true for an observation, it is a classification problem and the model is called a classifier
  • Given that the BRCA1 gene of a subject has a deleted exon 2, we want to predict whether the subject will develop breast cancer in his/her lifetime
    • Y=1: the subject will develop breast cancer
    • Y=0: the subject will not develop breast cancer
• If Y takes on continuous values, it is a regression problem and the model is called an estimator
  • Given that the BRCA1 gene of a subject has a deleted exon 2, we want to estimate the lifespan of the subject
    • Y: lifespan of the subject
Part 2 Bayes and Naïve Bayes Classifiers
Bayes classifiers
• In the example of flu (Y) and fever (X), we have seen that if we know Pr(X|Y) and Pr(Y), we can determine Pr(Y|X) by using Bayes' rule: Pr(Y|X) = Pr(X|Y) Pr(Y) / Pr(X)
• We use capital letters to represent variables (single-valued or vector), and small letters to represent values
• When we do not specify the value, it means something is true for all values. For example, all the following are true according to Bayes' rule:
  • Pr(Y=1|X=1) = Pr(X=1|Y=1) Pr(Y=1) / Pr(X=1)
  • Pr(Y=1|X=0) = Pr(X=0|Y=1) Pr(Y=1) / Pr(X=0)
  • Pr(Y=0|X=1) = Pr(X=1|Y=0) Pr(Y=0) / Pr(X=1)
  • Pr(Y=0|X=0) = Pr(X=0|Y=0) Pr(Y=0) / Pr(X=0)
Conventions
• Pr(Y) is called the prior probability
  • E.g., Pr(Y=1) is the probability of having flu, without considering any evidence such as fever
  • Can be considered the prior guess that the concept is true before seeing any evidence
• Pr(X|Y) is called the likelihood
  • E.g., Pr(X=1|Y=1) is the probability of having fever if we know one has flu
• Pr(Y|X) is called the posterior probability
  • E.g., Pr(Y=1|X=1) is the probability of having flu, after knowing that one has fever
Generalizations
• In general, the above is true even if:
  • X involves a set of features X={X(1), X(2), ..., X(m)} instead of a single feature
    • Example: predict whether one has flu after knowing whether he/she has fever, headache and runny nose
  • X can take on continuous values
    • In that case, Pr(X) is the probability density of X
    • Examples:
      • Predict whether a person has flu after knowing his/her body temperature
      • Predict whether a gene is involved in a biological pathway given its expression values in several conditions
Parameter estimation
• Let's consider the discrete case first
• Suppose we want to estimate the parameters of our flu model by learning from a set of known examples, (X1, Y1), (X2, Y2), ..., (Xn, Yn): the training set
• How many parameters are there in the model?
  • We need to know the prior probabilities, Pr(Y)
    • Two parameters: Pr(Y=1), Pr(Y=0)
    • Since Pr(Y=1) = 1 - Pr(Y=0), only one independent parameter
  • We need to know the likelihoods, Pr(X|Y)
    • Suppose we have m binary features: fever, headache, runny nose, ...
    • 2^(m+1) parameters for all X and Y value combinations
    • 2(2^m - 1) independent parameters, since for each value y of Y, the sum of all Pr(X=x|Y=y) is one
  • Total: 2(2^m - 1) + 1 independent parameters
• How large should n be in order to estimate these parameters accurately?
  • Very large, given the exponential number of parameters
List of all the parameters
• Let Y be having flu (Y=1) or not (Y=0)
• Let X(1) be having fever (X(1)=1) or not (X(1)=0)
• Let X(2) be having headache (X(2)=1) or not (X(2)=0)
• Let X(3) be having runny nose (X(3)=1) or not (X(3)=0)
• Then the complete list of parameters for a generative model is as follows (the original slide shows the parameters that are not independent in gray; in each of the three groups, one parameter is determined by the others):
  • Pr(Y=0), Pr(Y=1)
  • Pr(X(1)=0, X(2)=0, X(3)=0|Y=0), Pr(X(1)=0, X(2)=0, X(3)=1|Y=0), Pr(X(1)=0, X(2)=1, X(3)=0|Y=0), Pr(X(1)=0, X(2)=1, X(3)=1|Y=0), Pr(X(1)=1, X(2)=0, X(3)=0|Y=0), Pr(X(1)=1, X(2)=0, X(3)=1|Y=0), Pr(X(1)=1, X(2)=1, X(3)=0|Y=0), Pr(X(1)=1, X(2)=1, X(3)=1|Y=0)
  • Pr(X(1)=0, X(2)=0, X(3)=0|Y=1), Pr(X(1)=0, X(2)=0, X(3)=1|Y=1), Pr(X(1)=0, X(2)=1, X(3)=0|Y=1), Pr(X(1)=0, X(2)=1, X(3)=1|Y=1), Pr(X(1)=1, X(2)=0, X(3)=0|Y=1), Pr(X(1)=1, X(2)=0, X(3)=1|Y=1), Pr(X(1)=1, X(2)=1, X(3)=0|Y=1), Pr(X(1)=1, X(2)=1, X(3)=1|Y=1)
Why is having many parameters a problem?
• Statistically, we will need a lot of data to accurately estimate the values of the parameters
  • Imagine that we need to estimate the 15 independent parameters on the last page with data about only 20 people
• Computationally, estimating the values of an exponential number of parameters could take a long time
Conditional independence
• One way to reduce the number of parameters is to assume conditional independence: If X(1) and X(2) are two features, then
  • Pr(X(1), X(2)|Y) = Pr(X(1)|Y,X(2)) Pr(X(2)|Y) [Standard probability]
  • = Pr(X(1)|Y) Pr(X(2)|Y) [Conditional independence assumption]
  • The probability for a flu patient to have fever is independent of whether he/she has runny nose
• Important: This does not imply unconditional independence, i.e., X(1) and X(2) are not assumed to be independent, and thus we cannot say Pr(X(1), X(2)) = Pr(X(1)) Pr(X(2))
  • Without knowing whether a person has flu, having fever and having runny nose are definitely correlated
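A small numeric check makes the last point concrete. In the sketch below, two features are conditionally independent given Y by construction, yet their joint probability differs from the product of their marginals. The fever parameters reuse the earlier numeric example; the runny-nose probabilities (0.6 and 0.2) are hypothetical values chosen only for illustration.

```python
p_flu = 0.2                      # Pr(Y=1)
p_fever = {1: 0.7, 0: 0.1}       # Pr(X(1)=1|Y=y)
p_runny = {1: 0.6, 0: 0.2}       # Pr(X(2)=1|Y=y), hypothetical values
p_y = {1: p_flu, 0: 1 - p_flu}

# Joint probability of fever AND runny nose, using conditional independence
# inside each class and then summing over the two classes.
joint = sum(p_fever[y] * p_runny[y] * p_y[y] for y in (0, 1))

# Marginal probabilities of each symptom.
marg_fever = sum(p_fever[y] * p_y[y] for y in (0, 1))
marg_runny = sum(p_runny[y] * p_y[y] for y in (0, 1))
product = marg_fever * marg_runny

print(round(joint, 4), round(product, 4))  # 0.1 0.0616
```

The joint probability is larger than the product of the marginals: the two symptoms are positively correlated when the flu status is unknown.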
Conditional independence and Naïve Bayes
• Number of parameters after making the conditional independence assumption:
  • 2 prior probabilities Pr(Y=0) and Pr(Y=1)
    • Only 1 independent parameter, as Pr(Y=1) = 1 - Pr(Y=0)
  • 4m likelihoods Pr(X(j)=x|Y=y) for all possible values of j, x and y
    • Only 2m independent parameters, as Pr(X(j)=1|Y=y) = 1 - Pr(X(j)=0|Y=y) for all possible values of j and y
  • Total: 4m+2 parameters, of which only 2m+1 are independent, which is much smaller than 2(2^m - 1) + 1 when m is large!
• The resulting model is usually called a Naïve Bayes model
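The two counts of independent parameters can be compared directly for binary features. The function names below are illustrative.

```python
def full_model_params(m: int) -> int:
    """Independent parameters of the full generative model with m binary features."""
    return 2 * (2**m - 1) + 1

def naive_bayes_params(m: int) -> int:
    """Independent parameters under the conditional independence assumption."""
    return 2 * m + 1

for m in (3, 10, 20):
    print(m, full_model_params(m), naive_bayes_params(m))
# 3 15 7
# 10 2047 21
# 20 2097151 41
```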
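Estimating the parameters
• Now, suppose we have the known examples (X1, Y1), (X2, Y2), ..., (Xn, Yn) in the training set
• The prior probabilities can be estimated in this way:
  • $\Pr(Y=y) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}(Y_i=y)$, where 𝕀 is the indicator function, with 𝕀(true) = 1 and 𝕀(false) = 0
  • That is, the fraction of examples with class label y
• Similarly, for any particular feature X(j), its likelihoods can be estimated in this way:
  • $\Pr(X^{(j)}=x|Y=y) = \frac{\sum_{i=1}^{n} \mathbb{I}(X_i^{(j)}=x,\, Y_i=y)}{\sum_{i=1}^{n} \mathbb{I}(Y_i=y)}$, where $X_i^{(j)}$ is the value of feature X(j) in example i
  • That is, the fraction of class y examples having value x at feature X(j)
• To avoid zeros, we can add pseudo-counts:
  • $\Pr(X^{(j)}=x|Y=y) = \frac{\sum_{i=1}^{n} \mathbb{I}(X_i^{(j)}=x,\, Y_i=y) + c}{\sum_{i=1}^{n} \mathbb{I}(Y_i=y) + 2c}$, where c has a small value (for a binary feature, the 2c in the denominator keeps the two estimates summing to one)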
Example
• Suppose we have the training data shown in the table on the slide (the table is not reproduced in this transcript)
• How many parameters does the Naïve Bayes model have?
• Estimated parameter values using the formulas on the last page:
  • Pr(Y=1) = 3/8
  • Pr(X(1)=1|Y=1) = 2/3
  • Pr(X(1)=1|Y=0) = 2/5
  • Pr(X(2)=1|Y=1) = 1/3
  • Pr(X(2)=1|Y=0) = 1/5
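The counting estimates can be sketched in Python. Because the slide's table is not available here, the eight rows below are a hypothetical data set constructed only so that its counts reproduce the estimates listed above; the actual rows on the slide may differ. The function names and the optional pseudo-count argument c are illustrative.

```python
from fractions import Fraction

# Hypothetical rows (fever, headache, flu), chosen only to match the slide's counts.
data = [
    (1, 1, 1), (1, 0, 1), (0, 0, 1),
    (1, 1, 0), (1, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0),
]

def prior(y):
    """Fraction of training examples with class label y."""
    return Fraction(sum(1 for row in data if row[2] == y), len(data))

def likelihood(j, x, y, c=0):
    """Estimate Pr(X(j)=x | Y=y) for binary feature index j (0 = fever, 1 = headache).
    c is an optional pseudo-count added to avoid zero estimates."""
    n_y = sum(1 for row in data if row[2] == y)
    n_xy = sum(1 for row in data if row[2] == y and row[j] == x)
    return (Fraction(n_xy) + c) / (Fraction(n_y) + 2 * c)

print(prior(1))                                   # 3/8
print(likelihood(0, 1, 1), likelihood(0, 1, 0))   # 2/3 2/5
print(likelihood(1, 1, 1), likelihood(1, 1, 0))   # 1/3 1/5
print(likelihood(1, 1, 0, c=1))                   # 2/7 with a pseudo-count of 1
```

With two binary features, these five values are all the independent parameters of the model (2m+1 with m=2).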
Meaning of the estimations
• The formulas for estimating the parameters are intuitive
• In fact they are also the maximum likelihood estimators: the values that maximize the likelihood if we assume the data were generated by independent Bernoulli trials
• Let q = Pr(X(j)=1|Y=1) be the probability for a flu patient to have fever
• This likelihood can be expressed as $L(q) = \prod_{i:\, Y_i=1} q^{X_i^{(j)}} (1-q)^{1-X_i^{(j)}}$
  • That is, if a flu patient has fever, we include a q in the product; if a flu patient does not have fever, we include a 1-q in the product
• Finding the value of q that maximizes the likelihood is equivalent to finding the q that maximizes its logarithm, since the logarithm is an increasing function (a > b ⇔ ln a > ln b)
• This value can be found by differentiating the log likelihood and equating it to zero:
  • $\frac{d \ln L(q)}{dq} = \sum_{i:\, Y_i=1} \left[ \frac{X_i^{(j)}}{q} - \frac{1-X_i^{(j)}}{1-q} \right] = 0 \;\Rightarrow\; q = \frac{\sum_{i:\, Y_i=1} X_i^{(j)}}{\sum_{i=1}^{n} \mathbb{I}(Y_i=1)}$
• The formula for estimating the prior probabilities Pr(Y) can be similarly derived
Short summary
• So far, we have obtained the formulas for estimating the parameters of a Naïve Bayes model, which correspond to the parameter values, among all possible values, that maximize the data likelihood
• The parameter estimates:
  • Prior probabilities: $\Pr(Y=y) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}(Y_i=y)$
  • Likelihoods: $\Pr(X^{(j)}=x|Y=y) = \frac{\sum_{i=1}^{n} \mathbb{I}(X_i^{(j)}=x,\, Y_i=y)}{\sum_{i=1}^{n} \mathbb{I}(Y_i=y)}$
Using the model
• Now with Pr(Y=y) and Pr(X(j)=x|Y=y) estimated for all features j and all values x and y, the model can be applied to estimate Pr(Y=y|X) for any X, whether in the training set or not
• Recall that $\Pr(Y=y|X) = \frac{\Pr(Y=y)\prod_{j=1}^{m}\Pr(X^{(j)}|Y=y)}{\sum_{y'}\Pr(Y=y')\prod_{j=1}^{m}\Pr(X^{(j)}|Y=y')}$
• For classification, we can compare Pr(Y=1|X) and Pr(Y=0|X), and
  • Predict X to be of class 1 if the former is larger
  • Predict X to be of class 0 if the latter is larger
Example
• Suppose we have the same training data as in the previous example
• Parameter values of the Naïve Bayes model we have previously estimated:
  • Pr(Y=1) = 3/8
  • Pr(X(1)=1|Y=1) = 2/3
  • Pr(X(1)=1|Y=0) = 2/5
  • Pr(X(2)=1|Y=1) = 1/3
  • Pr(X(2)=1|Y=0) = 1/5
• Now, for a new subject with fever but no headache, we would predict his/her probability of having flu as
  • Pr(Y=1|X(1)=1,X(2)=0)
  • = Pr(X(1)=1|Y=1)Pr(X(2)=0|Y=1)Pr(Y=1) / [Pr(X(1)=1|Y=1)Pr(X(2)=0|Y=1)Pr(Y=1) + Pr(X(1)=1|Y=0)Pr(X(2)=0|Y=0)Pr(Y=0)]
  • = (2/3)(1-1/3)(3/8) / [(2/3)(1-1/3)(3/8) + (2/5)(1-1/5)(1-3/8)]
  • = 5/11
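The same prediction can be reproduced with exact fractions. This is a sketch using the five estimated values from the slide; the dictionary and function names are illustrative.

```python
from fractions import Fraction as F

prior = {1: F(3, 8), 0: F(5, 8)}        # Pr(Y=y)
p_fever = {1: F(2, 3), 0: F(2, 5)}      # Pr(X(1)=1|Y=y)
p_headache = {1: F(1, 3), 0: F(1, 5)}   # Pr(X(2)=1|Y=y)

def posterior_flu(fever: int, headache: int) -> F:
    """Pr(Y=1 | X(1)=fever, X(2)=headache) under the Naive Bayes model."""
    def joint(y):
        l1 = p_fever[y] if fever else 1 - p_fever[y]
        l2 = p_headache[y] if headache else 1 - p_headache[y]
        return l1 * l2 * prior[y]
    return joint(1) / (joint(1) + joint(0))

print(posterior_flu(1, 0))  # 5/11
```

Since 5/11 is below 1/2, the classifier would predict class 0 (no flu) for this subject.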
Numeric features
• If X(j) can take on continuous values, we need a continuous distribution instead of a discrete one
  • Fever is a feature with binary values: 1 means "has fever"; 0 means "does not have fever"
  • Body temperature is a feature with continuous values
• For the features with binary values, we have assumed that each feature X(j) has a Bernoulli distribution conditioned on Y, i.e., Pr(X(j)=1|Y=y) = q with the value of parameter q to be estimated
• For continuous values, we can similarly estimate Pr(X(j)=x|Y=y) based on an assumed distribution
Gaussian distribution
• Suppose the body temperatures of flu patients follow a Gaussian distribution: $\Pr(X^{(j)}=x|Y=1) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
• There are two parameters to estimate:
  • The mean (center) of the distribution, μ
  • The variance (spread) of the distribution, σ²
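Estimating the parameters
• Maximum likelihood estimations [optional], where x_1, ..., x_n are the observed values of the feature in the training examples of the class:
  • $\ln L(\mu,\sigma^2) = \sum_{i=1}^{n}\left[-\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x_i-\mu)^2}{2\sigma^2}\right]$
  • $\frac{\partial \ln L}{\partial \mu} = \sum_{i=1}^{n}\frac{x_i-\mu}{\sigma^2} = 0 \;\Rightarrow\; \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
  • $\frac{\partial \ln L}{\partial \sigma^2} = \sum_{i=1}^{n}\left[-\frac{1}{2\sigma^2} + \frac{(x_i-\mu)^2}{2\sigma^4}\right] = 0 \;\Rightarrow\; \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2$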
Estimating the parameters
• Results:
  • The formulas: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2$, where x_1, ..., x_n are the observed values in the training data
  • Meanings: the mean and variance of the training data
• The above formula for the variance is a biased estimator (i.e., if you have many sets of training data and estimate the variance with this formula each time, the average of the estimates does not converge to the actual variance of the Gaussian distribution)
• We may use the sample variance instead (dividing by n-1 rather than n), which is the minimum variance unbiased estimator; see further readings
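The difference between the two variance formulas is easy to see on a small sample. The temperatures below are made-up values used only for illustration; the comparison with Python's `statistics` module is a sanity check, not part of the lecture.

```python
import statistics

# Hypothetical body temperatures (degrees Celsius) of five flu patients.
temps = [38.2, 38.9, 39.1, 37.8, 38.5]
n = len(temps)

mu_hat = sum(temps) / n                                       # maximum likelihood mean
var_mle = sum((x - mu_hat) ** 2 for x in temps) / n           # biased (divides by n)
var_sample = sum((x - mu_hat) ** 2 for x in temps) / (n - 1)  # unbiased (divides by n-1)

print(mu_hat, var_mle, var_sample)

# The standard library exposes both conventions.
print(statistics.pvariance(temps), statistics.variance(temps))
```

The maximum likelihood estimate is always smaller than the sample variance by the factor (n-1)/n, which matters most when n is small.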
Part 3 Logistic Regression
Discriminative learning
• In the Bayes and Naïve Bayes classifiers, in order to compute Pr(Y|X) we need to model Pr(X,Y) or Pr(X|Y)Pr(Y)
  • It seems to be complicating things: using the solution of a harder problem [modeling Pr(X,Y)] to answer an easier question [Pr(Y|X)]
• We may not always have a good idea of how to model Pr(X|Y) and Pr(Y)
  • For example, while assuming a Gaussian for Pr(X|Y) is mathematically convenient, is it really suitable?
  • What if we cannot find a well-studied distribution that fits the data well, or it is difficult to derive the maximum likelihood estimation formulas?
• We now study a discriminative method that models Pr(Y|X) directly
Logistic regression: the idea
• The logistic regression model relies on the assumption that the class can be determined by a linear combination of the features
• Conceptually, we hope to have a rule of this type: "If a1X(1) + a2X(2) + ... + amX(m) ≥ t, then Y=1; otherwise, Y=0"
  • If 0.2 × <body temperature> + 0.5 × <headache> + 0.6 × <runny nose> ≥ 8.1, then <has flu> = 1
• The coefficients a1, a2, ..., am and the threshold t are model parameters, the values of which we want to estimate from training data
• Graphically, the rule is a step function
[Figure: step function of a1X(1) + a2X(2) + ... + amX(m), with the step at t]
Logistic regression: actual form
• However, the step function is mathematically not easy to handle
  • For instance, it is not continuous at the threshold, and thus not differentiable there
• Let's model Pr(Y=1|X) using a smooth function, and then derive the classification rule based on it:
  • Let f(X) = exp(a1X(1) + a2X(2) + ... + amX(m) - t)
  • Pr(Y=1|X) = f(X) / [1 + f(X)] (the logistic function)
  • Pr(Y=0|X) = 1 - Pr(Y=1|X) = 1 / [1 + f(X)]
  • When a1X(1) + a2X(2) + ... + amX(m) >> t, Pr(Y=1|X) ≈ 1 and Pr(Y=0|X) ≈ 0
  • When a1X(1) + a2X(2) + ... + amX(m) << t, Pr(Y=1|X) ≈ 0 and Pr(Y=0|X) ≈ 1
  • When a1X(1) + a2X(2) + ... + amX(m) = t, Pr(Y=1|X) = Pr(Y=0|X) = 0.5
• We predict X to be of class 1 if Pr(Y=1|X) ≥ Pr(Y=0|X), i.e., f(X) ≥ 1
• We need to estimate the values of the model parameters a1, ..., am and t from training data. We will discuss how to do it.
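The logistic form can be sketched as a short function. The coefficients and threshold below reuse the illustrative rule from the previous slide (0.2, 0.5, 0.6 and 8.1); they are not fitted values, and the function name and the two example subjects are hypothetical.

```python
import math

def prob_class1(x, a, t):
    """Pr(Y=1|X) = f(X) / (1 + f(X)) with f(X) = exp(sum_j a_j * x_j - t).
    Written as 1 / (1 + exp(-z)), which is algebraically the same."""
    z = sum(aj * xj for aj, xj in zip(a, x)) - t
    return 1.0 / (1.0 + math.exp(-z))

a = [0.2, 0.5, 0.6]  # weights for body temperature, headache, runny nose
t = 8.1              # threshold

sick = [38.0, 1, 1]     # 0.2*38.0 + 0.5 + 0.6 = 8.7, above the threshold
healthy = [37.5, 0, 0]  # 0.2*37.5 = 7.5, below the threshold

print(prob_class1(sick, a, t))     # greater than 0.5, so predict Y=1
print(prob_class1(healthy, a, t))  # less than 0.5, so predict Y=0
```

For extremely negative values of z, `math.exp(-z)` can overflow; a production implementation would guard against that, which this sketch does not.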
Visualizing the functions
• The rule "if f(X) ≥ 1, then predict Y=1" is exactly the same as "if a1X(1) + a2X(2) + ... + amX(m) ≥ t, then predict Y=1"
• We can set the parameter values so that Pr(Y=1|X) and Pr(Y=0|X) are very similar to the original step function
[Figure panels: the original step function; Pr(Y=1|X) and Pr(Y=0|X); f(X), the ratio of Pr(Y=1|X) to Pr(Y=0|X)]
Estimating the parameters
• For Naïve Bayes, we estimated parameters using maximum likelihood, i.e., to find θ such that $\prod_{i=1}^{n}\Pr(X_i|Y_i,\theta)$ and $\prod_{i=1}^{n}\Pr(Y_i|\theta)$, or their logarithms, are maximized
• For logistic regression, we do not have models for these probabilities, so instead we directly maximize the conditional data likelihood, $L(\theta) = \prod_{i=1}^{n}\Pr(Y_i|X_i,\theta)$, where θ includes the parameters a1, a2, ..., am and t
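Maximizing the conditional likelihood
• The log conditional data likelihood can be written as follows [optional]:
  • $\ln L(\theta) = \sum_{i=1}^{n}\left[Y_i \ln \Pr(Y_i=1|X_i,\theta) + (1-Y_i)\ln \Pr(Y_i=0|X_i,\theta)\right]$
  • $= \sum_{i=1}^{n}\left[Y_i\left(\sum_{j=1}^{m} a_j X_i^{(j)} - t\right) - \ln\left(1 + \exp\left(\sum_{j=1}^{m} a_j X_i^{(j)} - t\right)\right)\right]$, where $X_i^{(j)}$ is the value of feature X(j) in example i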
Maximizing the conditional likelihood
• Again, we can write down the expression for the partial derivative of ln L(θ) with respect to each parameter:
  • $\frac{\partial \ln L(\theta)}{\partial a_j} = \sum_{i=1}^{n} X_i^{(j)}\left[Y_i - \Pr(Y_i=1|X_i,\theta)\right]$
  • $\frac{\partial \ln L(\theta)}{\partial t} = -\sum_{i=1}^{n}\left[Y_i - \Pr(Y_i=1|X_i,\theta)\right]$
• However, when we set them to 0, each equation involves multiple parameters and we cannot get their optimal values separately
  • That is, they form a system of non-linear equations
Gradient ascent
• The system of non-linear equations has no closed-form solution
• Instead, we use a numerical method to solve it
• Main idea: since we want each partial derivative to be zero, we update the parameters iteratively so that the derivatives move closer to zero
• For example, since $\frac{\partial \ln L(\theta)}{\partial t} = -\sum_{i=1}^{n}\left[Y_i - \Pr(Y_i=1|X_i,\theta)\right]$, we use the following update rule for t: $t \leftarrow t + \eta\,\frac{\partial \ln L(\theta)}{\partial t}$, where η is a small positive constant
• On the right-hand side of the assignment, the current estimates of the parameters are used
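A minimal gradient ascent loop for logistic regression might look as follows. It uses the gradients of the log conditional likelihood for the model Pr(Y=1|X) = f(X)/[1+f(X)]: the derivative with respect to a_j is the sum of X_i(j)[Y_i - Pr(Y_i=1|X_i)] and the derivative with respect to t is the negative sum of [Y_i - Pr(Y_i=1|X_i)]. The tiny one-feature data set, the step size `eta` and the number of steps are assumptions chosen for illustration, not values from the lecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, Y, eta=0.05, steps=5000):
    """Gradient ascent on the log conditional likelihood.
    Returns the coefficients a (one per feature) and the threshold t."""
    m = len(X[0])
    a = [0.0] * m
    t = 0.0
    for _ in range(steps):
        # Residuals Y_i - Pr(Y_i=1|X_i) under the current estimates.
        resid = [y - sigmoid(sum(aj * xj for aj, xj in zip(a, x)) - t)
                 for x, y in zip(X, Y)]
        grad_a = [sum(r * x[j] for r, x in zip(resid, X)) for j in range(m)]
        grad_t = -sum(resid)
        # Move a small step in the direction of the gradient.
        a = [aj + eta * g for aj, g in zip(a, grad_a)]
        t = t + eta * grad_t
    return a, t

# Hypothetical one-feature training set; the classes overlap, so a finite optimum exists.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
Y = [0, 0, 1, 0, 1, 1]

a, t = fit_logistic(X, Y)
print(a, t, t / a[0])  # t / a[0] is the feature value where Pr(Y=1|X) = 0.5
```

A step size that is too large makes the iterations overshoot and possibly diverge, while one that is too small needs many more steps; the value here was picked by hand for this data set.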
Meaning of gradient ascent
• Why? It is like climbing a hill
• At each point, the gradient is the direction with maximum increase
• We want to move in that direction, but only by a small step each time so as not to overshoot
• We don't know exactly how large the step should be, because if we knew that we could jump to the peak directly (i.e., when we have the closed-form formulas, as in the case of maximum likelihood for Gaussian Naïve Bayes)
[Figure: hill-climbing illustration of ln L(θ) over the parameters (e.g., t and a1), marking the current estimate, the direction with maximum increase, and the new estimate. Image source: http://www.absoluteastronomy.com/topics/Hill_climbing]
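Relationship with Naïve Bayes
• Interestingly, logistic regression has a tight relationship with Naïve Bayes when each Pr(X(j)|Y) is a Gaussian distribution
• [Optional] First, in general, the posterior probability Pr(Y=1|X) of Naïve Bayes is as follows:
  • $\Pr(Y=1|X) = \frac{\Pr(Y=1)\prod_{j}\Pr(X^{(j)}|Y=1)}{\Pr(Y=1)\prod_{j}\Pr(X^{(j)}|Y=1) + \Pr(Y=0)\prod_{j}\Pr(X^{(j)}|Y=0)}$
  • $= \frac{1}{1 + \exp\left(\ln\frac{\Pr(Y=0)}{\Pr(Y=1)} + \sum_{j=1}^{m}\ln\frac{\Pr(X^{(j)}|Y=0)}{\Pr(X^{(j)}|Y=1)}\right)}$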
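Relationship with Naïve Bayes
• [Optional] Suppose Pr(X(j)|Y=1) has mean μ_j1 and variance σ_j², and Pr(X(j)|Y=0) has mean μ_j0 and variance σ_j² (different means but same variance), then
  • $\ln\frac{\Pr(X^{(j)}|Y=0)}{\Pr(X^{(j)}|Y=1)} = \frac{(X^{(j)}-\mu_{j1})^2 - (X^{(j)}-\mu_{j0})^2}{2\sigma_j^2} = \frac{\mu_{j0}-\mu_{j1}}{\sigma_j^2}X^{(j)} + \frac{\mu_{j1}^2-\mu_{j0}^2}{2\sigma_j^2}$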
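Relationship with Naïve Bayes
• Plugging the Gaussian log-ratio back into the formula for Pr(Y=1|X), we have $\Pr(Y=1|X) = \frac{1}{1 + \exp\left(\ln\frac{\Pr(Y=0)}{\Pr(Y=1)} + \sum_{j=1}^{m}\left[\frac{\mu_{j0}-\mu_{j1}}{\sigma_j^2}X^{(j)} + \frac{\mu_{j1}^2-\mu_{j0}^2}{2\sigma_j^2}\right]\right)}$, which is exactly the form of logistic regression, $\Pr(Y=1|X) = \frac{1}{1+\exp\left(t - \sum_{j} a_j X^{(j)}\right)}$, with coefficients $a_j = \frac{\mu_{j1}-\mu_{j0}}{\sigma_j^2}$ and threshold $t = \ln\frac{\Pr(Y=0)}{\Pr(Y=1)} + \sum_{j=1}^{m}\frac{\mu_{j1}^2-\mu_{j0}^2}{2\sigma_j^2}$
• The weight of a feature, aj, depends on how well it separates the two classes
• The threshold depends on the means of the Gaussians and the prior probabilities of the two classes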
Relationship with Naïve Bayes
• In summary, if:
  • A Naïve Bayes classifier models Pr(X(j)|Y) by a Gaussian distribution with equal variance for Pr(X(j)|Y=1) and Pr(X(j)|Y=0), AND
  • A logistic regression classifier uses the coefficients and threshold determined by those Gaussian means and variances and the class priors
• Then their predictions are exactly the same
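The equivalence can be checked numerically. The sketch below builds a Gaussian Naïve Bayes posterior with equal per-feature variances in the two classes, derives logistic regression coefficients a_j = (μ_j1 - μ_j0)/σ_j² and threshold t = ln[Pr(Y=0)/Pr(Y=1)] + Σ_j (μ_j1² - μ_j0²)/(2σ_j²) from the same parameters, and compares the two posteriors at a test point. All numeric values are hypothetical.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical Gaussian Naive Bayes parameters for two features.
prior1 = 0.3          # Pr(Y=1)
mu1 = [38.5, 2.0]     # class-1 means
mu0 = [36.8, 1.0]     # class-0 means
var = [0.5, 1.5]      # variances, shared by both classes

def nb_posterior(x):
    """Pr(Y=1|X) computed directly from the Naive Bayes model."""
    j1, j0 = prior1, 1 - prior1
    for xj, m1, m0, v in zip(x, mu1, mu0, var):
        j1 *= gaussian_pdf(xj, m1, v)
        j0 *= gaussian_pdf(xj, m0, v)
    return j1 / (j1 + j0)

# Logistic regression parameters derived from the same quantities.
a = [(m1 - m0) / v for m1, m0, v in zip(mu1, mu0, var)]
t = math.log((1 - prior1) / prior1) + sum(
    (m1**2 - m0**2) / (2 * v) for m1, m0, v in zip(mu1, mu0, var))

def lr_posterior(x):
    """Pr(Y=1|X) from the logistic form with the derived a and t."""
    z = sum(aj * xj for aj, xj in zip(a, x)) - t
    return 1.0 / (1.0 + math.exp(-z))

x_test = [37.5, 1.2]
print(nb_posterior(x_test), lr_posterior(x_test))  # the two values agree
```

The agreement holds for these derived parameters only; a logistic regression model fitted by gradient ascent on real data would generally arrive at different values.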
Remarks
• When the conditional independence assumption is not true, logistic regression can be more accurate than a Naïve Bayes classifier
• However, when there are few observed data points (i.e., n is small), Naïve Bayes could be more accurate
Epilogue Case Study, Summary and Further Readings
Case study: Fallacies related to statistics
• "According to this gene model, this DNA sequence has a data likelihood of 0.6, while according to this model for intergenic regions, this DNA sequence has a data likelihood of 0.1. Therefore the sequence is more likely to be a gene."
• Right or wrong?
Case study: Fallacies related to statistics
• Likelihood vs. posterior:
  • If Y represents whether the sequence is a gene (Y=1) or not (Y=0), and X is the sequence features, then the above statement is comparing the likelihoods Pr(X|Y=1) and Pr(X|Y=0), but we know that the posterior is Pr(Y|X) = Pr(X|Y)Pr(Y)/Pr(X), and that Pr(Y=1) << Pr(Y=0)
• Another famous example: "This cancer test has a 99% accuracy, and is therefore highly reliable."
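Both fallacies can be illustrated numerically once a prior is assumed. In the sketch below the likelihoods 0.6 and 0.1 come from the case study, while the prior Pr(gene) = 0.05, the cancer prevalence of 0.1%, and the reading of "99% accuracy" as 99% sensitivity and 99% specificity are all assumptions made only for this illustration.

```python
def posterior(like1: float, like0: float, prior1: float) -> float:
    """Pr(Y=1|X) from the two likelihoods Pr(X|Y=1), Pr(X|Y=0) and the prior Pr(Y=1)."""
    joint1 = like1 * prior1
    joint0 = like0 * (1 - prior1)
    return joint1 / (joint1 + joint0)

# Gene example: likelihoods from the slide, hypothetical prior of 5% genic sequence.
p_gene = posterior(0.6, 0.1, 0.05)
print(round(p_gene, 2))  # 0.24, so "not a gene" is still the more probable class

# Cancer test: hypothetical 0.1% prevalence, 99% sensitivity and 99% specificity.
p_cancer = posterior(0.99, 0.01, 0.001)
print(round(p_cancer, 3))  # 0.09, so most positive results are false positives
```

With different priors the conclusions change, which is exactly why the likelihoods alone are not enough.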