480 likes | 640 Views
Scores and substitution matrices in sequence alignment. Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ Sushmita Roy sroy@biostat.wisc.edu Sep 11 th , 2014. Key concepts in today’s class. Probability distributions Discrete and continuous continuous distributions
E N D
Scores and substitution matrices in sequence alignment Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ Sushmita Roy sroy@biostat.wisc.edu Sep 11th, 2014
Key concepts in today’s class • Probability distributions • Discrete and continuous continuous distributions • Joint, conditional and marginal distributions • Statistical independence • Probabilistic interpretation of scores in alignment algorithms • Different substitution matrices • Estimating simple substitution matrices • Assessing significance of scores
RECAP • Four issues in sequence alignment • Type of alignment • Algorithm for alignment • Scores for alignment • Significance of alignment scores
Guiding principles of scores in alignments • We want to know whether the alignment we observed is meaningful or by chance • By meaningful we mean whether the sequences represent a similar biological function • To study whether what we observe by chance we need to understand some concepts in probability theory
Definition of probability • Intuitively, we use “probability” to refer to our degree of confidence in an event of an uncertain nature. • Always a number in the interval [0,1] 0 means “never occurs” 1 means “always occurs” • Frequentistinterpretation: the fraction of times an event will be true if we repeated the experiment indefinitely • Bayesian interpretation: degree of belief in an event
Sample spaces • Sample space: a set of possible outcomes for some experiment • Examples • Flight to Chicago: {on time, late} • Lottery: {ticket 1 wins, ticket 2 wins,…,ticket n wins} • Weather tomorrow: {rain, not rain} or {sun, rain, snow} or {sun, clouds, rain, snow, sleet} • Roll of a die: {1,2,3,4,5,6} • Coin toss: {Heads, Tail}
Random variables • Random variable:A variable that represents the outcome of an experiment • A random variable can be • Discrete: Outcomes take a fixed set of values • Roll of die, flight to chicago, weather tomorrow • Continuous: Outcomes take continuous values • Height, weight
Notation • Uppercase letters and words denote random variables • X, Y • Lowercase letters and words denote values • x, y • Probability that X takes value x • We’ll also use the shorthand form • For Boolean random variables, we’ll use the shorthand
0.3 0.2 0.1 sun rain sleet snow clouds Discrete Probability distributions • A probability distribution is a mathematical function that specifies the probability of each possible outcome of a random variable • We denote this as P(X) for random variable X • It specifies the probability of each possible value of X, x • Requirements:
Jointprobability distributions • Joint probability distribution: the function given byP(X = x, Y = y) • Read “X equals xandY equals y” • Example probability that it’s sunny and my flight is on time
Marginalprobability distributions • Themarginal distribution of X is defined by “the distribution of X ignoring other variables” • This definition generalizes to more than two variables, e.g.
Marginal distribution example joint distribution marginal distribution for X
Conditional distributions • The conditional distribution of Xgiven Y is defined as: • Or in short • The distribution of X given that we know the value of Y • Intuitively, how much does knowing Y tell us about X?
Conditional distribution example conditional distribution for X givenY=on-time joint distribution
Independence • Two random variables, X and Y, are independentif • Another way to think about this is knowing X does not tell us anything about Y
Independence example #1 joint distribution marginal distributions Are X and Y independent here? NO.
Independence example #2 joint distribution marginal distributions Are X and Y independent here? YES.
Conditional independence • Two random variables X and Y are conditionally independent given Z if • “once you know the value of Z, knowing Y doesn’t tell you anything about X” • Alternatively
Conditional independence example Are Fever andHeadache independent? NO.
Conditional independence example Are Fever andHeadache conditionally independent given Flu: YES.
Chain rule of probability • For two variables • For three variables • etc. • to see that this is true, note that
Example discrete distributions • Binomial distribution • Multinomial distribution
The binomial distribution • Two outcomes per trial of an experiment • Distribution over the number of successes in a fixed number n of independent trials (with same probability of success p in each) • e.g. the probability of x heads in ncoin flips p=0.5 p=0.1 P(X=x) P(X=x) x x
The multinomial distribution • k possible outcomes on each trial • Probability pifor outcome xi in each trial • Distribution over the number of occurrences xifor each outcome in a fixed number n of independent trials • e.g. with k=6 (a six-sided die) and n=30 vector of outcome occurrences
Continuous random variables • When our outcome is a continuous number we need a continuous random variable • Examples: Weight, Height • We specify a density function for random variable X as • Probabilities are specified over an interval to derive probability values • Probability of taking on a single value is 0.
Continuous random variables contd • To define a probability distribution for a continuous variable, we need to integrate f(x)
Examples of continuous distributions • Gaussian distribution • Exponential distribution • Extreme Value distribution
Gaussian distribution • Gaussian distribution
Extreme Value Distribution • Used for describing the distribution of extreme values of another distribution f(x) Max values from 1000 sets of 500 samples from a standard normal distribution
Guiding principles of scores in alignments • We need to assess whether an alignment is biologically meaningful • Compute the probability of seeing an alignment under two models • Related model R • Unrelated model U
R: related model • Assume each pair of aligned positions evolved from a common ancestor • Let pab be the probability of observing a pair {a,b} • Probability of an alignment betweenxand y is
U: unrelated model • Assume the individual amino acids at a position are independent of the amino acid in another position. • Let qa be the probability of a • The probability of an n-character alignment of x and y is
Determine which model is more likely • The score of an alignment is the relative likelihood of the alignment from Uand R • Taking the log we get
Score of an alignment Score of an alignment is Substitution matrix entry should thus be
How to estimate the probabilities? • Need a good set of confirmed alignments • Depends upon what we know about when the two sequences might have diverged • pab for closely related species is likely to be low if a !=b • pab for species that have diverged a long time ago is likely close to the background.
Some common substitution matrices • BLOSUM matrices [Henikoff and Henikoff, 1992] • BLOSUM45 • BLOSUM50 • BLOSUM62 • Number represents percent identity of sequences used to construct substitution matrices • PAM [Dayhoff et al, 1978] • Empirically, BLOSUM62 works the best
BLOSUM matrices • BLOck Substitution Matrix • Derived from a set of aligned ungapped regions from protein families called BLOCKS • Reside in the BLOCKS database • Cluster proteins such that they have no less than L% of similarity • BLOSUM50: Proteins >50% similarity are in the same group • BLOSUM62: Proteins >62% similarity are in the same group • Calculate substitution frequencies Aab of observing a in one cluster and b in another cluster
Estimating the probabilities in BLOSUM Number of occurrences of a and b Number of occurrences of a
Example of BLOSUM matrix calculation A T C K Q A T C R N A S C K N S S C R N Probabilities at 62% similarity S D C E Q S E C E N 7 T E C R Q Three blocks at 50% similarity Seven blocks at 62% similarity
Assessing significance of the alignment score • There are two ways to do this • Bayesian framework • Classical approach
The classical approach to assessing sequence :Extreme Value Distribution • Suppose we have a particular substitution matrix and amino-acid frequencies • We need to consider random sequences of lengths m and n and finding the best alignment of these sequences • This will give us a distribution over alignment scores for random pairs of sequences • If the probability of a random score being greater than our alignment score is small, we can consider our score to be significant
The extreme value distribution • we’re picking thebest alignments, so we want to know what the distribution of max scores for alignments against a random set of sequences looks like • this is given by an extreme value distribution P(x) x
Assessing significance of sequence score alignments • It can be shown that the mode of the distribution for optimal scores is • K, λestimated from the substitution matrix • Probability of observing a score greater than S
Bayes theorem • An extremely useful theorem • There are many cases when it is hard to estimate P(x| y) directly, but it’s not too hard to estimate P(y| x) andP(x)
Bayes theorem example • MDs usually aren’t good at estimating P(Disorder| Symptom) • They’re usually better at estimating P(Symptom| Disorder) • If we can estimate P(Fever| Flu) and P(Flu) we can use Bayes’ Theorem to do diagnosis