This lecture discusses the challenges of detecting and aligning protein sequences for distant homology, and introduces the Maximum Entropy approach for modeling protein sequence probability distributions. It explores how coevolution analysis can extract information from multiple sequence alignments and infer residue-residue contacts. The lecture also covers concepts in probability theory and information theory, and their application in protein sequence analysis.
Lecture 3: Maximum Entropy approach for modeling protein sequence probability distributions
The general problem
• Distant (remote) homology poses challenges:
  • different lengths
  • abundant changes at each position
• How to detect homology?
• How to align sequences?
Learning from variation
If the problem can be solved for a set of sequences "representative" of the family, then we can leverage this knowledge to assess whether or not a given sequence "looks like" this group. In other words: can we model the joint probability P(x1, x2, …, xL) of observing a sequence x1, x2, …, xL?
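As a first, deliberately naive illustration of what such a model could look like, here is a minimal sketch that treats each column independently (ignoring the correlations discussed later) and scores how "family-like" a sequence is from per-column frequencies. The toy alignment and sequences are invented for illustration only.

```python
import numpy as np

# Toy MSA: each row is a sequence from the "family" (made-up example data)
msa = ["ACDA", "ACDG", "ACEA", "TCDA"]
alphabet = sorted(set("".join(msa)))
L = len(msa[0])

# Per-column frequencies with a small pseudocount to avoid zero probabilities
freqs = []
for i in range(L):
    col = [seq[i] for seq in msa]
    counts = {a: col.count(a) + 0.5 for a in alphabet}
    total = sum(counts.values())
    freqs.append({a: c / total for a, c in counts.items()})

def log_prob(seq):
    """Independent-sites score: log P(x1,...,xL) = sum_i log f_i(x_i)."""
    return sum(np.log(freqs[i][x]) for i, x in enumerate(seq))

print(log_prob("ACDA"))   # family-like sequence scores higher
print(log_prob("TTTT"))   # unrelated sequence scores much lower
```

This independent-sites model already answers "does the sequence look like the family?", but it throws away the inter-column correlations that the rest of the lecture exploits.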
What are these lectures about?
• We have discussed the nuts and bolts of Hidden Markov Models (HMMs), showing how these models are initialized from a database of sequences and how they can generate multiple sequence alignments (MSAs).
• We will now show how to extract information from the multiple sequence alignments generated with the HMMs. To do so, we will introduce a promising, increasingly used approach: Maximum Entropy modeling for studying coevolution.
Analyzing multiple sequence alignments: the maximum entropy approach
Can the multiple sequence alignment be used to extract information about the 3D structure of proteins?
Contacting residues co-evolve
A mutation is often accompanied by a compensatory one: can we exploit these correlations to infer residue-residue contacts from multiple sequence alignments? The problem is that correlations are transitive, so all the amino acids appear to be connected! To solve this problem we need to model the probability distribution (i.e., obtain a formula) and disentangle direct statistical couplings from indirect ones.
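A quick simulation makes the transitivity problem concrete. In this hypothetical three-site toy model (not from the lecture), site C is coupled only to B, and B only to A; yet A and C still come out strongly correlated, which is exactly why raw correlations overpredict contacts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Chain of couplings A -> B -> C: there is no direct A-C interaction
a = rng.integers(0, 2, n)                    # site A: random 0/1
b = np.where(rng.random(n) < 0.9, a, 1 - a)  # site B copies A 90% of the time
c = np.where(rng.random(n) < 0.9, b, 1 - b)  # site C copies B 90% of the time

# Pearson correlations: A-C is nonzero even though they never interact directly
print("corr(A,B):", np.corrcoef(a, b)[0, 1])  # ~0.8
print("corr(B,C):", np.corrcoef(b, c)[0, 1])  # ~0.8
print("corr(A,C):", np.corrcoef(a, c)[0, 1])  # ~0.64: a purely transitive effect
```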
A Mathematical Theory of Communication (1948), Claude Shannon
Here is the plan for MaxEnt
• We will introduce two important concepts from Information Theory: surprisal and entropy.
• We will review the main ideas behind constrained optimization with Lagrange multipliers.
• Finally, we will use these ingredients to build a probabilistic model for protein sequences and show that this model reveals crucial information about structure and structural dynamics.
MaxEnt modeling and coevolution analysis: further reading
• De Juan, D., Pazos, F. and Valencia, A. (2013). Emerging methods in protein co-evolution. Nature Reviews Genetics, 14(4), 249.
• Pressé, S., Ghosh, K., Lee, J. and Dill, K. A. (2013). Principles of maximum entropy and maximum caliber in statistical physics. Reviews of Modern Physics, 85(3), 1115.
• Cover, T. M. and Thomas, J. A. (2012). Elements of Information Theory. John Wiley & Sons.
Let’s brush up on probability…
• A probability is a number assigned to each subset of a sample space Ω (such subsets are called events), satisfying the following rules:
  • For any event A, 0 ≤ P(A) ≤ 1.
  • P(Ω) = 1.
  • If A1, A2, …, An is a partition of A, then P(A) = P(A1) + P(A2) + … + P(An).
  • (A1, A2, …, An is called a partition of A if A1 ∪ A2 ∪ … ∪ An = A and A1, A2, …, An are mutually exclusive.)
"Probability theory is nothing but common sense reduced to calculations" Laplace (1819)
[Venn diagram: events A and B occur with joint probability P(A∩B)]
Some simple concepts to keep in mind
Addition rule for mutually exclusive events: P(A or B) = P(A) + P(B)
(Mutually exclusive events: the occurrence of one event prevents the occurrence of the other.)
Some simple concepts to keep in mind
Generalized addition rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
(We dropped the requirement that the events be mutually exclusive.)
[Venn diagram: overlapping events A and B with intersection A∩B of probability P(A∩B)]
Some simple concepts to keep in mind
The "product" of two events: in the process of measurement, we observe both events.
Multiplication rule for independent events: P(A and B) = P(A) × P(B)
(Independent events: the outcome of one event is not affected by the outcome of the other.)
Conditional Probability
P(A|B) = P(A ∩ B) / P(B)
We are restricting the sample space to B (think of P(B) as a normalization factor). In words: what is the probability that A occurs given that B occurred?
Some simple concepts to keep in mind
Generalized multiplication rule: P(A ∩ B) = P(A|B) × P(B) = P(B|A) × P(A)
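All of these rules can be checked by brute-force enumeration. The sketch below uses an assumed two-dice example (not from the slides) to verify the generalized addition and multiplication rules exactly.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of rolling two fair dice
omega = list(product(range(1, 7), repeat=2))

def P(event):
    """Probability of an event (a set of outcomes) by counting."""
    return Fraction(sum(1 for o in omega if o in event), len(omega))

A = {o for o in omega if o[0] == 6}           # first die shows 6
B = {o for o in omega if o[0] + o[1] >= 10}   # sum is at least 10

# Generalized addition rule: P(A or B) = P(A) + P(B) - P(A and B)
assert P(A | B) == P(A) + P(B) - P(A & B)

# Generalized multiplication rule: P(A and B) = P(A|B) P(B)
P_A_given_B = P(A & B) / P(B)
assert P(A & B) == P_A_given_B * P(B)
print(P(A), P(B), P(A & B), P_A_given_B)   # 1/6  1/6  1/12  1/2
```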
Collections of probabilities are described by distribution functions
Expected value of a distribution
• The expected value is just the average or mean (µ) of the random variable X.
• It is how we expect X to behave on average over the long run (the "frequentist" view again).
• It is sometimes called a "weighted average" because more frequent values of X are weighted more heavily in the average.
Expected value, formally
Discrete case: E[X] = Σ_i x_i p(x_i)
Continuous case: E[X] = ∫ x p(x) dx
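A brief numerical check of both definitions, using toy distributions chosen purely for illustration (a loaded die and an exponential density):

```python
import numpy as np

# Discrete case: E[X] = sum_i x_i * p(x_i), e.g. a die loaded toward 6
x = np.arange(1, 7)
p = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
print("discrete E[X]:", np.sum(x * p))           # 4.5

# Continuous case: E[X] = integral of x * p(x) dx, e.g. exponential with rate 2
t = np.linspace(0, 50, 200_001)
pdf = 2.0 * np.exp(-2.0 * t)
dt = t[1] - t[0]
print("continuous E[X]:", np.sum(t * pdf) * dt)  # ~0.5 = 1/rate
```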
Done with the short review on probability… Let’s get back to our problem (information)…
A primer in Information Theory
• What is information? Information is transferred from an originating entity to a receiving entity (via a message).
• Note: if the receiving entity already knows the content of a message with certainty, the amount of information is zero.
• Flip of a coin: how much information do we receive when we are told that the outcome is heads?
  • If we already knew the answer, i.e., P(heads) = 1, the amount of information is zero!
  • If it's a fair coin, i.e., P(heads) = P(tails) = 0.5, we say that the amount of information is 1 bit.
  • If the coin is not fair, e.g., P(heads) = 0.9, the amount of information is more than zero but less than one bit!
  • Intuitively, the amount of information received is the same whether P(heads) = 0.9 or P(heads) = 0.1.
Self-information or Surprisal
Desired properties of a measure of information I(p):
1. I(p) increases as the probability p decreases (and vice versa).
2. I(p) ≥ 0 — information is non-negative.
3. I(1) = 0 — events that always occur communicate no information.
4. I(p(1&2)) = I(p1) + I(p2) — information due to independent events is additive.
Self-information or Surprisal
Let's look more closely at property 4. Since for independent events p(1&2) = p1 · p2, property 4 implies that I(p1 · p2) = I(p1) + I(p2). This functional equation suggests a unique functional form: the logarithm.
Self-information or Surprisal
I(p) = −log2(p)
This choice satisfies all four properties: it is anti-monotonic in p, non-negative, null if the event is certain, and additive for independent events.
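A small numerical check that I(p) = −log2(p) behaves as required; the probabilities below are arbitrary illustrative values.

```python
import math

def surprisal(p):
    """Self-information in bits: I(p) = -log2(p)."""
    return -math.log2(p)

print(surprisal(1.0))    # 0 bits: a certain event carries no information
print(surprisal(0.5))    # 1 bit: fair coin toss
print(surprisal(0.9))    # ~0.15 bits: likely outcome, little surprise
print(surprisal(0.1))    # ~3.3 bits: unlikely outcome, lots of surprise

# Additivity for independent events: I(p1 * p2) = I(p1) + I(p2)
p1, p2 = 0.5, 0.25
assert math.isclose(surprisal(p1 * p2), surprisal(p1) + surprisal(p2))
```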
Shannon entropy
The entropy is the average amount of information that we receive per event:
H = Σ_i p_i I(p_i) = −Σ_i p_i log2(p_i)
Shannon entropy
[Plot: entropy of the coin-flip experiment, H(p) = −p log2(p) − (1−p) log2(1−p), as a function of the probability p of getting "heads"; it peaks at 1 bit for p = 0.5.]
Entropy is maximum when my "prior knowledge" is minimum.
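A minimal sketch reproducing the curve in the figure numerically: scanning the binary entropy over p confirms that the maximum sits at p = 0.5.

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with the convention H(0) = H(1) = 0."""
    p = np.asarray(p, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
    return np.nan_to_num(h)  # maps the 0*log(0) terms to 0

ps = np.linspace(0, 1, 101)
h = binary_entropy(ps)
print("max entropy:", h.max(), "bits at p =", ps[np.argmax(h)])  # 1.0 bit at p = 0.5
```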
MaxEnt modeling
Why maximum entropy? Maximizing entropy = minimizing bias.
• Model all that is known and assume nothing about what is unknown.
• Model all that is known: satisfy the set of constraints that must hold.
• Assume nothing about what is unknown: choose the most "uniform" distribution, i.e., the one with maximum entropy.
MaxEnt modeling
"… the fact that a certain probability distribution maximizes entropy subject to certain constraints representing our incomplete information, is the fundamental property which justifies use of that distribution for inference; it agrees with everything that is known, but carefully avoids assuming anything that is not known. It is a transcription into mathematics of an ancient principle of wisdom …" (Jaynes, 1990)
[Quoted in: A Maximum Entropy Approach to Natural Language Processing, A. L. Berger, S. A. Della Pietra and V. J. Della Pietra, Computational Linguistics, Vol. 22, No. 1, 1996]
How do we model what we know?
• Empirical (observed) probability of x: p̃(x)
• Model (theoretical) probability of x: p(x)
• A function of x whose expected value is known: f(x)
• Observed expectation (empirical counts): E_p̃[f] = Σ_x p̃(x) f(x)
• Model expectation (theoretical prediction): E_p[f] = Σ_x p(x) f(x)
• We require the model to reproduce the observed statistics, i.e., we impose the constraint E_p[f] = E_p̃[f].
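A sketch of what "matching the observed statistics" means in practice, on an assumed toy data set: the empirical expectation of a feature f(x) is just a weighted count, and the MaxEnt model will be required to reproduce this number.

```python
import numpy as np

# Toy sample of observations of a discrete variable x in {1,...,6} (made-up data)
samples = np.array([1, 2, 2, 3, 3, 3, 6, 6])
values = np.arange(1, 7)

# Empirical (observed) probability p~(x) from frequency counts
p_emp = np.array([(samples == v).mean() for v in values])

# A feature function f(x); here simply f(x) = x
f = values.astype(float)

# Observed expectation E_p~[f] = sum_x p~(x) f(x); the model p(x) must match it
obs_expectation = np.sum(p_emp * f)
print("empirical p(x):", p_emp)
print("constraint value E[f]:", obs_expectation)  # 3.25
```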
Constrained optimization
[Figure: a surface with its free (unconstrained) maximum, and the constrained maximum lying on the constraint curve.]
Using Lagrange multipliers for MaxEnt
Maximize the Lagrangian
L(p) = H(p) + Σ_k λ_k (E_p[f_k] − E_p̃[f_k]) + μ (Σ_x p(x) − 1)
Setting ∂L/∂p(x) = 0 gives p(x) ∝ exp(Σ_k λ_k f_k(x)): the probability distribution turns out to be a Boltzmann distribution!
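A minimal numerical illustration of this result, using an assumed six-state variable and a single mean constraint (e.g. the E[f] = 3.25 from the previous sketch), solved by simple bisection on the one Lagrange multiplier: the MaxEnt solution indeed has the Boltzmann-like form p(x) ∝ exp(λx).

```python
import numpy as np

values = np.arange(1, 7, dtype=float)
target_mean = 3.25          # the constraint E_p[x]

def maxent_dist(lam):
    """Exponential-family form p(x) = exp(lam*x)/Z implied by the Lagrangian."""
    w = np.exp(lam * values)
    return w / w.sum()

def mean(lam):
    return np.sum(values * maxent_dist(lam))

# Bisection on the Lagrange multiplier: mean(lam) is monotonically increasing in lam
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean(mid) < target_mean:
        lo = mid
    else:
        hi = mid

lam = 0.5 * (lo + hi)
p = maxent_dist(lam)
print("lambda:", lam)
print("MaxEnt distribution:", p.round(4))
print("check mean:", np.sum(values * p))   # ~3.25, constraint satisfied
```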
MaxEnt for protein sequences
Assume a model that is as random as possible, but that agrees with some averages calculated on the data. In our case the univariate and bivariate marginals are constrained to reproduce the empirical frequency counts for single MSA columns and column pairs:
P_i(A) = f_i(A)  and  P_ij(A, B) = f_ij(A, B)
The model distribution then becomes
P(x1, …, xL) = (1/Z) exp[ Σ_i h_i(x_i) + Σ_{i<j} J_ij(x_i, x_j) ]
A unique set of Lagrange multipliers (the fields h_i and the couplings J_ij) will then satisfy all the constraints.
Weigt et al., PNAS 2008
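A heavily simplified sketch of the idea in code. This is not the message-passing scheme of Weigt et al.; it uses the later mean-field shortcut in which the direct couplings are approximated by minus the inverse of the correlation matrix, applied to a hypothetical toy MSA over a 3-letter alphabet.

```python
import numpy as np

# Hypothetical toy MSA (rows = sequences, columns = positions)
msa = np.array([list(s) for s in ["ABBA", "ABBA", "ACCA", "ABCA", "ACBA", "BBCA"]])
alphabet = "ABC"
M, L = msa.shape
q = len(alphabet)

# One-hot encode each column, dropping one state per column so the covariance stays invertible
x = np.zeros((M, L * (q - 1)))
for i in range(L):
    for a in range(q - 1):
        x[:, i * (q - 1) + a] = (msa[:, i] == alphabet[a])

# Empirical covariance C = f_ij(a,b) - f_i(a) f_j(b), lightly regularized (crude pseudocount)
C = np.cov(x, rowvar=False, bias=True) + 0.1 * np.eye(L * (q - 1))

# Mean-field (naive) approximation: direct couplings J ~ -C^{-1}
J = -np.linalg.inv(C)

# Coupling strength between column pairs: Frobenius norm of each (q-1)x(q-1) block of J
strength = np.zeros((L, L))
for i in range(L):
    for j in range(L):
        if i != j:
            block = J[i*(q-1):(i+1)*(q-1), j*(q-1):(j+1)*(q-1)]
            strength[i, j] = np.linalg.norm(block)
print(np.round(strength, 2))  # large entries flag candidate directly coupled column pairs
```

The key point mirrors the slide: the pairwise frequencies fix the constraints, and the model's couplings J_ij (here only approximated) are what disentangle direct from transitive statistical dependence.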
Contacting residues are statistically coupled
Amino acids interact in pairs (in a statistical sense)!
Webservers for structural prediction based on evolutionary coupling analysis
Application: ab initio structure prediction.
Predicting dynamics
Granata D., Ponzoni L., Micheletti C., Carnevale V., bioRxiv 109397; doi: https://doi.org/10.1101/109397
Take-home message
The maximum entropy approach (with constraints on single and joint frequencies) provides a model that is extremely useful for:
• Inferring tertiary and quaternary contacts in proteins and protein complexes; this approach is becoming a standard in structure prediction.
• Beyond structure: protein dynamics?