Sequence Entropy Sequence Analysis
Significance of Alignment Positions • An observed occurrence of amino acids at some position in an alignment that deviates from what is expected may indicate some (functional) significance • What does ‘deviates from expected’ mean? • unlikely occurrences • What is unlikely? • an observed result that can be obtained in only (relatively) few ways
Counting… • Number of possibilities W for finding given numbers N_x of amino acid types x: • Problem: evaluates to huge numbers even for modest numbers of sequences…
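The counting problem can be made concrete with a small sketch. The slide's own formula for W was not preserved, so this assumes W is the multinomial coefficient N! / ∏_x N_x! (the number of distinct orderings of the residues in one alignment column). The function name `count_arrangements` is hypothetical.

```python
from collections import Counter
from math import factorial


def count_arrangements(column: str) -> int:
    """Assumed definition: W = N! / prod_x(N_x!) for the residues in one column."""
    counts = Counter(column)          # N_x for every amino acid type x
    w = factorial(len(column))        # N!
    for n_x in counts.values():
        w //= factorial(n_x)          # exact integer division at every step
    return w


# Four sequences, two 'A' and two 'L': 4! / (2! * 2!) = 6 possibilities.
print(count_arrangements("AALL"))
# One hundred sequences, half 'A' and half 'L': already an enormous number.
print(count_arrangements("A" * 50 + "L" * 50))
```

Python integers have arbitrary precision, so the count stays exact, but its size grows so quickly that working with ln W is far more practical.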
Sequence Entropy • Entropy: • Using Stirling’s approximation: ln M! ≈ M (ln M − 1) for M ≫ 1
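A worked derivation shows how Stirling's approximation turns the count into an entropy. The slide's formulas were not preserved, so this is a sketch that assumes the entropy is taken as ln W, with W the multinomial count for N sequences and N_x occurrences of amino acid type x; the symbol p_x = N_x / N is introduced here for the observed frequency.

```latex
\begin{align}
W &= \frac{N!}{\prod_x N_x!}, \qquad N = \sum_x N_x \\
\ln W &= \ln N! - \sum_x \ln N_x! \\
      &\approx N(\ln N - 1) - \sum_x N_x(\ln N_x - 1) \\
      &= N \ln N - \sum_x N_x \ln N_x \\
      &= -\sum_x N_x \ln \frac{N_x}{N}
       = -N \sum_x p_x \ln p_x, \qquad p_x = \frac{N_x}{N}
\end{align}
```

The terms −N and +Σ_x N_x cancel in the third line because the N_x sum to N, which leaves the familiar −Σ p ln p form per sequence.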
Shannon’s ‘Information Entropy’: • ‘A Mathematical Theory of Communication’, The Bell System Technical Journal, Vol. 27, 1948. “ Can we define a quantity which will measure, in some sense, how much information is ‘produced’ by such a process, or better, at what rate information is produced? ” • He was thinking about the Transmission of Information, i.e., from a Source through some Channel to a Destination.
Choice, Uncertainty and Entropy • A set of ‘events’ with probabilities p1, p2, …, pn • Is there a measure that indicates how much ‘choice’ is possible, given those probabilities? • If there is, it should be: • continuous for all pi • monotonic in n if all probabilities are equal • additive for ‘sub-events’
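Additivity: H(1/2, 1/3, 1/6) = H(1/2, 1/2) + 1/2 H(2/3, 1/3) • [Figure: a three-way choice with probabilities 1/2, 1/3, 1/6, decomposed into a first choice with probabilities 1/2, 1/2 followed, in one branch, by a second choice with probabilities 2/3, 1/3]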
Solution: Entropy • the entropy of a set of probabilities pi • measures information, choice and uncertainty • zero only if only one pi is not zero • there is only one choice • maximal if all pi are equal • most ‘uncertain’ situation: all options are possible
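The listed properties can be checked numerically. This is a minimal sketch assuming Shannon's formula H = −Σ p_i log2 p_i (entropy in bits; the slide's own equation was not preserved); the function name `shannon_entropy` is hypothetical.

```python
from math import isclose, log2


def shannon_entropy(probs):
    """H = sum(p * log2(1/p)) in bits; outcomes with p == 0 contribute nothing."""
    return sum(p * log2(1 / p) for p in probs if p > 0)


# Zero when only one outcome is possible.
h_single = shannon_entropy([1.0, 0.0, 0.0])
# Maximal when all outcomes are equally likely (log2(4) = 2 bits here).
h_uniform = shannon_entropy([0.25] * 4)
# Additivity: splitting one branch into sub-events adds the weighted sub-entropy.
lhs = shannon_entropy([1 / 2, 1 / 3, 1 / 6])
rhs = shannon_entropy([1 / 2, 1 / 2]) + 0.5 * shannon_entropy([2 / 3, 1 / 3])
print(h_single, h_uniform, isclose(lhs, rhs))
```

Writing the sum as p·log2(1/p) is equivalent to −p·log2(p) and avoids a negative zero for the fully conserved case.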
Information Content • Shannon was thinking about the Transmission of Information, i.e., from a Source through some Channel to a Destination. • …but it applies equally well to any type of ‘message’ • We can use it to measure the level of conservation in columns in an alignment
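Simple Example: Sequence Entropy • [Figure: entropy of an alignment column containing only ‘L’ and ‘A’, with p1 = f(‘L’) and p2 = f(‘A’); marked points: p1 = p2 = ½, p1 = 0, and p2 = 0]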
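The two-letter example can be reproduced for a single alignment column. This is a sketch only: the function name `column_entropy` is hypothetical, frequencies are taken directly from the observed residues, and gap characters are not treated specially.

```python
from collections import Counter
from math import log2


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of the residue frequencies in one alignment column."""
    n = len(column)
    # c / n is the observed frequency f(x); log2(n / c) equals -log2(f(x)).
    return sum((c / n) * log2(n / c) for c in Counter(column).values())


print(column_entropy("AAAA"))  # fully conserved column: p2 = f('A') = 1
print(column_entropy("AALL"))  # p1 = p2 = 1/2: the most uncertain two-letter case
print(column_entropy("AAAL"))  # in between
```

A fully conserved column has zero entropy, and the entropy of a two-letter column peaks at one bit when both letters are equally frequent.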
Conditional Probability / Prior Knowledge • Suppose we observe two things (A and B), and we suspect a relation (A causes B) • The fundamental question is then: How likely is A when we know B? • (n.b., this is Bayesian statistics…) • Or, what is the uncertainty (entropy) of A knowing B?
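Conditional Entropy • Entropy of joint occurrence of value i for event x and value j for event y: $H(x,y) = -\sum_{i,j} p(i,j) \log p(i,j)$ • and $H(x) = -\sum_{i,j} p(i,j) \log \sum_j p(i,j)$, $H(y) = -\sum_{i,j} p(i,j) \log \sum_i p(i,j)$ • so that $H(x,y) \le H(x) + H(y)$ • i.e., the entropy of a joint event is less than or equal to the sum of the individual entropies • it is equal only if the events are independent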
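The inequality between the joint entropy and the sum of the individual entropies can be checked on small joint probability tables. This sketch assumes the standard definitions, with the marginals obtained by summing the joint table over rows or columns; the function names and the two example tables are hypothetical.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability entries are skipped."""
    return sum(p * log2(1 / p) for p in probs if p > 0)


def joint_and_marginal_entropies(joint):
    """joint[i][j] = p(i, j). Returns (H(x,y), H(x), H(y))."""
    p_x = [sum(row) for row in joint]        # marginal over j
    p_y = [sum(col) for col in zip(*joint)]  # marginal over i
    h_xy = entropy(p for row in joint for p in row)
    return h_xy, entropy(p_x), entropy(p_y)


independent = [[0.25, 0.25], [0.25, 0.25]]  # p(i,j) = p(i) * p(j)
coupled = [[0.5, 0.0], [0.0, 0.5]]          # x fully determines y

for name, table in (("independent", independent), ("coupled", coupled)):
    h_xy, h_x, h_y = joint_and_marginal_entropies(table)
    print(name, h_xy, h_x + h_y)
```

For the independent table the joint entropy equals the sum; for the coupled table it is strictly smaller, because knowing x removes all uncertainty about y.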
Co-occurrence in practice • Measure of ‘mutual information’, by relative entropy: • more often written like: • i.e., what is the entropy in x, given y
Relative Entropy in Sequence Analysis • Many biological problems relate to questions like: “Why do these proteins do this, and those proteins not?” • or “Why do these patients get sick, and those not?” • The answer can be related to similarities and differences between sequences • Similarities (conservation) relate to functionally critical positions • Differences can explain functional differences
Comparing groups of Sequences • For each position i in an alignment, we calculate the relative entropy for group A vs. B from the frequencies p (observed probabilities) of all amino acid types x, as follows: $\sum_x p^A_{i,x} \log (p^A_{i,x} / p^B_{i,x})$
Simple Example: Relative Entropy ‘A/B’ • [Figure: vertical axis Entropy / Harmony, 0.0 to 3.0; groups A and B; plotted quantities $\sum_x p^A_{i,x} \log (p^A_{i,x} / p^B_{i,x})$ and $\sum_x p^B_{i,x} \log (p^B_{i,x} / p^A_{i,x})$]
Relative Entropy • Captures similarities and differences • Is infinite for completely dissimilar positions • e.g., A vs. L • Not symmetrical: • Maybe not easy for selecting dissimilar positions
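Both properties, the infinite value and the asymmetry, show up in a short sketch of the relative entropy Σ_x p^A log2(p^A / p^B) for one alignment position. The function names are hypothetical, base-2 logarithms are an assumption, and each group is given as a string of the residues observed at that position.

```python
from collections import Counter
from math import inf, log2


def frequencies(column: str) -> dict:
    """Observed frequency of each amino acid type in one group's column."""
    n = len(column)
    return {aa: c / n for aa, c in Counter(column).items()}


def relative_entropy(col_a: str, col_b: str) -> float:
    """sum_x pA_x * log2(pA_x / pB_x); infinite if A has a type that B lacks."""
    p_a, p_b = frequencies(col_a), frequencies(col_b)
    total = 0.0
    for aa, pa in p_a.items():
        pb = p_b.get(aa, 0.0)
        if pb == 0.0:
            return inf  # division by zero in the ratio: completely unexpected type
        total += pa * log2(pa / pb)
    return total


print(relative_entropy("AAAA", "LLLL"))  # completely dissimilar, e.g. A vs. L
print(relative_entropy("AAAL", "AALL"))  # A vs. B
print(relative_entropy("AALL", "AAAL"))  # B vs. A gives a different value
```

Types that occur only in B contribute nothing to the A vs. B direction, which is one source of the asymmetry.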
Simple Example: Relative Entropy ‘A/AB’ • [Figure: vertical axis Entropy / Harmony, 0.0 to 3.0; groups A and B; plotted quantities $\sum_x p^A_{i,x} \log (p^A_{i,x} / p^{AB}_{i,x})$ and $\sum_x p^B_{i,x} \log (p^B_{i,x} / p^{AB}_{i,x})$]
Relative Entropy (A/AB) • Also captures similarities and differences • Is no longer infinite for completely dissimilar positions • Only symmetrical for equal size groups • in practice, not symmetrical • Maybe still not too easy for selecting dissimilar positions
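A sketch of the A/AB variant shows why it stays finite. It assumes that p^AB is the frequency in the pooled sequences of both groups, which is one reading of the slide and makes the result depend on group size; the function names are hypothetical.

```python
from collections import Counter
from math import log2


def frequencies(column: str) -> dict:
    """Observed frequency of each amino acid type in a column."""
    n = len(column)
    return {aa: c / n for aa, c in Counter(column).items()}


def relative_entropy_vs_pooled(col_a: str, col_b: str) -> float:
    """sum_x pA_x * log2(pA_x / pAB_x), with pAB taken from the pooled groups."""
    p_a = frequencies(col_a)
    p_ab = frequencies(col_a + col_b)  # assumption: pool all sequences of A and B
    # Every type in A also occurs in the pool, so the ratio is always defined.
    return sum(pa * log2(pa / p_ab[aa]) for aa, pa in p_a.items())


# Completely dissimilar, equal group sizes: finite and the same in both directions.
print(relative_entropy_vs_pooled("AAAA", "LLLL"), relative_entropy_vs_pooled("LLLL", "AAAA"))
# Completely dissimilar, unequal group sizes: the two directions differ.
print(relative_entropy_vs_pooled("AA", "LLLLLL"), relative_entropy_vs_pooled("LLLLLL", "AA"))
```

With unequal groups the larger group dominates the pooled frequencies, so the smaller group looks more surprising against the pool than the larger one does.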
Measuring Overlapping Distributions • Weigh both groups equally; take p^A + p^B instead of p^AB: • Fixed interval [0,1], but not completely symmetrical
Entropy vs. Sequence Harmony: Example • [Figure: vertical axis Entropy / Harmony, 0.0 to 3.0; groups A and B; plotted quantities $\sum_x p^A_{i,x} \log (p^A_{i,x} / (p^A_{i,x} + p^B_{i,x}))$ and $\sum_x p^B_{i,x} \log (p^B_{i,x} / (p^A_{i,x} + p^B_{i,x}))$]
Sequence Harmony • Introduce symmetry by averaging: • May seem a trivial choice, but: Pirovano, Feenstra & Heringa. “Sequence Comparison by Sequence Harmony Identifies Subtype Specific Functional Sites”, Nucleic Acids Res., in press (2006).
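The averaging step can be sketched as follows. The exact formula on the slide was not preserved, so this is an assumption-laden illustration: each directional term is taken as Σ_x p^A log2((p^A + p^B) / p^A), i.e. the overlap measure with p^A + p^B in the denominator and the sign chosen so that values fall in [0,1], and the two directions are averaged with equal weight. The function names are hypothetical; the published definition should be checked in the cited paper.

```python
from collections import Counter
from math import log2


def frequencies(column: str) -> dict:
    """Observed frequency of each amino acid type in one group's column."""
    n = len(column)
    return {aa: c / n for aa, c in Counter(column).items()}


def directional_overlap(p_a: dict, p_b: dict) -> float:
    """Assumed form: sum_x pA_x * log2((pA_x + pB_x) / pA_x), in [0, 1]."""
    return sum(pa * log2((pa + p_b.get(aa, 0.0)) / pa) for aa, pa in p_a.items())


def sequence_harmony(col_a: str, col_b: str) -> float:
    """Average of the A-vs-B and B-vs-A terms, symmetric by construction."""
    p_a, p_b = frequencies(col_a), frequencies(col_b)
    return 0.5 * (directional_overlap(p_a, p_b) + directional_overlap(p_b, p_a))


print(sequence_harmony("AAAA", "LLLL"))  # no overlap between the groups
print(sequence_harmony("AAAA", "AAAA"))  # identical composition
print(sequence_harmony("AAAL", "AALL"))  # partial overlap
```

Under this sign convention a low value marks a position whose composition differs between the two groups, and the group sizes do not matter because each group enters only through its own frequencies.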
Entropy vs. Sequence Harmony: Example • [Figure: vertical axis Entropy / Harmony, 0.0 to 3.0; groups A and B]
Analyzing multiple groups • Relative Entropy and Sequence Harmony are defined to compare two groups (N=2) • problem for multiple groups (N>2) • A solution: Entropy-variability plots • variability is the number of different amino acid types in a certain alignment position • problem: variability always tends towards the maximum (20) for larger numbers of sequences • Another solution: Two-entropy plots • Total family entropy vs. sum of sub-family entropy
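The quantities behind both plot types can be sketched for one alignment position. This is only an illustration: the function names are hypothetical, the sub-family entropies are summed without weighting or normalization (an assumption; the cited two-entropies paper may define the sum differently), and the three sub-families in the example are made up.

```python
from collections import Counter
from math import log2


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of the residue frequencies in one column."""
    n = len(column)
    return sum((c / n) * log2(n / c) for c in Counter(column).values())


def variability(column: str) -> int:
    """Number of different amino acid types observed at this position."""
    return len(set(column))


def two_entropies(subfamily_columns: dict) -> tuple:
    """Returns (total family entropy, unweighted sum of sub-family entropies)."""
    pooled = "".join(subfamily_columns.values())
    total = column_entropy(pooled)
    per_subfamily = sum(column_entropy(col) for col in subfamily_columns.values())
    return total, per_subfamily


# Conserved within each sub-family but different between them.
specific = {"sub1": "AAAA", "sub2": "LLLL", "sub3": "VVVV"}
# Equally variable everywhere.
mixed = {"sub1": "ALVA", "sub2": "LVAL", "sub3": "VALV"}

print(two_entropies(specific), variability("".join(specific.values())))
print(two_entropies(mixed), variability("".join(mixed.values())))
```

Both positions have the same variability and the same total entropy, yet the sub-family sum separates them: it is zero for the sub-family specific position and large for the uniformly variable one.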
Two-Entropies analysis of GPCRs • Goal: to identify the function of individual positions Ye, Lameijer, Beukers & IJzerman. “A Two-Entropies Analysis to Identify Functional Positions in the Transmembrane Region of Class A G Protein-Coupled Receptors”, Proteins 63, pp1018–1030 (2006)
G-protein Coupled Receptors (GPCRs) • Huge family of integral cell-membrane proteins • crucial in signal transduction • 70 subfamilies • 1935 Class A GPCRs • Three main regions: • extracellular side • transmembrane (TM) • cytoplasmic side
Two-entropies vs. Entropy-variability: upper vs. lower domain • Red: upper domain • Blue: lower domain
Two-entropies vs. Entropy-variability: relative solvent accessibility (RSA) • Red: RSA < 15% • Blue: RSA > 15%
Two-entropies vs. Entropy-variability: ligand binding site vs. other positions • Red: <4 Å from retinal • Blue: >4 Å from retinal • Plot regions labeled: Subfamily specific binding; Common activation mechanism
GPCR Two-entropies analysis Red: ligand binding Green: coupling & activation Blue: others
GPCR Functional Site Prediction: Method Comparison Finding known sites for ligand binding in Bovine Rhodopsin (left) and Aminergic Receptors (right)
Smad-MH2 Alignment & Sequence Harmony • [Figure: plot along alignment positions 260 to 460; vertical axis from 0 to 1] • Walter Pirovano*, K. Anton Feenstra* and Jaap Heringa. “Sequence Comparison by Sequence Harmony Identifies Subtype Specific Functional Sites”, Nucleic Acids Res., in press (2006). • K. Anton Feenstra, Walter Pirovano and Jaap Heringa. “Sub-type Specific Sites for SMAD Receptor Binding Identified by Sequence Comparison using ‘Sequence Harmony’”, in: From Computational Biophysics to Systems Biology, pp. 73-78. Eds. U.H.E. Hansmann, J. Meinke, S. Mohanty and O. Zimmermann, Jülich, NIC Series, Vol. 34, 2006. • Elena Marchiori*, Walter Pirovano, Jaap Heringa and K. Anton Feenstra*. “A Feature Selection Algorithm for Detecting Subtype Specific Sites for Smad Receptor Binding”, Bio-ICMLA06, accepted (2006).
Smads: Comparing two Groups • [Figure: alignment of the AR and BR groups, positions 262 to 310]
Smad-MH2 Alignment & Functionally Specific Sites • 29 known sites of functional specificity • based mostly on site-specific mutants and characterized on affinity for binding to BMPR-I vs. TBR-I receptor types
Finding Low-harmony sites in Smad-MH2 • [Figure: alignment positions 270 to 300]
Finding Low-harmony sites in Smad-MH2 • Pirovano, Feenstra & Heringa. “Sequence Comparison by Sequence Harmony Identifies Subtype Specific Functional Sites”, Nucleic Acids Res., in press (2006). www.few.vu.nl/~feenstra/articles/NAR 2006 Sequence Harmony.pdf
Smad-MH2: Functional Clusters • [Figure: functional clusters of labeled residues in Smad-MH2] • Residue labels: R427, A323, M327, T430, V325, Q284, A354, R410, V461, W368, P378, R462, C463, P360, Y366, Q407, S460, R334, Q400, Q364, N381, R365, L440, T298, R337, F346, A392, L297, Q309, N443, I341, S308, P295, F273, Q294, A272, S269, T267 • Cluster annotations: TbR-I/BMPR-I/ALK1/2; receptor-binding; TbR-I/ALK1/2; TbR-I/BMPR-I; FAST1, Mixer, SARA; co-repressors; SARA/Mixer; retention & transcription factors; c-Ski/SnoN; SARA; two clusters marked ‘?’
Conclusions Smad-MH2 • 40 Sites of Low Sequence Harmony in Smad-MH2 • different between the AR (TGF-β) and BR (BMP) sub-type Smads • Low Harmony sites in Smad-MH2 are functionally relevant • Other methods cannot select all known sites! • Functional Sites are Interaction Surfaces on Protein Surface: • Next: Analyze Interaction Partners in the Pathway • 14 Low Harmony Sites in Smad-MH2 of unknown function • 11 putative functions from structural considerations • promising candidates that determine TGF-β/BMP specificity • confirm (or refute) putative functions?
Sequence Harmony Webserver http://www.ibi.vu.nl/programs/seqharmwww1-b/