CISC 841 Bioinformatics (Fall 2007) Hidden Markov Models

CISC 841 Bioinformatics(Fall 2007)Hidden Markov Models Model comparison CISC841, F07, Liao

How to tell if two HMMs are equivalent? • If not equivalent, how (dis-)similar are they? Remember: HMMs are generative Given a sequence x, P(x|M) is the probability that x can be generated from the model M. How to compare two probability distribution? Mutual entropy H(M, M’) =  xP(x|M) log [P(x|M)/P(x|M’)] CISC841, F07, Liao

Mutual entropy: H(p|q)  0 (why?) H(p|q) = 0 iff p = q Complexity of comparing HMMs - It is proved to be NP-hard. (Lyngso and Pedersen, LNCS, 2001, 2223:416-428.) CISC841, F07, Liao

D D D I I I I Start M M M End Hidden Markov Model Observed emission/transition counts node position 0 1 2 3 ------------------ A – 4 0 0 C – 0 0 4 G – 0 3 0 T – 0 0 0 ------------------ A 0 0 6 0 C 0 0 0 0 G 0 0 1 0 T 0 0 0 0 ------------------ MM 4 3 2 4 MD 1 1 0 0 MI 0 0 1 0 IM 0 0 2 0 ID 0 0 1 0 II 0 0 4 0 DM – 0 0 1 DD – 1 0 0 DI – 0 2 0 X X . . . X bat A G – – – C rat A – A G –C cat A G – A A– gnat – – A A AC goat A G – – – C 1 2 3 CISC841, F07, Liao 0 1 2 3

query V G A H - A G E Y hit A G A H D - G E F Seq dbase query V G A - - H A G E Y A G A - - H D G E F V - - - - N V D E F C K A - - D V A G H V K G - - - - - - F V L S - - T I E T S D N K - - T I A K H I A G A D T G A G V query A G A - - H D G E F V - - - - N V D E F C K A - - D V A G H V K G - - - - - - F V L S - - T I E T S D N K - - T I A K H I A G A D T G A G V hit hit V G A - - H A G E Y Prof dbase Seq dbase V G A - - H A G E Y V - - - - N V D E V V E A - - D V A G H V K G - - - - - - D V Y S - - T Y E T S F N A - - N I P K H I A G A D N G A G V query A G A - - H D G E F V - - - - N V D E F C K A - - D V A G H V K G - - - - - - F V L S - - T I E T S D N K - - T I A K H I A G A D T G A G V hit Prof dbase Comparison levels for homology detection • Sequence-to-sequence (pair wise) - for proteins with relatively high sequence identity - dynamic programming methods • Sequence-to-profile - for distant relationships and improved alignment accuracy - PSI-BLAST, HMMER, SAM • Profile-to-profile - for more sensitivity and accuracy of alignment - COMPASS, Prof_sim CISC841, F07, Liao

Performance quantifiers query hit e-valrelationship tpfp HBA_HUMANHBB_HUMAN 3.87e-60 +1 1 0 MYG_PHYCA 5.02e-23 +1 2 0 GLB3_CHITP 5.60e-4 +1 3 0 GLB5_PETMA 1.43e-1 -1 3 1 GLB2_LUPLU 1.56e+1 +1 4 1 GLB1_GLYDI 1.45e+3 -1 4 2 tp : true positive count fp : false positive count relationship is +1 if query and hit sequences are related at super family level Sequence based alignment Structure based alignment V G A H - A G E Y G A H A G E A GAH D - GEF G A H D G E sidQmQdQc 5/6 5/7 5/6 5/8 Modeler’s accuracy metric (Qm) = Nc/Nseq Developer’s accuracy metric (Qd) = Nc/Nstr Combined metric (Qc) = Nc / (Nseq + Nstr – Nc) where Nc = number of aligned pairs common to both alignments Nseq = number of aligned pairs in the sequence based alignemnt Nstr = number of alinged pairs in the structure based alignment • Ability to detect distant relationships - sensitivity - specificity • Accuracy of alignment prediction • (when compared to corresponding structure based alignment) CISC841, F07, Liao

On profile-profile comparisons L2 L1 A G A - - H D G E F V - - - - N V D E F C K A - - D V A G H V K G - - - - - - F V L S - - T I E T S D N K - - T I A K H I A G A D T G A G V V G A - - H A G E Y V - - - - N V D E V V E A - - D V A G H V K G - - - - - - D V Y S - - T Y E T S F N A - - N I P K H I A G A D N G A G V 1 . . 20 Numeric profiles L2 Subs matrix L1 V G A - H A G E Y V - - - N V D E V V E A - D V A G H V K G - - - - - D V Y S - T Y E T S F N A - N I P K H I A G - N G A G V A G A H D - G E F V - - N V - D E F C K A D V - A G H V K G - - - - - F V L S T I - E T S D N K T I - A K H I A G T G - A G V • From MSA to numeric profiles - sampling - dropping columns • Alignment of numeric profiles • - scoring functions - dynamic programming alignment • Example: COMPASS • (Sadreyev et. al. J. Mol. Biol. (2003) 326, pp. 317–336. ) CISC841, F07, Liao

Quasi consensus based comparison of HMMs V G A - - H A G E Y V - - - - N V D E V V E A - - D V A G H V K G - - - - - - D V Y S - - T Y E T S F N A - - N I P K H I A G A D N G A G V A G A - - H D G E F V - - - - N V D E F C K A - - D V A G H V K G - - - - - - F V L S - - T I E T S D N K - - T I A K H I A G A D T G A G V M1 M2 Consensus 1 Consensus 2 V G A N V A E H V K A T I A E H V G A - - N V A E H V K A - - T I A E H V G A - - H A G E Y V - - - - N V D E V V E A - - D V A G H V K G - - - - - - D V Y S - - T Y E T S F N A - - N I P K H I A G A D N G A G V A G A - - H D G E F V - - - - N V D E F C K A - - D V A G H V K G - - - - - - F V L S - - T I E T S D N K - - T I A K H I A G A D T G A G V S(c2|M1) S(c1|M2) V - K A - T I A E H V - G A N - V A E H Seed 1 V G A - - H A G E Y V - K A - T I A E H A - G A - H D G E F A G A - - H D G E F V - G A N - V A E H V - G A H - A G E Y Seed 2 Consensus2 Consensus 1 Seed 2 Seed 1 A - G A - H D G E F V G A - - H A G E Y A G A - - H D G E F V - G A H - A G E Y Aln21 Aln12 • Build profile HMMs using existing packages (SAM-T99 or HMMER) • Generation of quasi consensus • sequence from the model • Alignment of consensus sequence • of a model with another model • Extraction of two alignments in • each direction CISC841, F07, Liao

Benchmark experiment I : Detection ability • All-vs-all comparisons of 569 MSAs from (Wang and Dunbrack, 2004) using COMPASS and QC-COMP. Two MSAs are said to be related if their seed sequences are from the same SCOP superfamily. • In all-vs-all comparisons using QC-COMP, the ith HMM is used to score consensus sequences from the remaining 568 HMMs and • the resulting scores are transformed into z-scores zi(ck) = [si(ck) - <s>]/ • Mi = { zi (c1), zi (c2), . . . zi (ci-1), zi (ci+1), . . ., zi (c569) } • Mj = { zj (c1), zj (c2), . . . zj (cj-1), zj (cj+1), . . ., zj (c569) } • dij = Mi .ej = zi (cj) asymmetric similarity measure between Mi and Mj • dij = Mi .ej + Mj .ei = zi (cj) + zj (ci) symmetric similarity measure between Mi and Mj • Same experiment is repeated using seed sequences instead of • consensus sequences • For COMPASS, the ith profile is compared with the remaining 568 profiles and the scores are transformed into z-scores. The same similarity measures are used. We also consider E-values measures. CISC841, F07, Liao

Results for detection ability experiment ROC values CISC841, F07, Liao

V G A - H A G E Y V - - - N V D E V V E A - D V A G H V K G - - - - - D V Y S - T Y E T S F N A - N I P K H I A G - N G A G V V G A - - N V A E H V K A - - T I A E H V G A - - H A G E Y V - - - - N V D E V V E A - - D V A G H V K G - - - - - - D V Y S - - T Y E T S F N A - - N I P K H I A G A D N G A G V A G A - - H D G E F V - - - - N V D E F C K A - - D V A G H V K G - - - - - - F V L S - - T I E T S D N K - - T I A K H I A G A D T G A G V A G A H D - G E F V - - N V - D E F C K A D V - A G H V K G - - - - - F V L S T I - E T S D N K T I - A K H I A G T G - A G V S(c2|M1) S(c1|M2) A G A H D - G E F V G A - H A G E Y COMPASS Extracted alignment Accuracy parameters Qm, Qd and Qc Extraction schemes - MAX - AND - AND1 - AND2 - AND3 A - G A - H D G E F V G A - - H A G E Y A G A - - H D G E F V - G A H - A G E Y Aln21 Aln12 Benchmark experiment II : Alignment accuracy • 2305 pairs of MSAs from (Wang and Dunbrack, 2004) were aligned using COMPASS and QC-COMP. • Same experiment is repeated using seed sequences instead of consensus sequences RegionIdentity range#Pairs 1 0.00 - 0.05 58 2 0.05 - 0.10 522 3 0.10 - 0.15 598 4 0.15 - 0.20 382 5 0.20 - 0.25 258 6 0.25 - 0.30 217 7 0.30 - 0.35 162 8 0.35 - 0.40 108 CISC841, F07, Liao

Results for alignment accuracy experiment Consensus based CISC841, F07, Liao

Results for alignment accuracy experiment Seed based CISC841, F07, Liao

Mix Results for alignment accuracy experiment Mixing scheme: if the symmetric similarity measure between a pair of HMMs is less than –22.0, seed-based alignment is taken. Otherwise, consensus-based alignment is chosen. The threshold –22.0 was determined using a separate training set (1136 pairs of HMMs). CISC841, F07, Liao

CISC841, F07, Liao

CISC 841 Bioinformatics (Fall 2007) Hidden Markov Models

CISC 841 Bioinformatics (Fall 2007) Hidden Markov Models

Presentation Transcript

Basics of Protein Bioinformatics and Structural Bioinformatics

CISC 101: Fall 2011

CISC 101: Fall 2011

CISC 101: Fall 2011

Seminar in Bioinformatics (236818)

CISC 667 Intro to Bioinformatics (Fall 2005) Phylogenetic Trees (I)

CISC 101: Fall 2011

CISC 101: Fall 2011

CISC 467/667 Intro to Bioinformatics (Fall 2005) Protein Structure Prediction

CISC 667 Intro to Bioinformatics (Spring 2007) Whole genome sequencing

CISC 667 Intro to Bioinformatics (Spring 2007) Sequence pairwise alignment

CISC 667 Intro to Bioinformatics (Spring 2007) Genetic networks and gene expression data

CISC 841 Bio Informatics

CISC 467/667 Intro to Bioinformatics (Fall 2005) Gene Prediction and Regulation

CISC 841 - BIOINFORMATICS

CISC 667 Intro to Bioinformatics (Fall 2005) Pairwise sequence alignment

CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search

CISC 467/667 Intro to Bioinformatics (Spring 2007) Protein Structure Prediction

CISC 667 Intro to Bioinformatics (Spring 2007) Phylogenetic Trees (III)

CISC 841 Bioinformatics (Fall 2008) Review Session

CISC 841 Bioinformatics Combining HMMs with SVMs

CISC 841 - BIOINFORMATICS