Predicting co-evolving pairs in Pfam using information theory where entropy is determined by phylogenetic mutation events • Scooter Willis • University of Florida, Computer and Information Science and Engineering
Homology modeling • Proteins grouped by function will share similar structures • Pfam is a large collection of protein sequences grouped into families by Hidden Markov Models • Pfam 19.0 (December 2005) contains 8,183 protein families, of which 2,765 have one or more solved PDB structures
Pfam5000 • “Implications of Structural Genomics Target Selection Strategies: Pfam5000, Whole Genome, and Random Approaches”, John-Marc Chandonia and Steven E. Brenner, PROTEINS: Structure, Function, Bioinformatics (2005) • NIH is supporting structural genomics projects at 9 pilot centers through the Protein Structure Initiative • Funding is $300 million over the next five years
Co-evolving pairs • A co-evolving pair is defined as two amino acids more than 10 sequence positions apart but within 12 angstroms of each other in 3D space • Apply information theory to protein families to detect co-evolving pairs, which indicate the tertiary placement of secondary structures • Actively researched topic with numerous publications in the last 5 years • It is accepted that the information is present, but it is difficult to separate from background noise
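The pair definition above can be written as a small check. This is a minimal sketch, not the author's code: the function name, the constant names, and the use of one 3D coordinate per residue are assumptions, and the 12 angstrom cutoff is treated as a strict "less than" to match the Results slide.

```python
import math

MIN_SEQUENCE_SEPARATION = 10    # pair must be MORE than 10 positions apart
MAX_DISTANCE_ANGSTROMS = 12.0   # and LESS than 12 angstroms apart in 3D


def meets_pair_definition(pos_a, pos_b, coord_a, coord_b):
    """Return True if two residues satisfy the slide's co-evolving pair criteria.

    pos_a, pos_b     -- sequence positions (integers)
    coord_a, coord_b -- (x, y, z) coordinates in angstroms (assumed: one
                        representative atom per residue)
    """
    separation = abs(pos_a - pos_b)
    distance = math.dist(coord_a, coord_b)
    return separation > MIN_SEQUENCE_SEPARATION and distance < MAX_DISTANCE_ANGSTROMS
```

For example, residues 5 and 40 that sit 5 angstroms apart satisfy the definition, while residues 5 and 12 at the same distance do not, because they are only 7 positions apart.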
Information Theory Approach • The entropy H(X), where X is a discrete random variable and p(x) is its probability function, measures the randomness or uncertainty in a signal and is calculated with the following formula: $H(X) = -\sum_{x} p(x) \log p(x)$ (in bits when the logarithm is base 2).
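As an illustration of the entropy measure, the sketch below computes Shannon entropy for one alignment column, estimating p(x) from plain symbol frequencies. The function name and the choice of base-2 logarithm are assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of a sequence of symbols.

    p(x) is estimated as the relative frequency of each symbol.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A fully conserved column such as "AAAA" has zero entropy, and a column split evenly between two residues, such as "AACC", has an entropy of one bit.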
Mutual Information Venn Diagram (figure: two overlapping circles H(X) and H(Y); regions H(X|Y), MI(X,Y), H(Y|X); union H(X,Y))
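The Venn diagram corresponds to the identity MI(X,Y) = H(X) + H(Y) - H(X,Y). A minimal sketch for two alignment columns follows; the function names are illustrative, probabilities are plain frequencies, and logarithms are base 2 by assumption.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy in bits, with probabilities taken from frequencies."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mutual_information(column_x, column_y):
    """MI(X,Y) = H(X) + H(Y) - H(X,Y) for two equally long alignment columns."""
    if len(column_x) != len(column_y):
        raise ValueError("columns must have the same number of sequences")
    joint = list(zip(column_x, column_y))  # paired residues, one per sequence
    return entropy(column_x) + entropy(column_y) - entropy(joint)
```

Two columns that always change together ("AACC" and "DDEE") share one bit of information, while two columns that vary independently ("AACC" and "DEDE") share none.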
Sampling • The difficulty of applying statistical methods to genetic sequence data sets is that they tend not to be random samples, and the extent of the entire population is unknown • The bias towards protein sequences that have medical research value, together with the corresponding phylogenetic influences, introduces noise or an indistinguishable background signal that decreases the quality of statistical measures • The primary impact is on accurately measuring probability, because the sample statistics do not reflect the population’s statistics
Phylogenetic Effect • Reducing the impact of the phylogenetic effect when calculating probability will improve the quality of the information and the ability to detect co-evolving pairs
Phylogenetic Tree (figure: tree with leaf residues A, A, C, C, A, D, C, C, C, C, C)
Phylogenetic Tree Mutation Events (figure: tree with residue-pair labels XX, XX, CD, CD, AE, DE, CD, XX, CD, CD, CE)
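The slides do not spell out how mutation events are counted, so the following is only one possible reading, not the author's method. It assumes a rooted binary tree given as nested tuples with single-residue leaves, infers ancestral residues with Fitch parsimony (a swapped-in, standard technique), and counts one event each time a residue first appears along a branch, plus one for the root. Probabilities are then taken over events rather than over sequences. All names are hypothetical.

```python
from collections import Counter


def _bottom_up(node):
    """Fitch bottom-up pass: attach a candidate-state set to every node."""
    if isinstance(node, str):                      # leaf: observed residue
        return {"states": {node}, "children": []}
    kids = [_bottom_up(child) for child in node]
    common = set.intersection(*(k["states"] for k in kids))
    states = common or set.union(*(k["states"] for k in kids))
    return {"states": states, "children": kids}


def _top_down(annotated, parent_state, events):
    """Fitch top-down pass: fix one state per node and record changes."""
    if parent_state in annotated["states"]:
        state = parent_state                       # no mutation on this branch
    else:
        state = min(annotated["states"])           # deterministic tie-break
        events[state] += 1                         # residue arises here
    for kid in annotated["children"]:
        _top_down(kid, state, events)


def mutation_event_counts(tree):
    """Count how many times each residue arises (the root counts once)."""
    events = Counter()
    _top_down(_bottom_up(tree), None, events)
    return events


def event_probabilities(tree):
    """Residue probabilities estimated from events instead of sequences."""
    events = mutation_event_counts(tree)
    total = sum(events.values())
    return {residue: n / total for residue, n in events.items()}
```

For the tree `((("A", "A"), ("C", "C")), ("C", "C"))`, plain sequence counting gives p(A) = 2/6 and p(C) = 4/6, whereas the event count sees one origin of C and one mutation to A, so both get probability 0.5. Closely related sequences therefore no longer inflate the estimate.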
Results • Comparison of mutual information calculation using standard probability of sequences (STA) and probability with reduced phylogenetic effect (RPE) • Interested in amino acid pairs that have a mutual information score greater than four standard deviations from the mean (Z>=4) • Maximize the percentage of co-evolving pairs that are less than 12 angstroms apart and greater than 10 sequence positions apart
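The Z-score selection and the contact percentage described above can be sketched as follows. The function names, the dictionary layout keyed by position pairs, and the use of the population standard deviation are assumptions for illustration.

```python
import statistics


def high_scoring_pairs(mi_scores, z_threshold=4.0):
    """Return {pair: z} for pairs whose MI is >= z_threshold std devs above the mean.

    mi_scores maps (position_i, position_j) -> mutual information score.
    """
    values = list(mi_scores.values())
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population std dev (an assumption)
    if std == 0:
        return {}                     # all scores identical: nothing stands out
    z_scores = {pair: (score - mean) / std for pair, score in mi_scores.items()}
    return {pair: z for pair, z in z_scores.items() if z >= z_threshold}


def fraction_within(pairs, distance_of, cutoff=12.0):
    """Fraction of pairs whose 3D distance (angstroms) is below the cutoff."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if distance_of[p] < cutoff) / len(pairs)
```

Comparing `fraction_within` for the pairs selected under STA probabilities against those selected under RPE probabilities gives the kind of percentage the Results slide aims to maximize.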
PF04055.9 Sample Data 180-540 (multiple sequence alignment excerpt, one sequence per row)
QGYSLPPEEDLLGLGMTSTSFI.........RGIYLQNAKTLEEYHNTVLRGTF..ATV..KS.KIL.TED.DRI.RK..WAIHKL.MCTFTI...NKEEFFNLFGY....EFDTYFIESR...DRL.IS.METT....GLIH......NSPGS.LKVTPLGEL
QGYSLPPEEDLLGFGISATSFI.........RGIYLQNVKDLREYSETIQAGKL..ATV..KG.KIL.SQD.DKT.RK..WVIHTL.MCSFSL...SKLEFEQRFHE....RFDRYFADSY...DRL.CG.MESA....GLIR......QDSSS.LQVTPLGEL
QGYTTKGGADLVGVGLTSIGEG.........QRHYAQNFKDMSSYEAALDRGVL..PFE..RG.VIL.SDD.DEL.RK..AVIMEL.MANFKL...DIKSIEKEFSI....DFKEYFKEDL...KAL.EE.YKD......FVN......FDENF.IKVNETGVL
Clusters of Interest (figure: STA and RPE clusters marked on the sample alignment; labels 8, 410, 555, 188, 26, 8, 198, 412, 330)
Mutual Information Analysis in Pfam • 2,765 families have one or more PDB structures • Filter on families with > 100 sequences and < 5000 sequences • At least one PDB structure must have 90% agreement with an associated Pfam sequence in the family • 783 families were used to test the predictive quality of the STA and RPE methods
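The family filters above can be expressed as a simple predicate. This is a sketch under assumptions: the `Family` record and its field names are hypothetical, and "90% agreement" is modeled as a precomputed fraction, since the slides do not say how agreement is measured.

```python
from dataclasses import dataclass


@dataclass
class Family:
    accession: str              # family identifier (hypothetical values below)
    sequence_count: int         # number of sequences in the family alignment
    best_pdb_agreement: float   # best PDB-to-family-sequence agreement, 0..1


def passes_filters(family, min_sequences=100, max_sequences=5000,
                   min_agreement=0.90):
    """True if the family has > 100 and < 5000 sequences and a PDB
    structure with at least 90% agreement to one of its sequences."""
    size_ok = min_sequences < family.sequence_count < max_sequences
    return size_ok and family.best_pdb_agreement >= min_agreement


families = [
    Family("FAM_A", 250, 0.95),    # kept
    Family("FAM_B", 80, 0.99),     # too few sequences
    Family("FAM_C", 6000, 0.97),   # too many sequences
    Family("FAM_D", 1200, 0.70),   # no sufficiently matching PDB structure
]
selected = [f.accession for f in families if passes_filters(f)]
```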
Additional Filtering (figure; axes: Percentage < 12 Angstroms per Pfam family; Number of MI pairs where Z>=2)
Structure Prediction CASP7 • Mutual information prediction of co-evolving pairs is used to build a model to score a predicted tertiary structure • When a PDB structure exists for a particular Pfam family, we have accurate data to score the predicted structure • When no PDB structure exists, a predicted tertiary structure scores better than other predicted structures when its sum of distances between co-evolving pairs is smaller
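The scoring idea for the case without a PDB structure can be sketched as a sum of pairwise distances, where the candidate model with the smallest sum ranks best. The data layout (a dict from sequence position to an (x, y, z) coordinate) and the function names are assumptions for illustration.

```python
import math


def coevolution_score(coordinates, predicted_pairs):
    """Sum of 3D distances (angstroms) between predicted co-evolving pairs.

    coordinates     -- dict: sequence position -> (x, y, z)
    predicted_pairs -- iterable of (position_i, position_j)
    Lower is better.
    """
    return sum(math.dist(coordinates[i], coordinates[j])
               for i, j in predicted_pairs)


def best_model(models, predicted_pairs):
    """Name of the candidate structure with the smallest summed distance."""
    return min(models,
               key=lambda name: coevolution_score(models[name], predicted_pairs))
```

Given a compact and an extended candidate for the same sequence, the compact one wins whenever it places the predicted pairs closer together in total.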