Multiple Alignments Motifs/Profiles

Multiple Alignments Motifs/Profiles. What is multiple alignment? HOW does one do this? WHY does one do this? What do we mean by a motif or profile? Prev. reading: Ch 1-5. Assigned reading: Ch 6.4, 6.5, 6.6. BIO520 Bioinformatics, Jim Lund. Information from Alignments.

Presentation Transcript


  1. Multiple Alignments Motifs/Profiles • What is multiple alignment? • HOW does one do this? • WHY does one do this? • What do we mean by a motif or profile? • Prev. reading: Ch 1-5 • Assigned reading: Ch 6.4, 6.5, 6.6 • BIO520 Bioinformatics, Jim Lund

  2. Information from Alignments • Infer biological function • Conserved elements critical for function • Divergent elements relate to divergent function • Infer structure (2°, 3°) • Infer phylogeny • History • Evolutionary forces (selection…)

  3. How do I find similar sequences? DATABASE Alignment

  4. Multiple Alignment • Global, Optimal • Theory • Computation • Progressive Alignment

  5. Multiple Alignment: better alignments

  6. Alignment Methods/Programs • GAP (GCG suite) • Optimal Alignment • MSA • (nearly) Optimal Alignment • Clustal W/X • Progressive Alignment • PSI-BLAST • Searches for matching sequences iteratively • Search seq is invariant master for the alignment.

  7. MSA Strategy • Cost of a multiple alignment $A$ is the sum of the costs of its pairwise projections: $c(A) = \sum_{i<j} c(A_{i,j})$. Minimize score! • HUGE matrix (aa^(# of seqs)) → CRASH computer • time ~ product of sequence lengths • 1000x10,000 OK, but 200x200x200x200 NOT • Alignment procedure • nearly optimal (only considers a subset of all alignments) • weight sequences via distance • branch-and-bound algorithm
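
A minimal sketch of the two ideas on this slide, assuming that c(A) is a sum-of-pairs cost (the cost of a multiple alignment is the sum of the costs of its pairwise projections). The function names and the mismatch and gap costs are illustrative assumptions, not the MSA program's actual parameters or weights.

```python
from itertools import combinations
from math import prod


def dp_cells(lengths):
    """Cells in the full dynamic-programming lattice: product of (length + 1)."""
    return prod(n + 1 for n in lengths)


def pair_cost(row_a, row_b, mismatch=1, gap=2):
    """Cost of one pairwise projection; columns gapped in both rows are ignored."""
    cost = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue
        if a == "-" or b == "-":
            cost += gap
        elif a != b:
            cost += mismatch
    return cost


def sum_of_pairs_cost(rows, mismatch=1, gap=2):
    """Unweighted sum of pairwise projection costs over all pairs of rows."""
    return sum(pair_cost(a, b, mismatch, gap) for a, b in combinations(rows, 2))


print(dp_cells([1000, 10000]))   # two long sequences
print(dp_cells([200] * 4))       # four short sequences
print(sum_of_pairs_cost(["AC-T", "ACGT", "A-GT"]))
```

Two long sequences need about 10^7 cells, while four sequences of only 200 residues need about 1.6 x 10^9, which is the contrast the slide draws.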

  8. Running MSA • Download and run it locally (UNIX): • http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/genetic_analysis.html • On the internet: • http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html • Rerun on segments AFTER Clustal...

  9. Clustal Strategy • Rapid pairwise alignments each-to-each • Calculate distance matrix • Create guide tree (neighbor joining) • Align • Closest pairs first • Add pairs or align sub-alignments • Adjust similarity matrix as alignment proceeds • Add sequences • introduce gaps • gaps at loops, not inside known 2° structures • Dynamic gap weighting

  10. Clustal Strategy: Pairwise alignments → Guide tree → Align

  11. Clustal W(X) Strategy, step 1: Pairwise alignments. The number reported for each pairwise alignment here is a dissimilarity measure.
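
One simple dissimilarity measure is the fraction of mismatched residues between two aligned sequences. The sketch below is an illustration under assumptions: it reuses rows that are already aligned (three GAPDH fragments from the slides) instead of running Clustal's own fast pairwise alignments, and the function names are hypothetical.

```python
def p_distance(row_a, row_b):
    """Fraction of mismatches over columns where neither row has a gap."""
    compared = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in compared) / len(compared)


def distance_matrix(rows):
    """rows: dict name -> aligned sequence. Returns dict (name, name) -> distance."""
    return {(p, q): p_distance(rows[p], rows[q]) for p in rows for q in rows}


gapdh = {
    "Rabbit": "KAENGKLVING-KAITIFQERDPANIKWGDAGAEYVVESTGVFTTMEKAGAHLKGGAKRV",
    "Chick":  "KAENGKLVING-HAITIFQERDPSNIKWADAGAEYVVESTGVFTTMEKAGAHLKGGAKRV",
    "Human":  "KAEDGKLVIDG-KAITIFQERDPENIKWGDAGTAYVVESTGVFTTMEKAGAHLKGGAKRI",
}

matrix = distance_matrix(gapdh)
for (p, q), d in matrix.items():
    if p < q:
        print(f"{p:7s} {q:7s} {d:.3f}")
```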

  12. Clustal W(X) Strategy, step 2: Unrooted neighbor tree (dendrogram)

  13. Clustal W(X) Strategy, step 3: Guide tree
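
To show how a distance matrix becomes a guide tree, here is a small sketch. Note the substitution: the slides say Clustal builds its tree by neighbor joining, whereas this example uses UPGMA (average-linkage clustering) because it fits in a few lines. The function name, the toy distances, and the nested-tuple tree format are assumptions made for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a rooted tree (nested tuples) by average-linkage clustering.

    dist maps a pair of names, e.g. ("A", "B"), to their distance.
    """
    clusters = [(name, 1) for name in names]           # (subtree, leaf count)
    d = {frozenset(k): v for k, v in dist.items()}     # order-free pair lookup
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        (ta, na), (tb, nb) = min(
            combinations(clusters, 2),
            key=lambda pair: d[frozenset((pair[0][0], pair[1][0]))],
        )
        merged = (ta, tb)
        clusters = [c for c in clusters if c[0] != ta and c[0] != tb]
        # Distance to the merged cluster is the size-weighted average.
        for tc, _ in clusters:
            d[frozenset((merged, tc))] = (
                na * d[frozenset((ta, tc))] + nb * d[frozenset((tb, tc))]
            ) / (na + nb)
        clusters.append((merged, na + nb))
    return clusters[0][0]


toy = {
    ("A", "B"): 0.1, ("C", "D"): 0.2,
    ("A", "C"): 0.8, ("A", "D"): 0.8, ("B", "C"): 0.8, ("B", "D"): 0.8,
}
print(upgma(["A", "B", "C", "D"], toy))
```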

  14. Clustal W(X) Strategy, step 4: Progressive alignment using guide tree
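
The guide tree fixes the order in which sequences and sub-alignments are merged: closest pairs first, then groups. The sketch below only derives that merge order from a tree written as nested tuples (an assumed format); it does not perform the profile-to-profile alignment at each step.

```python
def merge_order(tree):
    """List the (left group, right group) merges in the order a progressive
    aligner would perform them, walking the guide tree bottom-up."""
    steps = []

    def visit(node):
        if isinstance(node, str):          # a leaf is a single sequence name
            return [node]
        left = visit(node[0])
        right = visit(node[1])
        steps.append((left, right))        # align these two groups next
        return left + right

    visit(tree)
    return steps


for left, right in merge_order((("A", "B"), ("C", "D"))):
    print("align", left, "with", right)
```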

  15. Running Clustal W/X • WWW, Win, Mac, UNIX • http://www2.ebi.ac.uk/clustalw/ • Input • Multiple sequence file (PIR, FASTA,…) • Can FORCE alignments • Specify secondary structures • Considerations • Fast, easy, widely used • Divergent proteins OK (trees misleading)

  16. “The Right Proteins”: GAPDH Rabbit KAENGKLVING-KAITIFQERDPANIKWGDAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117 Chick KAENGKLVING-HAITIFQERDPSNIKWADAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117 *********** :**********.:***.******************************* What do we learn?

  17. “The Right Proteins”: GAPDH Rabbit KAENGKLVING-KAITIFQERDPANIKWGDAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117 Chick KAENGKLVING-HAITIFQERDPSNIKWADAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117 Human KAEDGKLVIDG-KAITIFQERDPENIKWGDAGTAYVVESTGVFTTMEKAGAHLKGGAKRI 118 Tobacco KVKDEKTLLFGEKSVRVFGIRNPEEIPWAEAGADFVVESTGVFTDKDKAAAHLKGGAKKV 110 Entamoeba EAGENAIIVNGHKIV-VKAERDPAQIGWGALGVDYVVESTGVFTTIPKAEAHIKGGAKKV 105 :. : :: * : : :*:* :* *. *. :********* ** **:*****::
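
Adding more distant species shows which positions are truly invariant. As a rough illustration, the sketch below counts the columns that are identical in all five GAPDH fragments from this slide. The function name is an assumption, and "identical in every row" is a stricter criterion than the graded conservation symbols printed under the alignment.

```python
GAPDH = {
    "Rabbit":    "KAENGKLVING-KAITIFQERDPANIKWGDAGAEYVVESTGVFTTMEKAGAHLKGGAKRV",
    "Chick":     "KAENGKLVING-HAITIFQERDPSNIKWADAGAEYVVESTGVFTTMEKAGAHLKGGAKRV",
    "Human":     "KAEDGKLVIDG-KAITIFQERDPENIKWGDAGTAYVVESTGVFTTMEKAGAHLKGGAKRI",
    "Tobacco":   "KVKDEKTLLFGEKSVRVFGIRNPEEIPWAEAGADFVVESTGVFTDKDKAAAHLKGGAKKV",
    "Entamoeba": "EAGENAIIVNGHKIV-VKAERDPAQIGWGALGVDYVVESTGVFTTIPKAEAHIKGGAKKV",
}


def conserved_columns(rows):
    """0-based indices of columns where every row has the same residue (no gaps)."""
    return [
        i for i, column in enumerate(zip(*rows))
        if column[0] != "-" and len(set(column)) == 1
    ]


all_five = conserved_columns(list(GAPDH.values()))
mammal_bird = conserved_columns([GAPDH["Rabbit"], GAPDH["Chick"]])
print(len(mammal_bird), "identical columns in Rabbit vs Chick")
print(len(all_five), "identical columns across all five sequences")
```

The two-sequence comparison marks almost every column as conserved, so it says little about which residues matter; the five-sequence comparison narrows the list considerably.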

  18. Alignment Interpretation • DNA sequences • >50% “worth looking at” (eyeball test) • ~75% needed for phylogeny • Polypeptide sequences • 80% similar=SAME tertiary structure • 30-80% domains=similar structure • 15-30% ???? • <15% short motifs
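
The polypeptide rules of thumb above can be written as a small lookup. This is only a restatement of the slide's thresholds; the function name and the treatment of values that fall exactly on a boundary are assumptions.

```python
def interpret_protein_similarity(percent):
    """Map percent similarity between two polypeptides to the slide's rule of thumb."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent >= 80:
        return "same tertiary structure"
    if percent >= 30:
        return "domains with similar structure"
    if percent >= 15:
        return "uncertain (marked ???? on the slide)"
    return "short motifs at most"


for value in (95, 45, 20, 10):
    print(value, "->", interpret_protein_similarity(value))
```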

  19. Uses of Alignment • Understanding or predicting mutant function • Finding motifs in DNA or polypeptides • Directing experiments--e.g. PCR primers • Phylogeny

  20. “The Right Proteins” Rabbit KAENGKLVING-KAITIFQERDPANIKWGDAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117 Chick KAENGKLVING-HAITIFQERDPSNIKWADAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117 Human KAEDGKLVIDG-KAITIFQERDPENIKWGDAGTAYVVESTGVFTTMEKAGAHLKGGAKRI 118 Tobacco KVKDEKTLLFGEKSVRVFGIRNPEEIPWAEAGADFVVESTGVFTDKDKAAAHLKGGAKKV 110 Entamoeba EAGENAIIVNGHKIV-VKAERDPAQIGWGALGVDYVVESTGVFTTIPKAEAHIKGGAKKV 105 :. : :: * : : :*:* :* *. *. :********* ** **:*****:: (Slide annotations: PCR primer; mutation tolerated)
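
A long run of invariant residues is a natural place to design degenerate PCR primers, since the peptide can be back-translated to a limited set of codons. The transcript does not record which region the slide actually marks, so the sketch below simply finds the longest run of columns that are identical in all five sequences. The function name is an assumption.

```python
GAPDH_ROWS = [
    "KAENGKLVING-KAITIFQERDPANIKWGDAGAEYVVESTGVFTTMEKAGAHLKGGAKRV",  # Rabbit
    "KAENGKLVING-HAITIFQERDPSNIKWADAGAEYVVESTGVFTTMEKAGAHLKGGAKRV",  # Chick
    "KAEDGKLVIDG-KAITIFQERDPENIKWGDAGTAYVVESTGVFTTMEKAGAHLKGGAKRI",  # Human
    "KVKDEKTLLFGEKSVRVFGIRNPEEIPWAEAGADFVVESTGVFTDKDKAAAHLKGGAKKV",  # Tobacco
    "EAGENAIIVNGHKIV-VKAERDPAQIGWGALGVDYVVESTGVFTTIPKAEAHIKGGAKKV",  # Entamoeba
]


def longest_conserved_run(rows):
    """Return (0-based start column, peptide) of the longest run of columns
    in which every row carries the same residue."""
    best_start, best_len, run_start = 0, 0, None
    for i, column in enumerate(zip(*rows)):
        identical = column[0] != "-" and len(set(column)) == 1
        if identical:
            if run_start is None:
                run_start = i
            if i - run_start + 1 > best_len:
                best_start, best_len = run_start, i - run_start + 1
        else:
            run_start = None
    return best_start, rows[0][best_start:best_start + best_len]


start, peptide = longest_conserved_run(GAPDH_ROWS)
print(f"longest invariant block starts at column {start + 1}: {peptide}")
```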

  21. Viewing and interpreting alignments • Color residues by property • Conservation in the alignment • Known properties • Substitution groups: STA, HY • Physicochemical property • charge • hydrophobicity • Programs for visualization • Jalview • AMAS • Alscript
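
Substitution groups such as STA and HY can be used to annotate columns where residues differ but stay within one group. The sketch below marks identical columns with "*" and same-group columns with ":". The slide names only STA and HY; the remaining groups in the list are the Clustal "strong" groups as recalled here and should be treated as an assumption, as should the function names.

```python
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]


def classify_column(column, groups=STRONG_GROUPS):
    """'*' if all residues are identical, ':' if all fall in one substitution
    group, otherwise ' ' (columns containing a gap are also ' ')."""
    residues = set(column)
    if "-" in residues:
        return " "
    if len(residues) == 1:
        return "*"
    if any(residues <= set(group) for group in groups):
        return ":"
    return " "


def conservation_line(rows):
    """One symbol per alignment column."""
    return "".join(classify_column(column) for column in zip(*rows))


rows = ["AK-SH", "AE-TY", "AKGAH"]
for row in rows:
    print(row)
print(conservation_line(rows))
```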

  22. Viewing alignments JalView alignment viewer

  23. How to build multiple alignments • Find sequences to align (db search). • Choose which regions of each protein to include. • Sequences should be of similar lengths. • Run multiple alignment program. • Inspect multiple alignment for problems. • Regions with many gaps have aligned poorly. • Remove disruptive sequences and re-run alignment. • Add back remaining sequences avoiding disruption.
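
The inspection step can be partly automated. The sketch below flags gap-rich columns and counts gaps per sequence, which helps spot poorly aligned regions and disruptive sequences. The function names and the 50% threshold are assumptions chosen for illustration.

```python
def gap_fraction_per_column(rows):
    """Fraction of rows that have a gap in each alignment column."""
    return [column.count("-") / len(column) for column in zip(*rows)]


def gappy_columns(rows, threshold=0.5):
    """0-based indices of columns whose gap fraction is at least the threshold."""
    return [i for i, f in enumerate(gap_fraction_per_column(rows)) if f >= threshold]


def gaps_per_sequence(named_rows):
    """Gap count for each named row; unusually high counts suggest a
    sequence that is disrupting the alignment."""
    return {name: row.count("-") for name, row in named_rows.items()}


alignment = {
    "s1": "AC--GT",
    "s2": "ACT-GT",
    "s3": "A---GT",
    "s4": "ACTAGT",
}
print(gappy_columns(list(alignment.values())))
print(gaps_per_sequence(alignment))
```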

  24. Interpro • Pfam 7.3 (3865 domains), • PRINTS 33.0 (1650 fingerprints), • PROSITE 17.5 (1565 and 252 preliminary profiles), • ProDom 2001.3 (1346 domains), • SMART 3.1 (509 domains), • TIGRFAMs 1.2 (814 domains), • SWISS-PROT 40.27 (113470 entries), • TrEMBL 21.12 (685610 entries).

  25. Interpro: A database of protein families, domains and functional sites • PROSITE, home of regular expressions and profiles; • Pfam, SMART, TIGRFAMs, PIRSF, and SUPERFAMILY, keepers of hidden Markov models (HMMs); • PRINTS, provider of fingerprints (groups of aligned, un-weighted motifs);
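
PROSITE patterns are regular expressions written in their own notation, so a motif can be searched with an ordinary regex engine after a simple translation. The sketch below handles the common elements (`x`, `[...]`, `{...}`, repeat counts, and the `<` and `>` anchors at the ends of a pattern). It is a simplified converter written for illustration, not a PROSITE tool, and the function name is an assumption. The example pattern is the N-glycosylation site, N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}.' to a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Separate the residue specification from an optional (n) or (n,m) repeat.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                      # any residue
            regex = "."
        elif core.startswith("{"):           # any residue except those listed
            regex = "[^" + core[1:-1] + "]"
        else:                                # literal residue or [set]
            regex = core
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + regex + suffix)
    return "".join(parts)


n_glyc = prosite_to_regex("N-{P}-[ST]-{P}.")
print(n_glyc)
print([m.start() for m in re.finditer(n_glyc, "AANGSAANPSA")])
```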

  26. Interpro

  27. NCBI CDD (Conserved Domain Database) Domains from: • Pfam (Protein families) • A database of protein families that currently contains > 7973 entries. • SMART (a Simple Modular Architecture Research Tool) • More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. • Domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. • COGs (Clusters of Orthologous Groups) • Proteins or groups of paralogs from at least 3 lineages that correspond to an ancient conserved domain