
Sequence Based Analysis Tutorial

Sequence Based Analysis Tutorial. NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Information Resource at Georgetown University Medical Center. Retrieval, Sequence Search & Classification Methods. Retrieve protein info by text / UID Sequence Similarity Search


Presentation Transcript


  1. Sequence Based Analysis Tutorial NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Information Resource at Georgetown University Medical Center

  2. Retrieval, Sequence Search & Classification Methods • Retrieve protein info by text / UID • Sequence Similarity Search • BLAST, FASTA, Dynamic Programming • Family Classification • Patterns, Profiles, Hidden Markov Models, Sequence Alignments, Neural Networks • Integrated Search and Classification System

  3. Sequence Similarity Search (I) • Based on Pair-Wise Comparisons • Dynamic Programming Algorithms • Global Similarity: Needleman-Wunsch • Local Similarity: Smith-Waterman • Heuristic Algorithms • FASTA: Based on K-Tuples (2-Amino Acid) • BLAST: Triples of Conserved Amino Acids • Gapped-BLAST: Allow Gaps in Segment Pairs • PHI-BLAST: Pattern-Hit Initiated Search • PSI-BLAST: Position-Specific Iterated Search
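
The local dynamic programming method named on slide 3 (Smith-Waterman) can be sketched as follows. This is a minimal illustration, not the implementation behind any PIR tool: it assumes a simple match/mismatch score and a linear gap penalty instead of a substitution matrix such as PAM250 or BLOSUM62, and the function name and default parameter values are hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment.

    Assumed scoring: fixed match/mismatch values and a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option lets an alignment restart anywhere (local, not global)
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("AAAHEAGAWGHEE", "XXHEAGAWXX"))
```

Dropping the 0 option and initializing the first row and column with cumulative gap penalties would turn the same recurrence into a global (Needleman-Wunsch style) alignment.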

  4. Sequence Similarity Search (II) • Similarity Search Parameters • Scoring Matrices – Based on Conserved Amino Acid Substitution • Dayhoff Mutation Matrix, e.g., PAM250 (~20% Identity) • Henikoff Matrix from Ungapped Alignments, e.g., BLOSUM 62 • Gap Penalty • Search Time Comparisons • Smith-Waterman: 10 Min • FASTA: 2 Min • BLAST: 20 Sec

  5. Feature Representation • Features of Amino Acids: Physicochemical Properties, Context (Local & Global) Features, Evolutionary Features • Alternative Amino Acids: Classification of Amino Acids To Capture Different Features of Amino Acid Residues

  6. Substitution Matrix • Likelihood of One Amino Acid Mutated into Another Over Evolutionary Time • Negative Score: Unlikely to Happen (e.g., Gly/Trp, -7) • Positive Score: Conservative Substitution (e.g., Lys/Arg, +3) • High Score for Identical Matches: Rare Amino Acids (e.g., Trp, Cys)
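
To make the scoring idea on slide 6 concrete, the sketch below sums substitution scores over an ungapped pair of equal-length segments. Only the Gly/Trp (-7) and Lys/Arg (+3) entries come from the slide; the identity scores for Trp, Cys, Lys and Gly are illustrative values in the style of PAM250 and should be treated as assumptions, as should the function names and the default score of 0 for pairs missing from this partial table.

```python
# Partial, symmetric substitution table (not a full matrix).
SCORES = {
    ("G", "W"): -7,   # from the slide: unlikely substitution
    ("K", "R"): 3,    # from the slide: conservative substitution
    ("W", "W"): 17,   # assumed: high identity score for a rare residue
    ("C", "C"): 12,   # assumed: high identity score for a rare residue
    ("K", "K"): 5,    # assumed
    ("G", "G"): 5,    # assumed
}


def pair_score(x, y, default=0):
    """Look up a score in either orientation; unknown pairs get `default`."""
    return SCORES.get((x, y), SCORES.get((y, x), default))


def ungapped_score(s, t):
    """Sum substitution scores position by position (no gaps allowed)."""
    if len(s) != len(t):
        raise ValueError("segments must have equal length")
    return sum(pair_score(x, y) for x, y in zip(s, t))


if __name__ == "__main__":
    # K/R (+3) + W/W (+17) + G/W (-7)
    print(ungapped_score("KWG", "RWW"))
```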

  7. BLAST (Basic Local Alignment Search Tool) • Extremely fast • Robust • Most frequently used • Finds very short segment pairs (“seeds”) between the query and the database sequence • These seeds are then extended in both directions until the maximum possible score for extensions of this particular seed is reached
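
The seed-and-extend idea on slide 7 can be sketched roughly as below. This is a simplified illustration and not the BLAST algorithm itself: it assumes exact word matches as seeds (BLAST also accepts high-scoring similar words), a +1/-1 score instead of a substitution matrix, ungapped extension only, and a hypothetical drop-off limit `xdrop` to decide when to stop extending.

```python
def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) for every shared exact word of length k."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


def extend(query, subject, i, j, k=3, match=1, mismatch=-1, xdrop=2):
    """Extend a seed without gaps in both directions.

    Extension in a direction stops when the running score falls more than
    `xdrop` below the best score seen so far, or a sequence end is reached.
    Returns (best_score, (query_start, query_end)) with an exclusive end.
    """
    def score(a, b):
        return match if a == b else mismatch

    best = cur = k * match          # the seed itself is k exact matches
    best_left = 0
    step = 1
    while i - step >= 0 and j - step >= 0:          # leftward extension
        cur += score(query[i - step], subject[j - step])
        if cur > best:
            best, best_left = cur, step
        elif best - cur > xdrop:
            break
        step += 1

    cur = best                      # restart from the best left extension
    best_right = 0
    step = 0
    while i + k + step < len(query) and j + k + step < len(subject):
        cur += score(query[i + k + step], subject[j + k + step])
        if cur > best:
            best, best_right = cur, step + 1
        elif best - cur > xdrop:
            break
        step += 1
    return best, (i - best_left, i + k + best_right)


if __name__ == "__main__":
    q, s = "MKVLAAGIW", "XXKVLAAGXX"
    print(find_seeds(q, s))
    print(extend(q, s, 2, 3))
```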

  8. BLAST Search • From BLAST Search Interface • Table-Format Result with BLAST Output and SSEARCH (Smith-Waterman) Pair-Wise Alignment • Link to NCBI taxonomy • Link to PIRSF report • Links to iProClass and UniProtKB reports • Click to see alignment • Click to see SSEARCH alignment

  9. BLAST Result & Pairwise Alignment • BLAST Alignment

  10. Classification • What is classification? • Why do we need protein classification? • Different levels of classification • Basis for functional protein classification • How to classify a protein of unknown function?

  11. Classification Databases • Protein motif • Protein domain • 3-D structure • Whole-protein • Group proteins according to the presence of a common domain • Pattern: C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H (the 2 C's and the 2 H's are zinc ligands) • Group proteins according to common domain architecture and length • Group proteins according to common 3D structure
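
The pattern on slide 11 is written in PROSITE-style syntax, which maps closely onto regular expressions. The sketch below converts such a pattern and scans a sequence with it. It is a minimal illustration: the function names are hypothetical, only the basic elements (single residues, `x`, `[...]`, `{...}` and repeat counts) are handled, and anchors such as `<` and `>` are not supported. The test sequence is made up for the example.

```python
import re

ZINC_PATTERN = "C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H"

# One pattern element: residue spec plus optional (n) or (n,m) repeat
_ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, lo, hi = m.groups()
        if residue == "x":                  # x = any residue
            token = "."
        elif residue.startswith("{"):       # {ABC} = any residue except A, B, C
            token = "[^" + residue[1:-1] + "]"
        else:                               # single residue or [ABC] set
            token = residue
        if lo is not None:                  # (n) or (n,m) repeat count
            token += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(token)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex(ZINC_PATTERN))
    print(find_motif(ZINC_PATTERN, "MKCAACAAALAAAAAAAAHAAAHGG"))
```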

  12. Family Classification Methods • Based on Other Classification Information • Multiple Sequence Alignment (ClustalW) • ProSite Pattern Search • Profile Search • Hidden Markov Models (HMMs) Domain (Pfam); Whole protein (PIRSF) • Neural Networks

  13. How do you build a tree? • Pick sequences to align • Align them • Verify the alignment • Keep the parts that are aligned correctly • Build and evaluate a phylogenetic tree • Integrated Analysis

  14. Multiple Sequence Alignment • ClustalW • Progressive Pairwise Approach • Based on Exhaustive Pairwise Alignments • Neighbor Joining • Joining Order Corresponding to a Tree • Alignment Varies • Dependent on Joining Order
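
The point that the joining order corresponds to a tree can be illustrated with a small sketch that derives a joining order from exhaustive pairwise distances. This is not ClustalW and not neighbor joining: for brevity it substitutes simple average-linkage clustering, and it assumes the sequences are already the same length so that distance can be taken as the fraction of differing positions. The sequences and names are invented for the example.

```python
from itertools import combinations


def p_distance(s, t):
    """Fraction of positions that differ (assumes equal-length sequences)."""
    return sum(x != y for x, y in zip(s, t)) / len(s)


def join_order(seqs):
    """Return the list of merges, closest clusters first.

    seqs: dict mapping name -> sequence. Each cluster is a tuple of names.
    Cluster distance is the average of all pairwise member distances
    (average linkage, used here in place of neighbor joining).
    """
    clusters = [(name,) for name in seqs]
    merges = []

    def dist(c1, c2):
        total = sum(p_distance(seqs[a], seqs[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges


if __name__ == "__main__":
    example = {"s1": "ACDEFG", "s2": "ACDEFH", "s3": "WCDKFH", "s4": "WYDKLH"}
    for left, right in join_order(example):
        print(left, "+", right)
```

A progressive aligner would align sequences in this order, which is why a different joining order can yield a different final alignment.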

  15. Multiple Alignment and Tree • From Text/Sequence Search Result or ClustalW Alignment Interface

  16. Motif Patterns (Regular Expressions) • Signature Patterns for Functional Motifs ProClass Motif Alignments

  17. PIR Pattern Search • From Text/Sequence Search Result or Pattern Search Interface • One Query Sequence Against PROSITE Pattern Database • One Query Pattern (PROSITE or User-Defined) Against Sequence DB

  18. Pattern Search Result (I) • One Query Sequence Against PROSITE Pattern Database

  19. Pattern Search Result (II) • One Query Pattern Against Sequence Database • Display the query pattern • Sorting arrows • Links to iProClass and UniProtKB reports • Link to NCBI taxonomy • Link to PIRSF report

  20. Profile Method • Profile: A Table of Scores to Express Family Consensus Derived from Multiple Sequence Alignments • Num of Rows = Num of Aligned Positions • Each row contains a score for the alignment with each possible residue. • Profile Searching • Summation of Scores for Each Amino Acid Residue along Query Sequence • Higher Match Values at Conserved Positions
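
The profile idea on slide 20 (one row of scores per aligned position, summed along the query) can be sketched as below. The scoring scheme is an assumption chosen for illustration: log-odds scores against a uniform background with a pseudocount of 1, which is not necessarily what any particular profile tool uses. The toy alignment and function names are hypothetical, and only the 20 standard residues are supported.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build one score row per alignment column.

    Each row maps every residue to log2(observed frequency / background),
    assuming a uniform background and additive pseudocounts.
    """
    background = 1.0 / len(ALPHABET)
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        row = {aa: math.log2(((counts[aa] + pseudocount) / total) / background)
               for aa in ALPHABET}
        profile.append(row)
    return profile


def score_window(profile, segment):
    """Sum the per-position scores of a segment as long as the profile."""
    return sum(row[aa] for row, aa in zip(profile, segment))


def best_hit(profile, query):
    """Slide the profile along the query; return (best_score, start)."""
    n = len(profile)
    return max((score_window(profile, query[i:i + n]), i)
               for i in range(len(query) - n + 1))


if __name__ == "__main__":
    profile = build_profile(["CAH", "CGH", "CAH"])
    print(best_hit(profile, "MMCAHMM"))
```

In this toy alignment the fully conserved first column gives Cys a higher score than the partly conserved second column gives Ala, which mirrors the slide's point about higher match values at conserved positions.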

  21. PIRSF Scan • Shows the PIRSF that the query belongs to • Searches one query protein against all the full-length and domain HMM models for the fully curated PIRSFs using HMMER • The matched regions and statistics are displayed • Statistical data for all domains • Statistical data per domain • Alignment with consensus sequence

  22. Secondary Structure Features • α Helix: Patterns of Hydrophobic Residue Conservation Showing an I, I+3, I+4, I+7 Pattern Are Highly Indicative of an α Helix (Amphipathic) • β Strands That Are Half Buried in the Protein Core Will Tend to Have Hydrophobic Residues at Positions I, I+2, I+4, I+6
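
The two spacing rules on slide 22 can be checked mechanically. The sketch below scans a single sequence for positions where hydrophobic residues occur at the helix offsets (I, I+3, I+4, I+7) or the strand offsets (I, I+2, I+4, I+6). It is a rough illustration: the slide speaks of conservation across aligned sequences, whereas this checks only one sequence, and the set of residues counted as hydrophobic is an assumption.

```python
# Assumed hydrophobic set; other definitions exist.
HYDROPHOBIC = set("AVLIMFWC")

HELIX_OFFSETS = (0, 3, 4, 7)    # amphipathic helix spacing from the slide
STRAND_OFFSETS = (0, 2, 4, 6)   # half-buried strand spacing from the slide


def matches_offsets(seq, start, offsets):
    """True if every offset from `start` is in range and hydrophobic."""
    return all(start + o < len(seq) and seq[start + o] in HYDROPHOBIC
               for o in offsets)


def find_periodic(seq, offsets):
    """Return all start positions whose offsets are all hydrophobic."""
    return [i for i in range(len(seq)) if matches_offsets(seq, i, offsets)]


if __name__ == "__main__":
    print(find_periodic("LKKLLKKL", HELIX_OFFSETS))
    print(find_periodic("LKLKLKL", STRAND_OFFSETS))
```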

  23. 3D Structure • Proteins share the same fold, suggesting homology • Beta B1 Crystallin • Gamma Crystallin C

  24. Creation and Curation of PIRSFs

  25. Integrated Bioinformatics System for Function and Pathway Discovery • Data Integration • Associative Analysis

  26. Family Classification & Functional Analysis: Analytical Pipeline • UniProt Query Sequence • BLAST Search • HMM Domain Search • Top-Matched Superfamilies/Domains • HMM Motif Search • Pattern Search • SignalP/TMHMM • Predicted Superfamilies/Domains/Motifs/Sites/Signal Peptides/TMHs • CLUSTALW • SSEARCH • Superfamily/Domain/Motif Alignments • Family Relationships & Functional Features

  27. Integrated Bioinformatics System • Global Bioinformatics Analysis of 1000’s of Genes and Proteins • Pathway Discovery, Target Identification

  28. Lab Section

  29. Text Search

  30. Text Search Result (I) • Extend your search or start over • Choose columns to be displayed • Expand view • Pre-computed BLAST Results • Links to iProClass and UniProtKB reports • Link to NCBI taxonomy • Link to PIRSF report

  31. Text Search Result (III) Number of Related Seq. at 3 different E-value cut-offs

  32. Text Search Result (II) • Extend your search or start over • Choose columns to be displayed • Curated domain architecture with links to Pfam database • Link to PIRSF report • Extent of family curation

  33. Peptide Search

  34. Peptide Search & Results • Sorting arrows • Links to iProClass and UniProtKB reports • Link to NCBI taxonomy • Link to PIRSF report • Matching peptide highlighted in the sequence

  35. Batch Retrieval Results (I) • Retrieve more sequences • Choose columns to be displayed • Links to iProClass and UniProtKB reports

  36. Batch Retrieval Results (II) • Retrieve more families • Choose columns to be displayed • Links to PIRSF reports • Curated domain architecture (N- to C-termini) with links to Pfam database

  37. Blast Similarity Search

  38. Blast / Related Sequences Results

  39. BLAST Result & Pairwise Alignment • BLAST Alignment

  40. Pairwise Alignment

  41. Multiple Alignment • Interactive Phylogenetic Tree and Alignment

  42. Phylogenetic Tree and Alignment View

  43. Pattern Search (I)

  44. Pattern Search (II) • Display the query pattern • Sorting arrows • Links to iProClass and UniProtKB reports • Link to NCBI taxonomy • Link to PIRSF report

  45. PIRSF scan

  46. PIRSF Report

  47. PIRSF Family Hierarchy

  48. Taxonomic Distribution & Phylogenetic Pattern

  49. Rabbit Alpha Crystallin A Chain: An iProClass View of the Entry • See protein synonyms • See IDs from different databases • Pre-computed BLAST results
