
Multiple Sequence Alignments Advanced BLAST searches

June 17, 2014

diane

Presentation Transcript


  1. Multiple Sequence Alignments • Advanced BLAST searches • June 17, 2014

  2. Topics • Overview of MSA • MSA methods • Practical aspects • MSA to Profiles • PSI-BLAST • PHI-BLAST

  3. Overview of MSA • Alignment of ≥ 3 sequences to bring as many similar characters into register as possible • Hypothetical model of mutations (substitutions, insertions & deletions) • The best alignment represents the most likely evolutionary scenario • The true alignment cannot be unambiguously established

  4. MSA: Motivation • Correspondence. Find out which parts “do the same thing” • Similar genes are conserved across widely divergent species, often performing similar functions • Structure prediction • Use knowledge of structure of one or more members of a protein MSA to predict structure of other members • Structure is more conserved than sequence • Create “profiles” for protein families • Allow us to search for other members of the family • Genome assembly: • Automated reconstruction of “contig” maps of genomic fragments such as ESTs • MSA is the starting point for phylogenetic analysis

  5. MSA: Approaches • Optimal Global Alignments - Dynamic programming • Find alignment that maximizes a score function • Computationally expensive: time grows as the product of the sequence lengths • Global Progressive Alignments - Match closely related sequences first using a guide tree (CLUSTALW) • Global Iterative Alignments - Multiple re-building attempts to find best alignment (MUSCLE) • Local alignments • Profiles, Blocks, Patterns
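To make the cost of dynamic programming concrete, here is a minimal sketch of the two-sequence case (Needleman-Wunsch scoring). The match, mismatch and gap values are illustrative assumptions, not parameters from the slides. The table has one cell per pair of positions, so work grows as the product of the two lengths; extending the same idea to N sequences needs an N-dimensional table.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score for two sequences with a linear gap penalty.

    The scoring values are assumed for illustration only.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two residues, gap in b, gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("ATGTCG", "AAGACT"))
```

Only two rows of the table are kept because the score alone is requested; recovering the alignment itself would require storing the full table for traceback.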

  6. MSA algorithms • ClustalW • Hierarchical & progressive (NOT iterative) • Uses guide trees • Can propagate errors made early in the alignment • Most common • Webserver & local (BioEdit, Geneious) • MUSCLE • Progressive & iterative • Faster than CLUSTALW, especially on larger sequence sets • Command line & in Geneious Pro, MacVector and MEGA5

  7. Overview of hierarchical method • Do a pair-wise comparison of all sequences • Create a guide tree of the most to least similar • Align 2 most similar, then next 2 most similar • Add sequences progressively in decreasing order of similarity • Gaps that are introduced are never removed

  8. Step 1 - Pairwise alignments • Compare each sequence with every other sequence and calculate a distance matrix:

          A     B     C
      A   -
      B  .87    -
      C  .59   .60    -

     • Each number is the number of exact matches divided by the sequence length (ignoring gaps), so the higher the number, the more closely related the two sequences are • In this matrix, sequence A is 87% identical to sequence B • The values stored are fractional identities; the corresponding distance is 1 minus the identity (0.13 for A and B)
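The identity calculation described on this slide can be sketched as follows. The three toy sequences are hypothetical, and for brevity they share a single gapped alignment; in the actual procedure each pair would be aligned separately before counting matches.

```python
from itertools import combinations


def percent_identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def identity_matrix(aligned):
    """Map each unordered pair of sequence names to its fractional identity."""
    return {(m, n): percent_identity(aligned[m], aligned[n])
            for m, n in combinations(sorted(aligned), 2)}


# Hypothetical toy sequences, already aligned ("-" marks a gap).
toy = {"A": "ATGTCG", "B": "ATGTCA", "C": "AT-ACG"}
for pair, ident in identity_matrix(toy).items():
    print(pair, "identity", round(ident, 2), "distance", round(1 - ident, 2))
```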

  9. Step 2 - Create guide tree • Use the distance matrix to create a guide tree that determines the "order" of the sequences • [Figure: guide tree for sequences A, B and C. A and B join at identity 0.87 (distance 0.13); C joins at identity 0.60 (distance 0.40)] • Branch length is proportional to estimated divergence between A and B (0.13)
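A guide tree can be sketched with simple average-linkage clustering: repeatedly merge the two closest clusters. This is a swapped-in, simplified method chosen for illustration and is not claimed to be the tree-building algorithm ClustalW uses. The identities are the ones from the matrix on slide 8; the function name and data layout are assumptions.

```python
def guide_tree(names, dist):
    """Build a nested-tuple guide tree by repeatedly merging the closest clusters.

    dist maps frozenset({x, y}) -> distance for every pair of leaf names.
    """
    # Each cluster is (subtree, leaf_names_under_that_subtree).
    clusters = [(n, [n]) for n in names]

    def avg(c1, c2):
        # Average linkage: mean leaf-to-leaf distance between two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1[1] for y in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        index_pairs = ((i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters)))
        i, j = min(index_pairs, key=lambda p: avg(clusters[p[0]], clusters[p[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Fractional identities from the slide, converted to distances (1 - identity).
identity = {("A", "B"): 0.87, ("A", "C"): 0.59, ("B", "C"): 0.60}
dist = {frozenset(pair): 1 - ident for pair, ident in identity.items()}
print(guide_tree(["A", "B", "C"], dist))
```

With these values A and B are merged first and C is attached afterwards, which is the alignment order the next step follows.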

  10. Step 3 - Progressive alignment • First, align A and B • Then add sequence C to the previous alignment • In closely related sequences, gaps are given a heavier weight than in more divergent sequences • [Figure: guide tree for A, B and C]

  11. Amino acid weight matrices • Series of scoring matrices that one can use depending on the relatedness of the proteins aligned. • As the alignment proceeds in CLUSTALW the AA weight matrices are changed to more divergent scoring matrices. • Length of the branch is used to determine which matrix to use and contributes to the alignment score.

  12. Globin alignment • Starting with a group of 7 globin-related sequences from different species • Do pairwise alignments between all 7 sequences • Calculate similarity between each pair; higher score indicates more similar

  13. Cluster the sequences by similarity to create a guide tree • Branch length is proportional to estimated divergence between the two sequences

  14. Globin alignment

  15. ClustalW alignment • Symbols: * identity, : high similarity, . low similarity, - gap in sequence • Amino acids are often color coded based on physicochemical properties

  16. ClustalW vs MUSCLE • [Figures: ClustalW alignment; MUSCLE alignment]

  17. Practical aspects • Identify & download sequences in correct format • Should meet criteria for MSA: • Closely related (E < 1e-10) • Similar length and number of domains • Same domain order • If necessary, extract regions of similar length • Name them appropriately • Short, descriptive names that fit on the output

  18. Alignment viewers • Edit and prepare for publication • Different coloring schemes • Jalview -- Java based interactive viewer (free)

  19. MSA -> Profiles • Profile: A table that lists the frequencies of each amino acid at each position of a protein sequence • Frequencies are calculated from an MSA containing a domain of interest • Allows us to identify a consensus sequence • Derived scoring scheme allows us to align a new sequence to the profile • Profile can be used in database searches • Find new sequences that match the profile • Profiles are also used to compute multiple alignments heuristically • Progressive alignment

  20. Why not just use BLAST? • Database searches using a profile or position-specific scoring matrices (PSSM) are much more sensitive for detecting weak or distant relationships than are database searches using a single sequence as query • Information content higher in a PSSM

  21. Pairwise alignment

  22. Position Specific Scoring Matrix (PSSM)

  23. MSAs -> PSSM

      POS   123456
      Seq1  ATGTCG
      Seq2  AAGACT
      Seq3  TACTCA
      Seq4  CGGAGG
      Seq5  AACCTG

  24. Convert MSA to raw frequency table • Sequences: ATGTCG, AAGACT, TACTCA, CGGAGG, AACCTG

  25. Normalize by dividing by overall frequencies

  26. Convert the values to log base 2 to obtain the PSSM

  27. Match the string “AACTCG” to the matrix SUM: 1.0 + 1.0 + 0.8 + 1.0 + 1.38 + 1.15 = 6.33

  28. Match the string “AACTGG” to the matrix SUM: 1.0 + 1.0 + 0.8 + 1.0 - 0.43 + 1.15 = 4.52
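The steps on slides 24 to 28 can be reproduced with a short sketch: count residues per column, divide by the overall residue frequencies in the alignment, take log base 2, then sum the per-position scores for a query string. The function names are assumptions, and so is the handling of residues never seen in a column (scored as negative infinity here; real tools typically add pseudocounts instead).

```python
import math
from collections import Counter

MSA = ["ATGTCG", "AAGACT", "TACTCA", "CGGAGG", "AACCTG"]


def build_pssm(msa):
    """Return one dict per column mapping residue -> log2(column freq / overall freq)."""
    n_seqs = len(msa)
    total = Counter("".join(msa))
    n_chars = sum(total.values())
    background = {r: c / n_chars for r, c in total.items()}
    pssm = []
    for column in zip(*msa):
        counts = Counter(column)
        pssm.append({r: math.log2((counts[r] / n_seqs) / background[r])
                     for r in counts})
    return pssm


def score(seq, pssm, absent=float("-inf")):
    """Sum per-position scores; residues unseen in a column get `absent` (an assumption)."""
    return sum(col.get(res, absent) for res, col in zip(seq, pssm))


pssm = build_pssm(MSA)
for s in ("AACTCG", "AACTGG"):
    print(s, round(score(s, pssm), 2))
```

The totals come out within a few hundredths of the slide values (6.33 and 4.52); the small differences presumably reflect rounding of intermediate values on the slides. As on the slides, AACTCG scores higher than AACTGG because G is rare at position 5 of the alignment.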

  29. PSI-BLAST • Position-Specific Iterated BLAST • Can generate a position-specific scoring matrix starting from a single sequence searched against a single database • Builds the PSSM iteratively • Increases sensitivity of search with each iteration

  30. Steps in PSI-BLAST • Single protein sequence compared to database using BLASTP • Construct a multiple alignment and profile (PSSM) from any significant local alignments • query sequence is template • lengths all identical to query • Profile or PSSM is compared to database, making local alignments • Estimate statistical significance of local alignments • Iterate an arbitrary number of times or until convergence (no new sequences added)
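The iterate-until-convergence logic in the last step can be sketched independently of BLAST itself. In this toy version, `search` is a hypothetical callable standing in for one search round, and `REACHABLE` is made-up data describing which hits become detectable once other hits are included in the profile. Nothing here calls a real BLAST program.

```python
def iterate_search(query, search, max_iterations=5):
    """Repeat `search` with a growing set of included hits until nothing new appears.

    `search(query, included)` is a hypothetical stand-in for one round; it
    returns the set of significant hit IDs found when the profile is built
    from the query plus `included`.
    """
    included = set()
    for iteration in range(1, max_iterations + 1):
        hits = search(query, included)
        new = hits - included
        if not new:
            return included, iteration, True  # converged: no new sequences added
        included |= new
    return included, max_iterations, False  # stopped at the iteration cap


# Made-up data: which hits each sequence makes detectable.
REACHABLE = {"query": {"h1", "h2"}, "h1": {"h3"}, "h2": set(), "h3": set()}


def toy_search(query, included):
    found = set(REACHABLE[query])
    for hit in included:
        found |= REACHABLE[hit]
    return found


print(iterate_search("query", toy_search))
```

In the toy data, h3 is only found in the second round, after h1 has been added to the profile; this mirrors how a distant homolog can surface once intermediate sequences sharpen the PSSM.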

  31. Practical uses of PSI-BLAST • Can create a PSSM using PSI-BLAST against 1 database • e.g. NR • Use the PSSM to run a more sensitive search of another database • e.g. Refseq or NR restricted to a taxonomic group • Does not have to run to convergence to create a PSSM useful for finding remote homologues; usually 2 or 3 iterations are sufficient • SLOW – use when there is no domain in your protein
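With a local BLAST+ installation, the two-stage workflow above might look like the following sketch. It only builds the `psiblast` command lines; the file names and database names are hypothetical placeholders, and the choice of 3 iterations follows the slide's rule of thumb. The options used (`-query`, `-db`, `-num_iterations`, `-out_pssm`, `-in_pssm`, `-out`) are standard `psiblast` options; note that `-in_pssm` replaces `-query` in the second stage.

```python
import shlex


def build_pssm_command(query_fasta, db, pssm_file, iterations=3):
    """Stage 1: iterate against a broad database and save the PSSM checkpoint."""
    return ["psiblast", "-query", query_fasta, "-db", db,
            "-num_iterations", str(iterations), "-out_pssm", pssm_file]


def search_with_pssm_command(pssm_file, db, out_file):
    """Stage 2: reuse the saved PSSM to search a second database."""
    return ["psiblast", "-in_pssm", pssm_file, "-db", db, "-out", out_file]


# Hypothetical file and database names.
stage1 = build_pssm_command("sma4.fasta", "nr", "sma4.pssm")
stage2 = search_with_pssm_command("sma4.pssm", "refseq_fungi", "sma4_vs_fungi.txt")
print(shlex.join(stage1))
print(shlex.join(stage2))
```

Each list can be passed to `subprocess.run(..., check=True)` once the databases exist locally.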

  32. Delta-BLAST • Domain Enhanced Lookup Time Accelerated BLAST • Works when you are looking for proteins with a known protein domain

  33. Sma4 protein • 570 aa from C. elegans • Domain structure • What homologs exist in C. briggsae? • C. briggsae & C. elegans are both nematodes that diverged ~80-100 million years ago

  34. Sma4 from C. elegans • BLASTP against Refseq limited to C. briggsae • Delta-BLAST against Refseq limited to C. briggsae

  35. Sma4 vs TAG-68

  36. Similar functional role? • Sma4 (520 aa) • TAG-68 (415 aa)

  37. Where Sma4 homologs are not... • Sma4 BLASTP against Refseq (fungi) • Sma4 PSSM PSI-BLAST against Refseq (fungi)

  38. PHI-BLAST • Pattern-hit initiated BLAST • Enforces the presence of a motif in addition to the usual PSI-BLAST criteria for matching • Uses protein domain signatures from PROSITE database • Initiate a PSI-BLAST search, but include a signature pattern from PROSITE to limit search to sequences which contain that motif or signature

  39. PHI-BLAST example • PHI-BLAST • Query: E3 ubiquitin ligase ARIH2 (human) • Database: Refseq (Aspergillus) • Signature of ZF_RING_1, Zinc finger RING-type: C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]

  40. PHI-BLAST, cont. • BLASTP of ARIH2 against Refseq (Aspergillus) • PHI-BLAST of ARIH2 against Refseq (Aspergillus), with the ZF_RING_1 signature

  41. Prosite pattern: C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA] • Human ARIH2 protein: CkHdFCwmCL • Top Aspergillus match: CkHeFCwmCM
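To see why both sequence fragments satisfy the signature, the PROSITE pattern can be translated into a regular expression. This is a minimal sketch that handles only a subset of PROSITE syntax (literal residues, `x`, bracketed alternatives, braced exclusions and repeat counts); the function name is an assumption and the translator is not a complete PROSITE parser.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, lo, hi = m.groups()
        if core == "x":
            core = "."  # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {..} means any residue except these
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"  # (n) or (n,m) repeats
        parts.append(core)
    return "".join(parts)


signature = prosite_to_regex("C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]")
print(signature)
for fragment in ("CkHdFCwmCL", "CkHeFCwmCM"):
    # Lowercase letters on the slide mark the variable positions; compare in uppercase.
    print(fragment, bool(re.search(signature, fragment.upper())))
```

Both fragments match: the fixed C, H, C, C positions are present, F falls in [LIVMFY], and the final L or M falls in [LIVMYA].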

  42. This week in lab • Using BLASTP & MSA to predict functional homologs in other species • Compare results of BLASTP, PHI-BLAST and Delta-BLAST to identify homologs in other species • Using PSI-BLAST to identify remote homologs of proteins with no known domains
