Doug Raiford Lesson 5: PSI-BLAST and Multiple Sequence Alignments
Left off… • Dynamic programming methods • Needleman-Wunsch (global alignment) • Smith-Waterman (local alignment) • BLAST Running time: Fixed: best; Linear: next best; Polynomial (n^2): not bad; Exponential (3^n): very bad
But… • BLAST is fast (linear) • But not as sensitive • Trade-off: speed vs. sensitivity
How to improve sensitivity? • Similarity matrix • Especially with amino acids • Some amino acids have similar chemical characteristics • Similarity of each query 3-mer to all 8,000 (20^3) possible 3-mers is calculated • Usually ~50 are above a threshold • All of these ~50 are considered hits when searching • Matrices • PAM (Point Accepted Mutation) • Built from observed substitution rates in closely related proteins • BLOSUM (BLOcks SUbstitution Matrix) • Built from observed substitution rates in evolutionarily divergent proteins
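The 3-mer neighborhood idea above can be sketched in a few lines. This is only an illustration: `toy_score` and the chemical groupings in `GROUPS` are made-up stand-ins for a real PAM or BLOSUM matrix, and the threshold of 9 is arbitrary, so the hit count will not match the ~50 quoted on the slide.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy chemical groupings (an assumption for illustration, not PAM/BLOSUM values).
GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH", "C", "G", "P"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}

def toy_score(a, b):
    """Hypothetical similarity: identical > same chemical group > unrelated."""
    if a == b:
        return 5
    if GROUP_OF[a] == GROUP_OF[b]:
        return 2
    return -2

def neighborhood(word, threshold):
    """Score `word` against every possible word of the same length and
    keep those scoring at or above `threshold`, best first."""
    hits = []
    for cand in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, cand))
        if score >= threshold:
            hits.append(("".join(cand), score))
    return sorted(hits, key=lambda h: -h[1])

all_words = neighborhood("LKD", threshold=-100)  # every 3-mer passes
hits = neighborhood("LKD", threshold=9)          # only the similar ones
print(len(all_words), len(hits), hits[:3])
```

With a real matrix, each word in `hits` would be looked up in the database index alongside the exact query word.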
Build own matrix on the fly • PSI-BLAST (Position-Specific Iterated BLAST) • Align using the default similarity matrix • At each query position build a Position-Specific Scoring Matrix (PSSM) based upon the observed search and alignment results • Repeat with the new matrix until the results no longer change PSI-BLAST: builds sensitivity by specifying the allowed similarity at each position. Slower than plain BLAST, but still faster than local alignment
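A minimal sketch of the "build a PSSM from the hits" step, assuming the hits are already aligned to the query as equal-length strings. The function name `build_pssm`, the uniform background, and the pseudocount of 1 are illustrative choices, not PSI-BLAST's actual weighting scheme.

```python
import math
from collections import Counter

def build_pssm(aligned_hits, alphabet, background=None, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    if background is None:
        # Assumption: uniform background frequencies.
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    pssm = []
    for column in zip(*aligned_hits):
        # Count residues in this column; gap characters are skipped.
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background[a])
            for a in alphabet
        })
    return pssm

hits = ["ACGT", "ACGA", "ACCT", "A-GT"]  # toy aligned hits
pssm = build_pssm(hits, "ACGT")
for position, scores in enumerate(pssm):
    print(position, {a: round(s, 2) for a, s in scores.items()})
```

In an iterative search, the returned matrix would replace the default similarity matrix for the next round, and the loop would stop once the set of hits stops changing.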
Importance of sequence alignment • Central to bioinformatics • Needed for • Phylogeny • Protein function • Protein structure • Structure-function relationships • Drug discovery
Conserved regions • Some parts of proteins are very important to maintain function • Must be similar from species to species • Can we spot these regions through alignment?
atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctcgatacgtgccgcaggagatcaggactttcacct--tggatcatgcgaccgtacctac
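One simple way to spot conserved regions in a pairwise alignment is to look for long runs of identical columns. The sketch below applies that to the two aligned rows from this slide; `conserved_runs` and the `min_len` cutoff are hypothetical names and values chosen for illustration.

```python
def conserved_runs(seq_a, seq_b, min_len=8):
    """Return (start, end) column ranges, end exclusive, where the two
    aligned rows are identical (gap columns never count as conserved)."""
    n = min(len(seq_a), len(seq_b))
    runs, start = [], None
    for i in range(n):
        same = seq_a[i] == seq_b[i] and seq_a[i] != "-"
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the alignment.
    if start is not None and n - start >= min_len:
        runs.append((start, n))
    return runs

seq_a = "atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag"
seq_b = "acctcgatacgtgccgcaggagatcaggactttcacct--tggatcatgcgaccgtacctac"

runs = conserved_runs(seq_a, seq_b, min_len=8)
for start, end in runs:
    print(start, end, seq_a[start:end])
```

On these two rows the longest run is the 24-column block `tgccgcaggagatcaggactttca`, followed by a shorter 8-column block `atcatgcg`.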
Why is this important? • Often conserved regions are near active sites • Ligand binding sites (docking) • Protein-to-protein interfaces • Important regions for tertiary structure Ligand: a small molecule that a protein binds, e.g. O2 is the ligand for hemoglobin Substrate: a molecule upon which an enzyme acts
How can we improve detection? • What if we look at more proteins? • Increase our confidence? • But how to go about performing multiple sequence alignment?
atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctccatacgtgccccaggagatctggactttcacc---tggatcatgcgaccgtacctac
t-atgg-t-cgtgccgcaggagatcaggactttca-gt--g-aatcatctgg-cgc--c-aa
t--tcgt-ac-tgccccaggagatctggactttcaaa---ca-atcatgcgcc-g-tc-tat
aattccgtacgtgccgcaggagatcaggactttcag-t--a-tatcatctgtc-ggc--tag
Exhaustively • Hyper-dimensional dynamic programming • Becomes exponential with respect to the number of sequences • O(n^L) with n = sequence length and L = number of sequences
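To get a feel for how quickly the exhaustive approach blows up, the following sketch counts the cells in the L-dimensional dynamic programming table for sequences of length n. The helper name `exhaustive_msa_cost` is hypothetical; the per-cell factor of 2^L - 1 reflects the usual formulation in which each cell looks back at every non-empty subset of sequences that could advance by one residue.

```python
def exhaustive_msa_cost(n, L):
    """Return (table cells, cell updates) for aligning L sequences of length n."""
    cells = (n + 1) ** L   # one cell per combination of prefix lengths
    moves = 2 ** L - 1     # predecessors examined when filling each cell
    return cells, cells * moves

for L in (2, 3, 5, 10):
    cells, work = exhaustive_msa_cost(100, L)
    print(f"L={L:2d}  cells={cells:.3e}  updates={work:.3e}")
```

Two sequences of length 100 need about ten thousand cells; ten such sequences need more than 10^20.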
Progressive approach • Determine all pair-wise distances • Fast: number of l-mer matches • Slower: full global alignments • Start with the closest pair and align them • Then align the next closest sequence to those two • And so on… ClustalW: cluster-alignment
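The ordering step of the progressive approach can be sketched with the fast l-mer distance. This is a simplification that follows the slide's description literally (closest pair first, then the next closest sequence, and so on); ClustalW itself builds a guide tree from the distances. The names `kmer_distance` and `join_order` and the toy sequences are assumptions, and every sequence is assumed to be at least `k` residues long.

```python
from itertools import combinations

def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_distance(a, b, k=3):
    """1 minus the fraction of shared k-mers (relative to the smaller set)."""
    ka, kb = kmers(a, k), kmers(b, k)
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))

def join_order(seqs, k=3):
    """Order in which sequences would be added to the growing alignment."""
    names = list(seqs)
    dist = {frozenset(pair): kmer_distance(seqs[pair[0]], seqs[pair[1]], k)
            for pair in combinations(names, 2)}
    first = min(dist, key=dist.get)           # closest pair starts the alignment
    order = sorted(first)
    remaining = [n for n in names if n not in first]
    while remaining:
        # Next: the sequence closest to anything already aligned.
        nxt = min(remaining,
                  key=lambda n: min(dist[frozenset((n, m))] for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

toy = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAA",
       "s3": "ACGTTTTTTT", "s4": "GGGGGGGGGG"}
print(join_order(toy))
```

Only the order is computed here; each step would still need a pairwise or sequence-to-profile alignment to actually merge the sequences.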
Aligning to a set of previously aligned sequences • Profile: matrix of real values, representing the probability of each amino acid at each position in a corresponding multiple sequence alignment • Aligned with a modification of the Smith-Waterman algorithm • The degree to which an amino acid is preferred at a position is the degree of match between the profile and the sequence
Consensus  1 M.ERS.HLPEG.PFAAALSGARFAAQSSGN.ASVL..DWNVLP.E 38
| : : : || : ::::: : |: | ::|: : | :
OPSD_XENLA 1 MNG.GTE..EGPN.NFYVP.PMS...SN.NKTGVVRSP.P..PFD 33
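A minimal sketch of Smith-Waterman modified to align a sequence against a profile: the only change from the pairwise version is that the match score comes from the profile column instead of a substitution matrix. The profile here is a list of per-column score dictionaries (for example log-odds values); the linear gap penalty, the `unseen` default score, and the toy profile are all assumptions for illustration.

```python
def align_to_profile(seq, profile, gap=-4.0, unseen=-1.0):
    """Best local alignment score of `seq` against `profile`.

    profile[j] maps a residue to its score at column j; residues missing
    from a column get the `unseen` score.
    """
    rows, cols = len(seq) + 1, len(profile) + 1
    H = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            match = H[i - 1][j - 1] + profile[j - 1].get(seq[i - 1], unseen)
            H[i][j] = max(0.0,                # start a new local alignment
                          match,              # residue i aligned to column j
                          H[i - 1][j] + gap,  # residue i aligned to a gap
                          H[i][j - 1] + gap)  # column j aligned to a gap
            best = max(best, H[i][j])
    return best

toy_profile = [{"A": 2.0}, {"C": 3.0}, {"G": 1.0}]
print(align_to_profile("TTACGTT", toy_profile))
print(align_to_profile("TTT", toy_profile))
```

Only the score is returned; recovering the alignment itself would need the usual traceback through `H`.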
Issues • Mistakes made early in a progressive approach are propagated throughout the process • Once aligned, sequences are not revisited • Iterative methods were devised to revisit earlier alignments • Newest version of ClustalW (version 2) includes iteration • Other MSA apps • T-Coffee • PSalign • DIALIGN • MUSCLE
Visualizing with a motif logo • Height of letter represents how prevalent that letter is at that position
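Letter heights for a logo can be computed one alignment column at a time. The sketch below follows the common sequence-logo convention, which is an assumption about the logo on this slide: the whole stack is as tall as the column's information content in bits, and each letter takes a share proportional to its frequency. Gaps are ignored and no small-sample correction is applied.

```python
import math
from collections import Counter

def column_information(column, alphabet="acgt"):
    """Return (information content in bits, {letter: height}) for one column."""
    residues = [c for c in column if c in alphabet]  # ignore gaps
    if not residues:
        return 0.0, {}
    counts = Counter(residues)
    n = len(residues)
    freqs = {a: counts[a] / n for a in alphabet if counts[a]}
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    info = math.log2(len(alphabet)) - entropy  # 2 bits max for DNA
    heights = {a: p * info for a, p in freqs.items()}
    return info, heights

print(column_information("aaaaa"))  # fully conserved column
print(column_information("ccgcg"))  # two letters share the column
print(column_information("acgt"))   # no preference at all
```

A fully conserved DNA column gives a single 2-bit letter, while a column with all four bases equally frequent has zero height.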
Bit Scores • Scores are affected by sequence lengths • To compare scores across different query lengths, they need to be normalized • The term “bit” comes from the fact that probabilities are stored as log2 values (binary, bit) • Done so that values can be added across the length of the sequence instead of multiplied Database Searches
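Both points can be illustrated briefly. The first part shows why log2 values are convenient: adding them gives the same answer as multiplying the probabilities. The second part uses the standard Karlin-Altschul normalization, S' = (λS − ln K) / ln 2, with expected hits E = m·n·2^(−S'). The λ and K values below are placeholders chosen for illustration; real values depend on the scoring matrix and gap penalties.

```python
import math

# Adding log2 probabilities is equivalent to multiplying the probabilities.
probs = [0.5, 0.25, 0.125]
log_sum = sum(math.log2(p) for p in probs)
product = math.prod(probs)
print(log_sum, math.log2(product))

def bit_score(raw_score, lam, K):
    """Normalize a raw alignment score to bits."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def expected_hits(bits, query_len, db_len):
    """Expected number of chance hits scoring at least `bits`."""
    return query_len * db_len * 2 ** (-bits)

LAM, K = 0.267, 0.041  # illustrative values only (assumption)
bits = bit_score(100, LAM, K)
print(round(bits, 1), expected_hits(bits, query_len=300, db_len=1_000_000))
```

Because the bit score already folds in λ and K, two bit scores can be compared directly even when they come from searches that used different scoring systems.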