PSI-BLAST and Multiple Sequence Alignments

Doug Raiford Lesson 5 PSI-BLAST and Multiple Sequence Alignments

Left off… • Dynamic programming methods • Needleman-Wunsch (global alignment) • Smith-Waterman (local alignment) • BLAST Fixed: best Linear: next best Polynomial (n^2): not bad Exponential (3^n): very bad

But… • BLAST fast (linear) • But not as sensitive Speed Sensitivity

How to improve sensitivity? • Similarity matrix • Especially with amino acids • Some amino acids have similar chemical characteristics • Similarity to all 8,000 (20^3) possible 3-mers calculated • Usually ~50 are above a threshold • All of these ~50 are considered hits when searching • Matrices • PAM (Point Accepted Mutation) • Built from observed substitution rates in closely related proteins • BLOSUM (BLOcks SUbstitution Matrix) • Built from observed substitution rates in evolutionarily divergent proteins
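The neighborhood-word idea on this slide can be sketched in a few lines. The scoring scheme below is a toy stand-in, not PAM or BLOSUM: the chemical groupings in `GROUPS`, the values returned by `toy_score`, and the threshold are all assumptions chosen for illustration, so the number of hits will differ from the ~50 quoted above.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical chemical groupings, for illustration only.
GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH", "C", "G", "P"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}

def toy_score(a, b):
    """Toy similarity: identical > same chemical group > unrelated."""
    if a == b:
        return 5
    if GROUP_OF[a] == GROUP_OF[b]:
        return 2
    return -2

def neighborhood(word, threshold):
    """Return every word of the same length scoring >= threshold against `word`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append("".join(candidate))
    return hits

total_3mers = len(AMINO_ACIDS) ** 3      # 20^3 = 8,000 candidate words
hits = neighborhood("LKD", threshold=9)  # words treated as hits for the query word LKD
print(total_3mers, len(hits))
```

With a real matrix, `toy_score` would be replaced by a lookup into the PAM or BLOSUM table; the enumeration over all 8,000 words stays the same.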

Build own matrix on the fly • PSI-BLAST (Position Specific Iterative) • Align using default similarity matrix • At each query location build a Position Specific Scoring Matrix (PSSM) based upon observed search and alignment results • Repeat with new matrix until results no longer change PSI-BLAST Build sensitivity by specifying allowed similarity at each position Slower, but still faster than local alignment
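A minimal sketch of the PSSM-building step, assuming the hits have already been aligned to the query without gaps in the query. The function names, the uniform background, and the flat pseudocount are assumptions for illustration; the real PSI-BLAST uses sequence weighting and a more careful pseudocount scheme, and the search step of each iteration is omitted here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_hits, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build a log-odds matrix: one {residue: score} dict per query position."""
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    pssm = []
    for col in range(len(aligned_hits[0])):
        # Count residues observed at this position, ignoring gaps.
        counts = Counter(s[col] for s in aligned_hits if s[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        column = {}
        for residue in alphabet:
            freq = (counts[residue] + pseudocount) / total
            column[residue] = math.log2(freq / background)
        pssm.append(column)
    return pssm

def score(pssm, segment):
    """Score an ungapped segment by summing its position-specific scores."""
    return sum(column[residue] for column, residue in zip(pssm, segment))

hits = ["MKV", "MRV", "MKI"]   # toy aligned hits from a first search pass
pssm = build_pssm(hits)
print(score(pssm, "MKV"), score(pssm, "WWW"))
```

In a full iteration, the new matrix would replace the default similarity matrix for the next search, and the loop would stop once the set of hits stops changing.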

Importance of sequence alignment • Central to bioinformatics • Need for • Phylogeny • Protein function • Protein structure • Structure → function • Drug discovery

Conserved regions • Some parts of proteins are very important to maintain function • Must be similar from species to species • Can we spot these regions through alignment? atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag acctcgatacgtgccgcaggagatcaggactttcacct--tggatcatgcgaccgtacctac
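One simple way to spot a conserved region in a pairwise alignment is to look for the longest run of identical columns. The sketch below applies that idea to the two aligned sequences on this slide; the function name is hypothetical, and real conservation analysis would usually tolerate a few mismatches rather than require exact identity.

```python
def longest_conserved_run(seq_a, seq_b):
    """Return (start, end) of the longest run of identical, non-gap columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    best = (0, 0)
    run_start = None
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a == b and a != "-":
            if run_start is None:
                run_start = i          # a new run of matches begins here
            if i + 1 - run_start > best[1] - best[0]:
                best = (run_start, i + 1)
        else:
            run_start = None           # mismatch or gap ends the run
    return best

seq1 = "atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag"
seq2 = "acctcgatacgtgccgcaggagatcaggactttcacct--tggatcatgcgaccgtacctac"
start, end = longest_conserved_run(seq1, seq2)
print(start, end, seq1[start:end])
```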

Why is this important? • Often conserved regions are near active sites • Ligand binding sites (docking) • Protein-to-protein interface • Important regions for tertiary structure Ligand: small molecule, target of protein, e.g. O2 is the ligand for hemoglobin Substrate: a molecule upon which an enzyme acts

How can we improve detection? • What if we look at more proteins • Increase our confidence? • But how to go about performing multiple sequence alignment? atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag acctccatacgtgccccaggagatctggactttcacc---tggatcatgcgaccgtacctac t-atgg-t-cgtgccgcaggagatcaggactttca-gt--g-aatcatctgg-cgc--c-aa t--tcgt-ac-tgccccaggagatctggactttcaaa---ca-atcatgcgcc-g-tc-tat aattccgtacgtgccgcaggagatcaggactttcag-t--a-tatcatctgtc-ggc--tag

Exhaustively • Hyper-dimensional dynamic programming • Becomes exponential with respect to number of sequences • O(n^L) with n = sequence length and L = number of sequences
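To get a feel for how quickly the exhaustive approach grows, the sketch below counts the cells in the dynamic programming table, assuming every sequence has the same length n (each sequence contributes one axis of n + 1 entries). The function names are hypothetical. It also counts the predecessor cells examined per cell, since each step may advance any non-empty subset of the sequences.

```python
def dp_cells(n, num_sequences):
    """Cells in the table: one axis of n + 1 entries per sequence."""
    return (n + 1) ** num_sequences

def predecessors_per_cell(num_sequences):
    """Each step advances a non-empty subset of the sequences."""
    return 2 ** num_sequences - 1

for count in (2, 3, 5, 10):
    print(count, dp_cells(100, count), predecessors_per_cell(count))
```

Two sequences of length 100 need about ten thousand cells; ten such sequences need more than 10^20, which is why exhaustive multiple alignment is impractical beyond a handful of sequences.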

Progressive approach • Determine all pair-wise distances • Fast: number of l-mer matches • Slower: full global alignments • Start with the closest pair and align them • Then align the next closest to those two • And so on... ClustalW: cluster-alignment
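The ordering step of the progressive approach can be sketched as follows, using shared l-mers as the fast distance measure. This is a simplification under stated assumptions: the distance (1 minus the Jaccard similarity of l-mer sets) and the greedy chain are illustrative choices, all names are hypothetical, and ClustalW itself builds a guide tree rather than a simple chain. The alignment step is omitted; only the order in which sequences would be added is computed.

```python
from itertools import combinations

def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_distance(a, b, k=3):
    """Fast distance: 1 - (shared k-mers / all k-mers seen in either sequence)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)

def progressive_order(seqs, k=3):
    """Order in which sequences (at least two) would join the alignment."""
    names = list(seqs)
    dist = {frozenset(pair): kmer_distance(seqs[pair[0]], seqs[pair[1]], k)
            for pair in combinations(names, 2)}
    closest_pair = min(dist, key=dist.get)   # start with the closest pair
    order = sorted(closest_pair)
    remaining = [n for n in names if n not in closest_pair]
    while remaining:
        # Next: the sequence closest to anything already in the alignment.
        nxt = min(remaining,
                  key=lambda n: min(dist[frozenset((n, m))] for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

seqs = {"a": "atgccgcaact", "b": "atgccgcaacg", "c": "ttttttttttt"}
print(progressive_order(seqs))
```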

Aligning to a set of previously aligned sequences • Profile: matrix of real values, representing the probability of amino acids at each position in a corresponding multiple sequence alignment • A modification of the Smith/Waterman algorithm • Degree to which an aa is preferred is the degree of match between the profile and the sequence Consensus 1 M.ERS.HLPEG.PFAAALSGARFAAQSSGN.ASVL..DWNVLP.E 38 | : : : || : ::::: : |: | ::|: : | : OPSD_XENLA 1 MNG.GTE..EGPN.NFYVP.PMS...SN.NKTGVVRSP.P..PFD 33

Issues • Mistakes early in a progressive approach propagated throughout process • Once aligned not revisited • Iterative methods devised to revisit • Newest version of ClustalW (version 2) includes iteration • Other MSA apps • T-Coffee • PSalign • DIALIGN • MUSCLE

Visualizing with a motif logo • Height of letter represents how prevalent that letter is at that position
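Letter heights for a logo can be computed directly from the alignment columns. The sketch below assumes the common sequence-logo convention in which a letter's height is its frequency in the column multiplied by the column's information content in bits; it omits the small-sample correction that logo tools often apply, and the function name is hypothetical.

```python
import math
from collections import Counter

def logo_heights(columns, alphabet="acgt"):
    """For each alignment column, return {letter: height in bits}."""
    max_bits = math.log2(len(alphabet))   # 2 bits for DNA
    heights = []
    for column in columns:
        counts = Counter(ch for ch in column if ch in alphabet)  # skip gaps
        total = sum(counts.values())
        if total == 0:
            heights.append({})            # column contains only gaps
            continue
        freqs = {ch: n / total for ch, n in counts.items()}
        entropy = -sum(p * math.log2(p) for p in freqs.values())
        information = max_bits - entropy  # conserved columns carry more bits
        heights.append({ch: p * information for ch, p in freqs.items()})
    return heights

rows = ["tgccg", "tgccc", "tgcag", "tgctg"]      # toy aligned sequences
columns = ["".join(col) for col in zip(*rows)]   # transpose rows to columns
for position, column_heights in enumerate(logo_heights(columns)):
    print(position, column_heights)
```

A fully conserved column yields one letter at the maximum height, while a column with all letters equally frequent yields heights of zero.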

Bit Scores • Scores are affected by sequence lengths • If want scores that can be compared across different query lengths need to normalize • Term “bit” comes from fact that probabilities are stored as log2 values (binary, bit) • Done so can add across length of sequence instead of multiply Database Searches
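The reason for working in log2 can be checked numerically: adding per-position log-odds values gives the same answer as multiplying the underlying probability ratios and taking the log once. The sketch also includes the standard conversion of a raw alignment score S to a bit score, S' = (lambda * S - ln K) / ln 2, where lambda and K are statistical parameters of the scoring system; the probabilities used in the example are made up for illustration.

```python
import math

def log_odds_bits(p_observed, p_background):
    """Log-odds of one position, in bits."""
    return math.log2(p_observed / p_background)

def bit_score(raw_score, lam, k):
    """Normalize a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Hypothetical (observed, background) probabilities for three positions.
positions = [(0.5, 0.25), (0.25, 0.25), (0.125, 0.25)]

sum_of_bits = sum(log_odds_bits(o, b) for o, b in positions)
log_of_product = math.log2(math.prod(o / b for o, b in positions))
print(sum_of_bits, log_of_product)
```

Adding logs also avoids the numerical underflow that comes from multiplying many small probabilities along a long sequence.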
