Fast Sequence Search Multiple Sequence Alignment

Fast Sequence SearchMultiple Sequence Alignment Xiaole Shirley Liu STAT115/STAT215, 2010

Outline • Fast sequence search • BLAST, statistical significance • BLAST programs • BLAT • Global MSA • ClustalW • ClustalW features • ClustalW example STAT115

Fast Sequence Similarity Search Query Sequence DB • Uses: • Map a sequence to sequenced genome • Infer unknown sequence function • Find family of proteins in an organism • Find homolog/ortholog in other organisms • Find sequence mutations or variations (SNP) STAT115

Fast Similar Sequence Search • Can we run Smith-Waterman between query and every DB sequence? • Yes, but too slow! • General approach • Break query and DB sequence to match subsequences • Extend the matched subsequences, filter hopeless sequences • Use dynamic programming to get optimal alignment STAT115

BLAST • Basic Local Alignment Search Tool • Altschul et al.J Mol Biol. 1990 • One of the most widely used bioinformatics applications • Alignment quality not as good as Smith-Waterman • But much faster, supported at NCBI with big computer cluster • For tutorials or information: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html STAT115

BLAST Algorithm Steps • Query and DB sequences are optionally filtered to remove low-complexity regions • E.g. ACACACACA, TTTTTTTTT STAT115

BLAST Algorithm Steps • Query and DB sequences are optionally filtered to remove low-complexity regions • Break DB sequences into k-mer words and hash their locations to speed later searches • k is usually 11 for DNA/RNA and 3 for protein LPPQGLL LPP PPQ PQG QGL GLL STAT115

BLAST Algorithm Steps • Query and DB sequences are optionally filtered to remove low-complexity regions • Break DB sequences into k-mer words and hash their locations to speed later searches • Each k-mer in query find possible k-mers that matches well with it • “well” is evaluated by substitution matrices STAT115

Scoring K-mer Matches P E G P Q G 7 + 2 + 6 = 15 BLOSUM62 STAT115

BLAST Algorithm Steps • Only words with  T cutoff score is kept • T is usually 11-13, ~ 50 words make T cutoff • Note: this is 50 words at every query position • For each DB sequence with a high scoring word, try to extend it in both ends Query: LP PQG LL DB seq: MP PEG LL HSP score 9 + 15 + 8 = 32 • Form HSP (High-scoring Segment Pairs) • Use BLOSUM to score the extended alignment • No gaps allowed STAT115

BLAST Algorithm Steps • Keep only statistically significant HSPs • Based on the scores of aligning 2 random seqs • Use Smith-Waterman algorithm to join the HSPs and get optimal alignment • Gaps are allowed default (-11, -1) STAT115

Statistical Significance Probability that a random alignment gets score like this or better  pvalue • Local similarity scores follow extreme value distribution (Altschul et al, Nat. Genet. 1994) STAT115

Digression: hypothesis testing • Null hypothesis H0 (“nothing special”) • Like a “defendant” presumed innocent • Alternative hypothesis HA • Proven guilty if overwhelming evidences are present STAT115

Two Sample t-test • Statistical significance in the two sample problem Group 1: X1, X2, … Xn1 Group 2: Y1, Y2, … Yn2 • If Xi ~ Normal (μ1, σ12), Yi ~ Normal (μ2, σ22) • Null hypothesis of μ1= μ2 • Use Welch-t statistic • Check T table for p-val • A gene with small p-val (very big or small t) • Reject null • Significant difference between normal and MM Tongji 2009

Permutation Test • Non-parametric method for p-val calculation • Do not assume normal expression distribution • Do not assume the two groups have equal variance • Randomly permute sample label, calculate t to form the empirical null t distribution • For MM-study, (14 choose 5) = 2002 different t values from permutation • If the observed t extremely high/low  differential expression with statistical significance Tongji 2009

Permutation Technique Compute T0 Compute T1 Compute T2 Compute T3 Compare T0 to T* set Tongji 2009

Statistical Significance Probability that a random alignment gets score like this or better  pvalue • Local similarity scores follow extreme value distribution (Altschul et al, Nat. Genet. 1994) STAT115

Statistical Significance • Actual alignment score S can be normalized • m, n are query and DB length • K,  are constants • Depends on substitution matrix and sequence composition • For typical amino acid and PAM250 matrix K = 0.09,  = 0.229 STAT115

Statistical Significance • Normalized score s can be used to get p-value • When x > 2, probability can be approximated • Another quick check, raw score S/3 STAT115

BLAST Reporting • Report DB sequences above a threshold • E value: Number (instead of probability  pvalue) of matches expected merely by chance • Usually [0.05, 10] threshold • Smaller E, more stringent • User selected (just for display): e.g. top 10, 50, 100 STAT115

Different BLAST Programs • If query is DNA, but known to be coding (e.g. cDNA) • Translate cDNA into protein • Zero gap-extension penalty STAT115

Seq2 Seq3 Seq1 Seq4 Psi-BLAST • Position Specific Iterative BLAST • Align high scoring hits in initial BLAST to construct a profile for the hits • Use profile for next iteration BLAST • Find remote homologs or protein families • FP sequences can degrade search quickly Query STAT115

Reciprocal Blast • Search for orthologous sequences between two species • Orthologs: genes related by vertical descent from a common ancestor and encode proteins with the same function in different species • Paralogs: homologous genes evolved by duplication and code for protein with similar but not identical functions • Finding the correct orthologous sequence is very important in comparative genomics STAT115

Reciprocal Blast • Search for orthologous sequences between two species • Orthologs: genes related by vertical descent from a common ancestor and encode proteins with the same function in different species • GeneA in Species1 BLAST Species2 GeneB • GeneB in Species2 BLAST Species1  GeneA • GeneAGeneB • Also called bi-directional best hit orthologous STAT115

BLAT • BLAST-Like Alignment Tool • Compare to BLAST, BLAT can align much longer regions (MB) really fast with little resources • E.g. can map a sequence to the genome in seconds on one Linux computer • Allow big gaps (mRNA to genome) • Need higher similarity (> 95% for DNA and 80% for proteins) for aligned sequences • Basic approach • Break long sequence into blocks • Index k-mers, typically 8-13 • Stitch blocks together for final alignment STAT115

BLAT: Indexing Genome:cacaattatcacgaccgc 3-mers: cac aat tat cac gac cgc Index: aat 3 gac 12 cac 0,9 tat 6 cgc 15 cDNA (mRNA -> DNA): aattctcac 3-mers: aat att ttc tct ctc tca cac 0 1 2 3 4 5 6 hits: aat 0,3 -3 cac 6,0 6 cac 6,9 -3 clump: cacAATtatCACgaccgc STAT115

BLAT Example • Get result instantly!! STAT115

Multiple Sequence Alignment • MSA Uses: • Establish evolutionary relationships (global) • Find conserved nucleotides and amino acids (global) • Characterize signature protein patterns or motifs (local) • Find acceptable substitutions (local) • Protein MSA gold standard: structural alignment STAT115

Progressive MSA Method • Progressive: • Heuristic algorithm: approximation strategy, do not aim at perfect • Build alignment with most related sequences, progressively add less-related to the alignment • Often manual examination can improve alignments • ClustalW, NAR 1994 • W stands for weighting: more distant seqs weigh more • Reflect evolutionary distance STAT115

ClustalW Steps • Global pairwise alignment for all pairs • Calculate pairwise sequence distances • Approximate evolutionary distance STAT115

C A 3 1 2 1 1 D B ClustalW Steps • Construct a tree based on sequence distances • E.g. solve the following matrix A B C D A 4 6 2 B 4 4 C 6 D STAT115

C A 3 1 2 E F 1 1 D B ClustalW Steps • Progressively add sequences/alignments by the tree order • Starting from the smallest distance • Add seq to seq, seq to align, align to align • AD form new node E, calc AE, DE distance • Calc E consensus, weighted by AE DE distance • Calc B, C, E pairwise distance • BE form new node F… STAT115

ClustalW Features: Consensus • Consensus is used to represent the aligned sequences; if different, find AA to maximize score • Final score weighted based on branch length • Weight for A = a = 0.2 + 0.3 / 2 + 0.3/3 = 0.45 • Weight for B = b = 0.1 + 0.3 / 2 + 0.3/3 = 0.35 • Weight for C = c = 0.5 + 0.3/3 = 0.6 0.2 A B C 0.3 0.1 0.3 0.5 STAT115

Scoring an Alignment Sequence A (weight a) …K… Sequence B (weight b) …I… Merge and align to Sequence C (weight c) …L… Sequence D (weight d) …V… Score for aligning the column [a  c  Score(K,L) + a  d  Score(K,V) + b  c  Score(I,L) + b  d  Score(I,V)] / 4 STAT115

ClustalW Features: Gaps • Sequence specific gap penalties • Penalize gaps more in segments that are less likely to have gaps STAT115

Progressive Alignment Limitations • Gaps can proliferate, if not careful Align1: ABCD-E ABC-D-E Align2: ABC-DE ABC-D-E • Need many heuristic parameters • Does not guarantee global optimum • Errors in initial alignments are propagated • Manual improvements: • Shift residues from one side of gap to the other • Reduce gaps STAT115

ClustalW Alignment * - identical : - conserved . - semi-conserved STAT115

ClustalW Tree Branch length ~ distance 0.02318 0.41596 0.10523 0.01824 0.12694 0.02011 0.01147 STAT115

Summary • Fast sequence similarity search • Break seq, hash DB sub-seq, match sub-seq and extend, use DP for optimal alignment • *BLAST, most widely used, many applications with sound statistical foundations • *BLAT, align sequence to genome, fast yet need higher similarity • Protein global MSA • Progressive heuristic alignment • ClustalW: pairwise, tree, merge alignments • Merge with minimum edit, sequence weighting, sequence/position specific gaps STAT115

Acknowledgment • David Mount • Aoife McLysaght • Ir. Brecht Claerhout STAT115

Fast Sequence Search Multiple Sequence Alignment

Fast Sequence Search Multiple Sequence Alignment

Presentation Transcript

Multiple sequence alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple sequence alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment