410 likes | 558 Views
Fast Sequence Search Multiple Sequence Alignment. Xiaole Shirley Liu STAT115/STAT215, 2010. Outline. Fast sequence search BLAST , statistical significance BLAST programs BLAT Global MSA ClustalW ClustalW features ClustalW example. Fast Sequence Similarity Search. Query. Sequence DB.
E N D
Fast Sequence SearchMultiple Sequence Alignment Xiaole Shirley Liu STAT115/STAT215, 2010
Outline • Fast sequence search • BLAST, statistical significance • BLAST programs • BLAT • Global MSA • ClustalW • ClustalW features • ClustalW example STAT115
Fast Sequence Similarity Search Query Sequence DB • Uses: • Map a sequence to sequenced genome • Infer unknown sequence function • Find family of proteins in an organism • Find homolog/ortholog in other organisms • Find sequence mutations or variations (SNP) STAT115
Fast Similar Sequence Search • Can we run Smith-Waterman between query and every DB sequence? • Yes, but too slow! • General approach • Break query and DB sequence to match subsequences • Extend the matched subsequences, filter hopeless sequences • Use dynamic programming to get optimal alignment STAT115
BLAST • Basic Local Alignment Search Tool • Altschul et al.J Mol Biol. 1990 • One of the most widely used bioinformatics applications • Alignment quality not as good as Smith-Waterman • But much faster, supported at NCBI with big computer cluster • For tutorials or information: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html STAT115
BLAST Algorithm Steps • Query and DB sequences are optionally filtered to remove low-complexity regions • E.g. ACACACACA, TTTTTTTTT STAT115
BLAST Algorithm Steps • Query and DB sequences are optionally filtered to remove low-complexity regions • Break DB sequences into k-mer words and hash their locations to speed later searches • k is usually 11 for DNA/RNA and 3 for protein LPPQGLL LPP PPQ PQG QGL GLL STAT115
BLAST Algorithm Steps • Query and DB sequences are optionally filtered to remove low-complexity regions • Break DB sequences into k-mer words and hash their locations to speed later searches • Each k-mer in query find possible k-mers that matches well with it • “well” is evaluated by substitution matrices STAT115
Scoring K-mer Matches P E G P Q G 7 + 2 + 6 = 15 BLOSUM62 STAT115
BLAST Algorithm Steps • Only words with T cutoff score is kept • T is usually 11-13, ~ 50 words make T cutoff • Note: this is 50 words at every query position • For each DB sequence with a high scoring word, try to extend it in both ends Query: LP PQG LL DB seq: MP PEG LL HSP score 9 + 15 + 8 = 32 • Form HSP (High-scoring Segment Pairs) • Use BLOSUM to score the extended alignment • No gaps allowed STAT115
BLAST Algorithm Steps • Keep only statistically significant HSPs • Based on the scores of aligning 2 random seqs • Use Smith-Waterman algorithm to join the HSPs and get optimal alignment • Gaps are allowed default (-11, -1) STAT115
Statistical Significance Probability that a random alignment gets score like this or better pvalue • Local similarity scores follow extreme value distribution (Altschul et al, Nat. Genet. 1994) STAT115
Digression: hypothesis testing • Null hypothesis H0 (“nothing special”) • Like a “defendant” presumed innocent • Alternative hypothesis HA • Proven guilty if overwhelming evidences are present STAT115
Two Sample t-test • Statistical significance in the two sample problem Group 1: X1, X2, … Xn1 Group 2: Y1, Y2, … Yn2 • If Xi ~ Normal (μ1, σ12), Yi ~ Normal (μ2, σ22) • Null hypothesis of μ1= μ2 • Use Welch-t statistic • Check T table for p-val • A gene with small p-val (very big or small t) • Reject null • Significant difference between normal and MM Tongji 2009
Permutation Test • Non-parametric method for p-val calculation • Do not assume normal expression distribution • Do not assume the two groups have equal variance • Randomly permute sample label, calculate t to form the empirical null t distribution • For MM-study, (14 choose 5) = 2002 different t values from permutation • If the observed t extremely high/low differential expression with statistical significance Tongji 2009
Permutation Technique Compute T0 Compute T1 Compute T2 Compute T3 Compare T0 to T* set Tongji 2009
Statistical Significance Probability that a random alignment gets score like this or better pvalue • Local similarity scores follow extreme value distribution (Altschul et al, Nat. Genet. 1994) STAT115
Statistical Significance • Actual alignment score S can be normalized • m, n are query and DB length • K, are constants • Depends on substitution matrix and sequence composition • For typical amino acid and PAM250 matrix K = 0.09, = 0.229 STAT115
Statistical Significance • Normalized score s can be used to get p-value • When x > 2, probability can be approximated • Another quick check, raw score S/3 STAT115
BLAST Reporting • Report DB sequences above a threshold • E value: Number (instead of probability pvalue) of matches expected merely by chance • Usually [0.05, 10] threshold • Smaller E, more stringent • User selected (just for display): e.g. top 10, 50, 100 STAT115
Different BLAST Programs • If query is DNA, but known to be coding (e.g. cDNA) • Translate cDNA into protein • Zero gap-extension penalty STAT115
Seq2 Seq3 Seq1 Seq4 Psi-BLAST • Position Specific Iterative BLAST • Align high scoring hits in initial BLAST to construct a profile for the hits • Use profile for next iteration BLAST • Find remote homologs or protein families • FP sequences can degrade search quickly Query STAT115
Reciprocal Blast • Search for orthologous sequences between two species • Orthologs: genes related by vertical descent from a common ancestor and encode proteins with the same function in different species • Paralogs: homologous genes evolved by duplication and code for protein with similar but not identical functions • Finding the correct orthologous sequence is very important in comparative genomics STAT115
Reciprocal Blast • Search for orthologous sequences between two species • Orthologs: genes related by vertical descent from a common ancestor and encode proteins with the same function in different species • GeneA in Species1 BLAST Species2 GeneB • GeneB in Species2 BLAST Species1 GeneA • GeneAGeneB • Also called bi-directional best hit orthologous STAT115
BLAT • BLAST-Like Alignment Tool • Compare to BLAST, BLAT can align much longer regions (MB) really fast with little resources • E.g. can map a sequence to the genome in seconds on one Linux computer • Allow big gaps (mRNA to genome) • Need higher similarity (> 95% for DNA and 80% for proteins) for aligned sequences • Basic approach • Break long sequence into blocks • Index k-mers, typically 8-13 • Stitch blocks together for final alignment STAT115
BLAT: Indexing Genome:cacaattatcacgaccgc 3-mers: cac aat tat cac gac cgc Index: aat 3 gac 12 cac 0,9 tat 6 cgc 15 cDNA (mRNA -> DNA): aattctcac 3-mers: aat att ttc tct ctc tca cac 0 1 2 3 4 5 6 hits: aat 0,3 -3 cac 6,0 6 cac 6,9 -3 clump: cacAATtatCACgaccgc STAT115
BLAT Example • Get result instantly!! STAT115
Multiple Sequence Alignment • MSA Uses: • Establish evolutionary relationships (global) • Find conserved nucleotides and amino acids (global) • Characterize signature protein patterns or motifs (local) • Find acceptable substitutions (local) • Protein MSA gold standard: structural alignment STAT115
Progressive MSA Method • Progressive: • Heuristic algorithm: approximation strategy, do not aim at perfect • Build alignment with most related sequences, progressively add less-related to the alignment • Often manual examination can improve alignments • ClustalW, NAR 1994 • W stands for weighting: more distant seqs weigh more • Reflect evolutionary distance STAT115
ClustalW Steps • Global pairwise alignment for all pairs • Calculate pairwise sequence distances • Approximate evolutionary distance STAT115
C A 3 1 2 1 1 D B ClustalW Steps • Construct a tree based on sequence distances • E.g. solve the following matrix A B C D A 4 6 2 B 4 4 C 6 D STAT115
C A 3 1 2 E F 1 1 D B ClustalW Steps • Progressively add sequences/alignments by the tree order • Starting from the smallest distance • Add seq to seq, seq to align, align to align • AD form new node E, calc AE, DE distance • Calc E consensus, weighted by AE DE distance • Calc B, C, E pairwise distance • BE form new node F… STAT115
ClustalW Features: Consensus • Consensus is used to represent the aligned sequences; if different, find AA to maximize score • Final score weighted based on branch length • Weight for A = a = 0.2 + 0.3 / 2 + 0.3/3 = 0.45 • Weight for B = b = 0.1 + 0.3 / 2 + 0.3/3 = 0.35 • Weight for C = c = 0.5 + 0.3/3 = 0.6 0.2 A B C 0.3 0.1 0.3 0.5 STAT115
Scoring an Alignment Sequence A (weight a) …K… Sequence B (weight b) …I… Merge and align to Sequence C (weight c) …L… Sequence D (weight d) …V… Score for aligning the column [a c Score(K,L) + a d Score(K,V) + b c Score(I,L) + b d Score(I,V)] / 4 STAT115
ClustalW Features: Gaps • Sequence specific gap penalties • Penalize gaps more in segments that are less likely to have gaps STAT115
Progressive Alignment Limitations • Gaps can proliferate, if not careful Align1: ABCD-E ABC-D-E Align2: ABC-DE ABC-D-E • Need many heuristic parameters • Does not guarantee global optimum • Errors in initial alignments are propagated • Manual improvements: • Shift residues from one side of gap to the other • Reduce gaps STAT115
ClustalW Alignment * - identical : - conserved . - semi-conserved STAT115
ClustalW Tree Branch length ~ distance 0.02318 0.41596 0.10523 0.01824 0.12694 0.02011 0.01147 STAT115
Summary • Fast sequence similarity search • Break seq, hash DB sub-seq, match sub-seq and extend, use DP for optimal alignment • *BLAST, most widely used, many applications with sound statistical foundations • *BLAT, align sequence to genome, fast yet need higher similarity • Protein global MSA • Progressive heuristic alignment • ClustalW: pairwise, tree, merge alignments • Merge with minimum edit, sequence weighting, sequence/position specific gaps STAT115
Acknowledgment • David Mount • Aoife McLysaght • Ir. Brecht Claerhout STAT115