350 likes | 522 Views
Sequence Alignment July 22, 2009. Charlie Whittaker Bioinformatics and Computing Core Facility The David H. Koch Institute for Integrative Cancer Research charliew@mit.edu http://luria.mit.edu/July_AP_2009/. Alignment and Bioinformatics . Widespread usage Sequence annotation
E N D
Sequence AlignmentJuly 22, 2009 Charlie Whittaker Bioinformatics and Computing Core Facility The David H. Koch Institute for Integrative Cancer Research charliew@mit.edu http://luria.mit.edu/July_AP_2009/
Alignment and Bioinformatics • Widespread usage • Sequence annotation • Homolog identification • Phylogenetic analysis • Hypothesis generation • Polymorphism analysis • Gene expression analysis • Copy number analysis • and others
What is sequence alignment? Local Alignment FNLDAEAPAVLSGPP--DF ||||...|.:..||| || FNLDEHHPRLFPGPPSADF Global Alignment ------------PPRVGGFNLDAEAPAVLSGPP--DFFGFSVEFYR | ||||...|.:..||| || FLPLVFLTGLCSP-----FNLDEHHPRLFPGPPSADF--------- Smith- Waterman 1981 Needleman-Wunsch1970 Arrange two strings according to rules (heuristics) so that desired relationships are identified and highlighted. Sequence 1: PPRVGGFNLDAEAPAVLSGPPDFFGFSVEFYR Sequence 2: FLPLVFLTGLCSPFNLDEHHPRLFPGPPSADF
How many different alignments are possible? Gaps need to be considered… The simplest case: Align sequence n to sequence m. m = A n = B There are 3 possible ways to align: A -A A- | or :: or :: B B- -B
A slightly more complex alignment m = aa n = bb Slide Courtesy of ZhipingWeng – UMass-Worchester
Potential Alignments The number of potential alignments is unmanageable. Brute force methods are impossible.
Dot Plots to Visualize Relationships • Arrange sequences along sides of a matrix. • Fill in cells according to identity. • Similar regions are diagonals. • Parallel diagonals are repeats. • Perpendicular diagonals are inverted repeats.
Dynamic Programming 3 step procedure that will produce an optimal alignment according to a set of prearranged rules. 1 - Initialization 2 - Matrix Fill 3 - Traceback -GAATTCAGTTA GGA-T-C-G--A Global Alignment – Needleman SB, Wunsch CD. (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins". J Mol Biol 48 (3): 443–53 Local Alignment – Smith TF, Waterman MS (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197
Rules (heuristics) for a global alignment example Match = 1, Mismatch = 0 Gap Penalty = 0 Traceback starts in lower right hand corner and continues to the upper left hand corner (Global Alignment). In case of tie, traceback prioritized as follows: A T G C A 1 0 0 0 T 0 1 0 0 G 0 0 1 0 C 0 0 0 1 1 2 3 ***Work along in excel or on paper*** http://luria.mit.edu/July_AP_2009/DP_files/
Initialization Sequences to be aligned arranged on the edges Because the gap penalty is 0, all gap rows are filled with 0. If there was a gap penalty these rows would be gap penalty increments.
Matrix Fill Step 0+1 = 1 0-0 = 0 Si-1 j-1 Si j-1 0-0 = 0 Si-1 j-1 + mij Sij = max Si-1 j + w Si j-1 + w Si-1 j Sij WHERE: mij= match (1) /mismatch (0) score w = gap penalty (0)
Matrix Fill Step - 2 0+0 = 0 0-0 = 0 Si j-1 Si-1 j-1 1-0 = 1 Si-1 j-1 + mij Sij = max Si-1 j + w Si j-1 + w Si-1 j Sij WHERE: mij= match (1) /mismatch (0) score w = gap penalty (0)
Traceback Step 5+1 = 6 1 2 3 5-0 != 0 5-0 != 0 A | A Traceback Priority
Traceback Step - 2 4+1 = 5 1 2 3 4-0 != 5 5-0 = 5 Tie break Traceback Priority TA | -A
Traceback Step - 3 4+1 = 5 1 2 3 4-0 != 5 5-0 = 5 Tie break Traceback Priority TTA | --A
Traceback Step - 4 4+1 = 5 1 2 3 4-0 != 5 4-0 != 5 Traceback Priority GTTA | | G--A
Traceback Step - 5 3+1 = 4 1 2 3 3-0 != 4 4-0 = 4 Tie break Traceback Priority AGTTA | | -G--A
Rules (heuristics) for a local alignment example Match = 1, Mismatch = -1 Gap Penalty = -1 Minimum score at any position is 0 Traceback starts at upper left hand maximum and continues until 0 is encountered (Local Alignment). In case of tie, traceback prioritized as follows: A T G C A 1 -1 -1 -1 T -1 1 -1 -1 G -1 -1 1 -1 C -1 -1 -1 1 1 2 3 ***Work along in excel or on paper*** http://luria.mit.edu/July_AP_2009/DP_files/
Initialization Sequences to be aligned arranged on the edges The gap penalty is now -1 but 0 is the minimum due to the Smith-Waterman rule.
Matrix Fill Step 0+1 = 1 Si-1 j-1 Si j-1 0-1 = -1 ~ 0 0-1 = -1 ~ 0 Si-1 j-1 + mij Sij = max Si-1 j + w Si j-1 + w 0 Si-1 j Sij WHERE: mij= match (1) /mismatch (-1) score w = gap penalty (-10)
Matrix Fill Step - 2 0-1 = -1 ~ 0 0-1 = -1 ~ 0 Si j-1 Si-1 j-1 1-1 = 0 Si-1 j-1 + mij Sij = max Si-1 j + w Si j-1 + w 0 Si-1 j Sij WHERE: mij= match (1) /mismatch (0) score w = gap penalty (0)
Traceback Step 1+1 = 2 1 2 3 0-1 = 1 ~ 0 0-1 = 1 ~ 0 A | A Traceback Priority
Traceback Step - 2 0+1 = 1 1 2 3 1-1 != 1 0-1 != 1 Traceback Priority GA || GA Try starting the traceback at other maxima. Also try DP calculator: http://web.cecs.pdx.edu/~ps/CapStone03/dynvis/SimilarityApplet.html
Additional Complexity:Similarity Matrices and Gap Penalties # # This matrix was produced by "pam" Version 1.0.6 [28-Jul-93] # # PAM 30 substitution matrix, scale = ln(2)/2 = 0.346574 # # Expected score = -5.06, Entropy = 2.57 bits # # Lowest score = -17, Highest score = 13 # A R N D C Q E G H I L K M F P S T W Y V B Z X * A 6 -7 -4 -3 -6 -4 -2 -2 -7 -5 -6 -7 -5 -8 -2 0 -1 -13 -8 -2 -3 -3 -3 -17 R -7 8 -6 -10 -8 -2 -9 -9 -2 -5 -8 0 -4 -9 -4 -3 -6 -2 -10 -8 -7 -4 -6 -17 N -4 -6 8 2 -11 -3 -2 -3 0 -5 -7 -1 -9 -9 -6 0 -2 -8 -4 -8 6 -3 -3 -17 D -3 -10 2 8 -14 -2 2 -3 -4 -7 -12 -4 -11 -15 -8 -4 -5 -15 -11 -8 6 1 -5 -17 C -6 -8 -11 -14 10 -14 -14 -9 -7 -6 -15 -14 -13 -13 -8 -3 -8 -15 -4 -6 -12 -14 -9 -17 Q -4 -2 -3 -2 -14 8 1 -7 1 -8 -5 -3 -4 -13 -3 -5 -5 -13 -12 -7 -3 6 -5 -17 E -2 -9 -2 2 -14 1 8 -4 -5 -5 -9 -4 -7 -14 -5 -4 -6 -17 -8 -6 1 6 -5 -17 G -2 -9 -3 -3 -9 -7 -4 6 -9 -11 -10 -7 -8 -9 -6 -2 -6 -15 -14 -5 -3 -5 -5 -17 H -7 -2 0 -4 -7 1 -5 -9 9 -9 -6 -6 -10 -6 -4 -6 -7 -7 -3 -6 -1 -1 -5 -17 I -5 -5 -5 -7 -6 -8 -5 -11 -9 8 -1 -6 -1 -2 -8 -7 -2 -14 -6 2 -6 -6 -5 -17 L -6 -8 -7 -12 -15 -5 -9 -10 -6 -1 7 -8 1 -3 -7 -8 -7 -6 -7 -2 -9 -7 -6 -17 K -7 0 -1 -4 -14 -3 -4 -7 -6 -6 -8 7 -2 -14 -6 -4 -3 -12 -9 -9 -2 -4 -5 -17 M -5 -4 -9 -11 -13 -4 -7 -8 -10 -1 1 -2 11 -4 -8 -5 -4 -13 -11 -1 -10 -5 -5 -17 F -8 -9 -9 -15 -13 -13 -14 -9 -6 -2 -3 -14 -4 9 -10 -6 -9 -4 2 -8 -10 -13 -8 -17 P -2 -4 -6 -8 -8 -3 -5 -6 -4 -8 -7 -6 -8 -10 8 -2 -4 -14 -13 -6 -7 -4 -5 -17 S 0 -3 0 -4 -3 -5 -4 -2 -6 -7 -8 -4 -5 -6 -2 6 0 -5 -7 -6 -1 -5 -3 -17 T -1 -6 -2 -5 -8 -5 -6 -6 -7 -2 -7 -3 -4 -9 -4 0 7 -13 -6 -3 -3 -6 -4 -17 W -13 -2 -8 -15 -15 -13 -17 -15 -7 -14 -6 -12 -13 -4 -14 -5 -13 13 -5 -15 -10 -14 -11 -17 Y -8 -10 -4 -11 -4 -12 -8 -14 -3 -6 -7 -9 -11 2 -13 -7 -6 -5 10 -7 -6 -9 -7 -17 V -2 -8 -8 -8 -6 -7 -6 -5 -6 2 -2 -9 -1 -8 -6 -6 -3 -15 -7 7 -8 -6 -5 -17 B -3 -7 6 6 -12 -3 1 -3 -1 -6 -9 -2 -10 -10 -7 -1 -3 -10 -6 -8 6 0 -5 -17 Z -3 -4 -3 1 -14 6 6 -5 -1 -6 -7 -4 -5 -13 -4 -5 -6 -14 -9 -6 0 6 -5 -17 X -3 -6 -3 -5 -9 -5 -5 -5 -5 -5 -6 -5 -5 -8 -5 -3 -4 -11 -7 -5 -5 -5 -5 -17 * -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 -17 1 Gap opening/extension and position specific gap penalties.
Commonly used sequence alignment tools KI production 7/8-7/16/09 7296853 Kb Massive database size requires speed. Different aligners for different jobs “All fast alignment programs that I am aware of break the alignment problem into two parts. Initially in a “search stage,” the program detects regions of the two sequences which are likely to be homologous. The program then in an “alignment stage” examines these regions in more detail and produces alignments...” WJ Kent, Genome Research (2002) 12:656-664
Pair-wise Global and Local Alignment needle and water: http://www.ebi.ac.uk/Tools/emboss/align/index.html Example Sequences: >Sequence1 PPRVGGFNLDAEAPAVLSGPPDFFGFSVEFYR >Sequence2 FLPLVFLTGLCSPFNLDEHHPRLFPGPPSADF *Command-line demo!
BLAST Algorithm Regions of similarity are rapidly identified by converting query sequences into “words” followed by database scan with the high scoring words. Groups of word matches meeting proximity parameters are combined are extended using dynamic programming methods to produce optimal local alignments.
BLAST – Searching databases for homologs Example query sequence: http://luria.mit.edu/July_AP_2009/a5.pep Tool: http://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAT - Algorithm Regions of similarity are rapidly identified by comparing all k-mers in the query to all non-overlapping k-mers in the target. High identity matches are extended and overlapped according to gene-based heuristics for nucleotide queries (GT-AT junctions, intron/exon size) and dynamic programming for protein queries.
BLAT – Gapped Alignment to Genomes Example query sequence: http://luria.mit.edu/July_AP_2009/a5.seq Tool: http://genome.ucsc.edu/cgi-bin/hgBlat
Clustal Algorithm Pair-wise relationships between sequences in analysis set are calculated. The “fast and slow” method uses dynamic programming. These pair-wise distances are used to generate a weighted guide tree that informs subsequent group-wise alignment. Multiple sequence alignment is progressively built up according to the order of the sequences in the guide tree and using the dynamic programming algorithm. First the most closely related sequences are aligned, then this alignment is aligned to the next most closely related sequence or alignment.
Clustal – Multiple Sequence Alignment Example Sequence Set: http://luria.mit.edu/July_AP_2009/ints.pep Tool (ClustalXvsClustalW): http://www.ebi.ac.uk/Tools/clustalw2/index.html
Next-Gen Sequencing ~10,000,000 sequences per sample per run!
Maq - Algorithm First 28 bases of query sequence is used to query database. No gaps and up to 2 mismatches are allowed. Mapping quality is assessed summing quality scores of mismatched bases. Higher sums = lower quality. When read map equivalently to multiple positions, mapping quality is set to 0 and 1 random position is reported. Mate pair information is used to refine and improve mapping data. Refinement includes the use of the Smith-Waterman algorithm.