Lecture 6 Sequence Alignment Bioinformatics Dr. Aladdin Hamwieh Khalid Al-shamaa Abdulqader Jighly Aleppo University Faculty of technical engineering Department of Biotechnology 2010-2011
Gene prediction: Methods • Gene prediction can be based upon: • Coding statistics (statistical approach) • Gene structure (statistical approach) • Comparison (similarity-based approach)
Alignment • Sequence alignment involves identifying the correct locations of the deletions and insertions that have occurred in either of the two lineages since their divergence from a common ancestor. • Dynamic programming is the standard approach to sequence alignment. • Global alignment: optimizes the overall similarity of the two sequences. • Local alignment: finds only relatively conserved subsequences. • Pairwise alignment: an alignment of two sequences. • Multiple alignment: an alignment of more than two sequences.
Methods of alignment: • Dot matrix • Distance Matrix
Dot Plot Algorithm • Take two sequences (A & B), write sequence A out as a row (length=m) and sequence B as a column (length =n) • Create a table or “matrix” of “m” columns and “n” rows • Compare each letter of sequence A with every letter in sequence B. If there’s a match mark it with a dot, if not, leave blank
Dot Plot Algorithm [Figure: example dot plot of two short amino acid sequences. Legend: a dot marks complete identity; X marks a position that is not matched.]
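The three steps of the dot plot algorithm can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the slides: the function name `dot_plot` and the example sequences are assumptions chosen for the demonstration, and no window or threshold filtering is applied.

```python
def dot_plot(seq_a, seq_b):
    """Build a dot matrix as text rows.

    seq_a runs along the columns (length m), seq_b along the rows (length n).
    A '.' marks a matching pair of letters; a space marks a mismatch.
    """
    return [
        "".join("." if a == b else " " for a in seq_a)
        for b in seq_b
    ]


# Hypothetical example sequences, used only for illustration.
seq_a = "ACDEFGHG"
seq_b = "ACDEFGHG"

print("  " + seq_a)
for letter, row in zip(seq_b, dot_plot(seq_a, seq_b)):
    print(letter + " " + row)
```

Identical sequences give an unbroken main diagonal; insertions or deletions shift the diagonal, and repeated regions appear as extra parallel diagonals.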
Advantages: Highlighting Information The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene.
Advantages: Highlighting Information The two pairs of diagonally oriented parallel lines most probably indicate that two small internal duplications occurred in the bacterial gene.
Scoring Matrices • Scoring matrices are created based on biological evidence. • To generalize scoring for nucleotide sequences, consider a (4+1) × (4+1) scoring matrix δ. • For an amino acid sequence alignment, the scoring matrix has size (20+1) × (20+1). • The additional 1 accounts for the score of a comparison with the gap character "-".
Scoring Matrix Elements • Input: two sequences over the same alphabet • Output: an alignment of the two sequences • Example: GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA • A possible alignment:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements: • Perfect matches • Mismatches • Insertions and deletions (indels)
Scoring scheme
     A   G   C   T   -
A   +1  -1  -1  -1  -2
G   -1  +1  -1  -1  -2
C   -1  -1  +1  -1  -2
T   -1  -1  -1  +1  -2
-   -2  -2  -2  -2   *
Score each position independently: • Match: +1 • Mismatch: -1 • Indel: -2 • The score of an alignment is the sum of its position scores.
Example 1:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+1 × 13) + (-1 × 2) + (-2 × 4) = 3
Example 2:
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score: (+1 × 5) + (-1 × 6) + (-2 × 12) = -25 (the 23 columns split into 5 matches, 6 mismatches and 12 indels)
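The position-by-position scoring can be checked with a short script. This is a sketch under the slide's scheme (match +1, mismatch -1, indel -2); the function name `score_alignment` is an assumption introduced here for illustration.

```python
def score_alignment(row1, row2, match=1, mismatch=-1, indel=-2):
    """Sum the per-column scores of two gapped rows of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot pair two gap characters")
        if a == "-" or b == "-":
            total += indel      # insertion or deletion
        elif a == b:
            total += match      # perfect match
        else:
            total += mismatch   # substitution
    return total


# First example alignment from the slide.
print(score_alignment("-GCGC-ATGGATTGAGCGA",
                      "TGCGCCATTGAT-GACC-A"))  # 3
```

Because every column is scored independently, the same function works for any pair of gapped rows, including alignments with long runs of gaps.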
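Transition and Transversion • Matrix example:
     A   C   G   T
A   +3  -2  -1  -2
C   -2  +3  -2  -1
G   -1  -2  +3  -2
T   -2  -1  -2  +3
In this matrix, transitions (A↔G, C↔T) score -1 and transversions (all other substitutions) score -2.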
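The Global Alignment Problem • Find the best alignment between two strings under a given scoring scheme • Input: strings v and w and a scoring scheme • Output: an alignment of maximum score • μ: mismatch penalty • σ: indel penalty
$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\
s_{i-1,j-1} - \mu & \text{if } v_i \neq w_j \\
s_{i-1,j} - \sigma \\
s_{i,j-1} - \sigma
\end{cases}
$$
Moves in the table: a vertical or horizontal step (↑, ←) scores -σ; a diagonal step scores +1 for a match and -μ for a mismatch.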
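The recurrence for s(i,j) translates directly into a dynamic programming table. The sketch below is illustrative only: the function name and the default values μ = 1, σ = 2 are assumptions, and the first row and column are initialised with pure indel costs, the usual boundary condition for global alignment, which the slide does not spell out.

```python
def global_alignment_score(v, w, mu=1, sigma=2):
    """Best global alignment score of v and w.

    Scoring: +1 per match, -mu per mismatch, -sigma per indel.
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]

    # Boundary (assumed): a prefix aligned against nothing costs only indels.
    for i in range(1, n + 1):
        s[i][0] = -sigma * i
    for j in range(1, m + 1):
        s[0][j] = -sigma * j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(diagonal,               # match or mismatch
                          s[i - 1][j] - sigma,    # gap in w
                          s[i][j - 1] - sigma)    # gap in v
    return s[n][m]


print(global_alignment_score("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA"))
```

The function returns only the optimal score; recovering the alignment itself would require storing the chosen move in each cell and tracing back from s(n,m).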
Longest Common Subsequences – Practice 1 • Mismatches are not allowed (mismatch penalty μ = ∞, i.e. a mismatch scores -∞) • No indel penalty (σ = 0) • Matches are rewarded with +1 • V = ATCTGAT • W = TGCATA
Longest Common Subsequences – Practice 10 • Computing similarity: s(V,W) = 4 • Computing distance: d(V,W) = n + m - 2 s(V,W) = 7 + 6 - 2×4 = 5, where n and m are the lengths of V and W
Longest Common Subsequences – Practice 10 • Alignment:
W: – T G C A T – A –
V: A T – C – T G A T
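The practice values can be reproduced with the standard longest common subsequence table. This is a minimal sketch: the function name is an assumption, V = ATCTGAT is taken from the practice slide, and W = TGCATA is read off the top row of the alignment shown above.

```python
def lcs_length(v, w):
    """Length of the longest common subsequence of v and w.

    Equivalent to alignment with match = +1, no mismatches allowed,
    and free indels (sigma = 0).
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1               # extend a match
            else:
                s[i][j] = max(s[i - 1][j], s[i][j - 1])     # free indel
    return s[n][m]


v, w = "ATCTGAT", "TGCATA"
similarity = lcs_length(v, w)
distance = len(v) + len(w) - 2 * similarity  # number of unmatched letters

print(similarity)  # 4
print(distance)    # 5
```

The distance counts the letters left out of the common subsequence, which is the number of indel columns in the alignment.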
Protein Substitution Matrix • Identity Scoring Matrix • Percent Accepted Mutation (PAM) • Blocks Substitution Matrix (BLOSUM)
Percent Accepted Mutation (PAM) • 1 PAM is the amount of evolutionary change that yields, on average, one substitution in 100 amino acid residues. • The PAM250 matrix is optimized for sequences separated by 250 PAM, i.e. 250 accepted substitutions per 100 amino acids (a longer evolutionary time; the same site can change more than once). • To derive a mutational probability matrix for a protein sequence that has undergone N percent accepted mutations (a PAM-N matrix), the PAM-1 matrix is raised to the N-th power. • PAM250 is suitable for comparing distantly related sequences, while a lower PAM number is suitable for comparing more closely related sequences.
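Deriving a PAM-N probability matrix from PAM-1 is repeated matrix multiplication. The sketch below shows only the mechanics: the 3-letter matrix `toy_pam1` is a made-up stand-in (a real PAM-1 matrix is 20 × 20 and estimated from data), and the helper names are assumptions.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size))
             for j in range(size)]
            for i in range(size)]


def mat_power(matrix, n):
    """Raise a square matrix to the n-th power (n >= 0)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]  # identity matrix
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Hypothetical PAM-1-like matrix over a 3-letter alphabet:
# each row sums to 1 and each residue stays unchanged with probability 0.99.
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

The diagonal entries shrink as N grows, which reflects why high PAM numbers suit distantly related sequences. The published PAM scoring matrices are log-odds values derived from such probability matrices, a step not shown here.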
Selecting a PAM Matrix • Low PAM numbers: short sequences, strong local similarities. • High PAM numbers: long sequences, weak similarities. • PAM60 for close relations (60% identity) • PAM120 recommended for general use (40% identity) • PAM250 for distant relations (20% identity) • If uncertain, try several different matrices • PAM40, PAM120, PAM250 recommended.
BLOSUM: Blocks Substitution Matrix • Based on the BLOCKS database • ~2000 blocks from 500 families of related proteins • Families of proteins with identical function • Blocks are short conserved patterns, 3-60 amino acids long, without gaps • Each block represents a sequence alignment with a different identity percentage [Figure: example block of aligned sequences]
BLOSUM Matrices • For each block, the amino acid substitution rates were calculated to create the BLOSUM matrix • Different BLOSUMn matrices are calculated independently from BLOCKS • For BLOSUMn, sequences within a block that are at least n percent identical are clustered and counted as a single sequence, so the matrix reflects substitutions between sequences less than n percent identical • BLOSUM62 represents closer sequences than BLOSUM45
Selecting a BLOSUM Matrix • For BLOSUMn, higher n suitable for sequences which are more similar • BLOSUM62 recommended for general use • BLOSUM80 for close relations • BLOSUM45 for distant relations
Equivalent PAM and Blosum matrices • The following matrices are roughly equivalent (listed from less divergent to more divergent): • PAM100 ≈ Blosum90 • PAM120 ≈ Blosum80 • PAM160 ≈ Blosum60 • PAM200 ≈ Blosum52 • PAM250 ≈ Blosum45
Generally speaking: • The Blosum matrices are best for detecting local alignments. • The Blosum62 matrix is the best for detecting the majority of weak protein similarities. • The Blosum45 matrix is the best for detecting long and weak alignments.
BLOSUM62 • Common amino acids have low weights • Rare amino acids have high weights
A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X
BLOSUM62 • A positive score marks a more likely substitution • A negative score marks a less likely substitution
Alignment score (BLOSUM62) • …PQG… aligned with …PQG…: 7 + 5 + 6 = 18 • …PQG… aligned with …PEG…: 7 + 2 + 6 = 15 • …PQG… aligned with …PQA…: 7 + 5 + 0 = 12
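The three alignment scores are plain sums of matrix lookups. The sketch below hard-codes only the five BLOSUM62 entries these examples need, copied from the matrix above; the dictionary and function names are assumptions for illustration.

```python
# Only the BLOSUM62 entries required by the three examples.
BLOSUM62_SUBSET = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("G", "G"): 6,
    ("Q", "E"): 2,
    ("G", "A"): 0,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def ungapped_score(s, t):
    """Score two equal-length, gap-free segments position by position."""
    if len(s) != len(t):
        raise ValueError("segments must have the same length")
    return sum(pair_score(a, b) for a, b in zip(s, t))


print(ungapped_score("PQG", "PQG"))  # 18
print(ungapped_score("PQG", "PEG"))  # 15
print(ungapped_score("PQG", "PQA"))  # 12
```

Identical residues score highest, a conservative substitution (Q to E) still scores positively, and G to A contributes nothing.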
Affine Gap Penalties • In nature, a series of k indels often comes as a single event rather than as a series of k single-nucleotide events:
More likely (one gap of length 2):
ATA__GC
ATATTGC
Less likely (two separate gaps of length 1):
ATAG_GC
AT_GTGC
• Normal (linear) scoring would give the same score to both alignments.
Accounting for Gaps • Gap: a contiguous sequence of spaces in one of the rows • The score for a gap of length x is -(ρ + σx), where ρ > 0 is the gap opening penalty (charged once for introducing the gap) and σ is the gap extension penalty (charged for every space in the gap) • ρ should be large relative to σ, so that extending an existing gap is penalized much less than opening a new one.
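The effect of the affine penalty -(ρ + σx) can be seen by scoring the two alignments from the Affine Gap Penalties slide. This is a sketch with assumed values (match +1, mismatch -1, ρ = 4, σ = 1) and an assumed function name; the slides do not give concrete penalties.

```python
def affine_score(row1, row2, match=1, mismatch=-1, rho=4, sigma=1, gap="_"):
    """Score an alignment where a gap of length x costs rho + sigma * x."""
    total = 0
    prev_gap_row = None  # row (1 or 2) that held a gap in the previous column
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            gap_row = 1 if a == gap else 2
            if gap_row != prev_gap_row:
                total -= rho      # opening penalty, once per gap
            total -= sigma        # extension penalty, once per gap column
            prev_gap_row = gap_row
        else:
            total += match if a == b else mismatch
            prev_gap_row = None
    return total


one_long_gap = affine_score("ATA__GC", "ATATTGC")
two_short_gaps = affine_score("ATAG_GC", "AT_GTGC")
print(one_long_gap)    # -1
print(two_short_gaps)  # -5
```

With ρ = 0 the scheme reduces to linear gap scoring and both alignments score the same; a positive ρ favours the single two-letter gap over two separate gaps.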
Multiple Sequence Alignment • All sequences are compared to each other (pairwise alignments) • A dendrogram (like a phylogenetic tree) is constructed, describing the approximate groupings of the sequences by similarity (stored in a file). • The final multiple alignment is carried out, using the dendrogram as a guide.