Pairwise Sequence Alignment
WHAT? • Given any two sequences (DNA or protein), we want to know to what extent they are similar.
Seq 1: CATATTGCAGTGGTCCCGCGTCAGGCT
Seq 2: TAAATTGCGTGGTCGCACTGCACGCT
Aligned:
CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT
WHY? • Discover function • Study evolution • Find crucial features within a sequence • Identify causes of disease
Discover function • Sequences that are similar probably have the same function
Study evolution • If two sequences from different organisms are similar, they may have a common ancestor
Find crucial features • Regions that are strongly conserved between different sequences can indicate functional importance • Example: conservation of IGFALS (insulin-like growth factor binding protein, acid-labile subunit) between human and mouse
CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT
Identify cause of disease • Comparison of sequences between individuals can detect changes that are related to diseases
Sickle Cell Anemia • Caused by a single base substitution (an A replaced by a T), so that valine is incorporated instead of glutamic acid in hemoglobin • Image source: http://www.cc.nih.gov/ccc/ccnews/nov99/
Sequence Modifications • Three types of changes, starting from TCAGT • Substitution (point mutation): TCAGT → TCCGT • Insertion: TCAGT → TCGAGT • Deletion: TCAGT → TCGT • Insertions and deletions together: indel (replication slippage)
To align two sequences we need a quantitative model for evaluating their similarity. How do we quantify sequence similarity?
Scoring Similarity • Assume independent mutation model • Each site considered separately • Score at each site • Positive if the same • Negative if different • Sum to make final score • Can be positive or negative • Significance depends on sequence length • [Figure: example sequence GTAGTCCTAGCG]
Substitutions Only (not including indels) • Sequences compared base-by-base • Count the number of matches and mismatches • For example: matches score +2, mismatches score -1
TTCGTCGTAGTCGGCTCGACCTG
GTACGTCTAGCGAGCGTGATCCT
9 matches: +18 • 14 mismatches: -14 • Total score: +4 (a weak match)
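The base-by-base count on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `score_ungapped` is made up here, and the two 23-base sequences are assumed to be the two halves of the 46-letter run shown on the slide.

```python
def score_ungapped(seq1: str, seq2: str, match: int = 2, mismatch: int = -1) -> int:
    """Score two equal-length sequences position by position (no indels)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    # Each site is considered separately and the per-site scores are summed.
    return sum(match if a == b else mismatch for a, b in zip(seq1, seq2))


# Assumed split of the slide's example into two 23-base sequences.
seq1 = "TTCGTCGTAGTCGGCTCGACCTG"
seq2 = "GTACGTCTAGCGAGCGTGATCCT"
print(score_ungapped(seq1, seq2))  # 9 matches * 2 + 14 mismatches * -1 = 4
```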
Including Indels • Create an ‘alignment’ • Count matches within alignment • Indels are scored as mismatches (-1)
TT-CGTCGTAGTCG-GC-TCGACC-TG
GTACGTC-TAG-CGAGCGT-GATCCT-
17 matches: +34 • 2 mismatches: -2 • 8 indels: -8 • Total score: +24 (a strong match)
Choosing an Alignment • Many different alignments are possible • All possible alignments should be considered • Take the best score found • There may be more than one best alignment
TT-CGTCGTAGTCG-GC-TCGACC-TG
GTACGTC-TAG-CGAGCGT-GATCCT-
Score: +24
-TTCGT-CGTAGTC-GGCTCG-ACCTG
GTAC-GTCTA-GCGAGCGT-GATCC-T
Score: 0
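To compare candidate alignments such as the two above, each one can be scored column by column. The sketch below assumes the same scheme as the slides (match +2, mismatch -1, indel -1); the name `score_alignment` is hypothetical.

```python
def score_alignment(row1: str, row2: str, match: int = 2,
                    mismatch: int = -1, indel: int = -1) -> int:
    """Score a gapped alignment given as two rows of equal length ('-' = gap)."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += indel      # a letter aligned against a gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


good = ("TT-CGTCGTAGTCG-GC-TCGACC-TG", "GTACGTC-TAG-CGAGCGT-GATCCT-")
poor = ("-TTCGT-CGTAGTC-GGCTCG-ACCTG", "GTAC-GTCTA-GCGAGCGT-GATCC-T")
print(score_alignment(*good))  # 24
print(score_alignment(*poor))  # 0
```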
Why is it hard? • Alignment requires an algorithm that performs a number of comparisons roughly proportional to the square of the average sequence length, n².
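To give a feel for why trying every alignment one by one is impractical, the sketch below counts the possible gapped alignments of two sequences and compares that with the number of cells in a table with one cell per pair of positions. The function name `count_alignments` and the counting convention (alignments that differ only in the order of adjacent gaps are counted separately) are assumptions made for this illustration, not something stated on the slides.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Number of ways to align sequences of lengths m and n.

    The last column of any alignment is one of three things: a letter pair,
    a letter of the first sequence against a gap, or a gap against a letter
    of the second sequence.
    """
    if m == 0 or n == 0:
        return 1  # only gaps remain on one side, so there is a single way
    return (count_alignments(m - 1, n - 1)
            + count_alignments(m - 1, n)
            + count_alignments(m, n - 1))


for m, n in [(2, 2), (6, 5), (23, 23)]:
    print(f"lengths {m} and {n}: {count_alignments(m, n)} alignments, "
          f"{m * n} table cells")
```

The count grows far faster than the m × n table, which is the motivation for filling in a table instead of enumerating alignments.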
Dynamic Programming • A method for reducing a complex problem to a set of identical sub-problems • The best solution to one sub-problem is independent of the best solution to the other sub-problem
What does it mean? If a path from X→Z passes through Y, the best path from X→Y is independent of the best path from Y→Z
Sequence Global Alignment: Needleman-Wunsch • Sequences: A = ACGCTG, B = CATGT • [Figure: empty scoring grid with A = ACGCTG across the top (columns 1 to 6) and B = CATGT down the left (rows 1 to 5)]
Example • Sequences: A = ACGCTG, B = CATGT • Match: +2, Other: -1 • Score of best alignment between AC and CATG: -1 • …between ACG and CATG: 2 • …between AC and CATGT: -2 • Calculate score between ACG and CATGT: ?
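Example • Three ways to add the next column of the alignment • Align the next letter in the sequences: letter 3 of A with letter 5 of B • Insertion in the first sequence (del): a gap (-) with letter 5 of B • Insertion in the second sequence: letter 3 of A with a gap (-)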
Example • Sequences: A = ACGCTG, B = CATGT • -1 from before plus -1 for mismatch of G against T: -2 • 2 from before plus -1 for mismatch of - against T: 1 • -2 from before plus -1 for mismatch of G against -: -3 • Cell gets the highest score of -2, 1, -3, which is 1
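The single-cell calculation above can be written as a small function. This is a minimal sketch assuming the slide's scheme (match +2, anything else -1); the names `cell_score`, `diag`, `above` and `left` are chosen here for illustration and assume A runs across the top of the table and B down the left.

```python
def cell_score(diag: int, above: int, left: int, a: str, b: str,
               match: int = 2, other: int = -1) -> int:
    """Best score for one table cell, given its three neighbouring cells.

    diag  : score with one letter fewer from both A and B
    above : score with one letter fewer from B (b is then aligned to a gap)
    left  : score with one letter fewer from A (a is then aligned to a gap)
    """
    pair = match if a == b else other
    return max(diag + pair,    # align a against b
               above + other,  # gap against b
               left + other)   # a against gap


# ACG against CATGT: the neighbouring scores are -1, 2 and -2.
print(cell_score(diag=-1, above=2, left=-2, a="G", b="T"))  # 1
```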
Example • Sequences: A = ACGCTG, B = CATGT • [Figure: scoring table with the neighbouring cell values -1, 2 and -2]
Alignments built up while filling the table (A on top, B below):

A
-

ACGCTG
------

-----
CATGT

A
C

AC
-C

ACG
-C-

ACGC
---C

ACGC
-C--

ACG
-CA

ACGCTG-
-C-ATGT

ACGCTG-
-CA-TGT

-ACGCTG
CATG-T-
Summary Needleman-Wunsch Alignment • Global alignment between sequences • Compare entire sequence against another • Create scoring table • Sequence A across top, B down left • Cell at column i and row j contains the score of best alignment between the first i elements of A and the first j elements of B • Global alignment score is bottom right cell
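The summary above can be turned into a short program. The following is a minimal sketch, not a reference implementation: it assumes the scoring used in the example (match +2, mismatch or indel -1), the function name `needleman_wunsch` is chosen here, and the traceback returns only one of possibly several best alignments.

```python
def needleman_wunsch(a: str, b: str, match: int = 2, other: int = -1):
    """Global alignment of a (across the top) and b (down the left)."""
    cols, rows = len(a), len(b)
    # score[j][i] = best score for the first i letters of a and first j of b
    score = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, cols + 1):
        score[0][i] = i * other          # letters of a against gaps only
    for j in range(1, rows + 1):
        score[j][0] = j * other          # letters of b against gaps only
    for j in range(1, rows + 1):
        for i in range(1, cols + 1):
            pair = match if a[i - 1] == b[j - 1] else other
            score[j][i] = max(score[j - 1][i - 1] + pair,
                              score[j - 1][i] + other,
                              score[j][i - 1] + other)

    # Trace back from the bottom-right cell to recover one best alignment.
    top, bottom = [], []
    i, j = cols, rows
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else other
            if score[j][i] == score[j - 1][i - 1] + pair:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[j][i] == score[j][i - 1] + other:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[rows][cols], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("ACGCTG", "CATGT")
print(best)   # 2
print(row_a)  # -ACGCTG
print(row_b)  # CATG-T-
```

Other alignments with the same score exist for this pair; which one comes out depends on the order in which the traceback checks the three neighbouring cells.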
Global vs. Local alignment • Sequences: DorothyHodgkin and DorothyCrowfootHodgkin
Global alignment:
DOROTHY--------HODGKIN
DOROTHYCROWFOOTHODGKIN
Local alignment:
DOROTHY
DOROTHY
HODGKIN
HODGKIN
Local Alignment: Smith-Waterman • Best score for aligning part of sequences • Often beats global alignment score
Global Alignment
ATTGCAGTG-TCGAGCGTCAGGCT
ATTGCGTCGATCGCAC-GCACGCT
Local Alignment
CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT
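A local alignment can be sketched with the same kind of table, with two changes: no cell is allowed to drop below zero, and the result is read from the highest-scoring cell rather than the bottom-right one. The code below is an illustrative sketch under the same assumed scoring (match +2, mismatch or indel -1); the function name `smith_waterman` and the toy sequences are chosen here, not taken from the slides.

```python
def smith_waterman(a: str, b: str, match: int = 2, other: int = -1):
    """Best local alignment between part of a and part of b."""
    cols, rows = len(a), len(b)
    score = [[0] * (cols + 1) for _ in range(rows + 1)]
    best, best_j, best_i = 0, 0, 0
    for j in range(1, rows + 1):
        for i in range(1, cols + 1):
            pair = match if a[i - 1] == b[j - 1] else other
            # The extra 0 lets an alignment start afresh at any cell.
            score[j][i] = max(0,
                              score[j - 1][i - 1] + pair,
                              score[j - 1][i] + other,
                              score[j][i - 1] + other)
            if score[j][i] > best:
                best, best_j, best_i = score[j][i], j, i

    # Trace back from the best cell until a zero-score cell is reached.
    top, bottom = [], []
    j, i = best_j, best_i
    while score[j][i] > 0:
        pair = match if a[i - 1] == b[j - 1] else other
        if score[j][i] == score[j - 1][i - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[j][i] == score[j][i - 1] + other:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("ACGT", "TTACGTTT"))  # (8, 'ACGT', 'ACGT')
```

For this toy pair a global alignment would have to pay for four unmatched letters of the longer sequence, which is one way a local score can beat the global one.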
Global vs. Local alignment • Alignment of two genomic sequences
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Mouse DNA
CATGCGTCTGACgctttttgctagcgatatcggactATCGATATA
Global vs. Local alignment • Alignment of two genomic sequences
Global Alignment
Human:CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
Mouse:CATGCGTCTGACgct---ttttgctagcgatatcggactATCGAT-ATA
      ****** *****         *     ***      *  ****** ***
Local Alignment
Human:CATGCGACTGAC
Mouse:CATGCGTCTGAC
Human:ATCGATCATA
Mouse:ATCGAT-ATA
Global vs. Local alignment • Alignment of genomic DNA and mRNA
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
Global vs. Local alignment • Alignment of genomic DNA and mRNA
Global Alignment
DNA: CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
mRNA:CATGCGACTGAC---------------------------ATCGATCATA
     ************                           **********
Local Alignment
DNA: CATGCGACTGAC
mRNA:CATGCGACTGAC
DNA: ATCGATCATA
mRNA:ATCGATCATA