610 likes | 824 Views
Pairwise and Multiple Sequence Alignment Lesson 2. Motivation.
E N D
Motivation ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA || || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
What is sequence alignment? Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences. MVNLTSDEKTAVLALWNKVDVEDCGGE |||| ||||| ||| |||| || MVHLTPEEKTAVNALWGKVNVDAVGGE
Why perform a pairwise sequence alignment? e.g., predicting characteristics of a protein – premised on: similar sequence (or structure) similar function Finding homology between two sequences
Local vs. Global • Local alignment – finds regions of high similarity in parts of the sequences • Global alignment – finds the best alignment across the entire two sequences ADLGAVFALCDRYFQ |||| |||| | ADLGRTQN CDRYYQ ADLGAVFALCDRYFQ |||| |||| | ADLGRTQN-CDRYYQ
Evolutionary changes in sequences Three types of nucleotide changes: • Substitution – a replacement of one (or more) sequence characters by another: • Insertion - an insertion of one (or more) sequence characters: • Deletion – a deletion of one (or more) sequence characters: AAGA AACA AAG A T A A GA Insertion + Deletion Indel
Choosing an alignment: • Many different alignments between two sequences are possible: AAGCTGAATTCGAA AGGCTCATTTCTGA AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA- . . . How do we determine which is the best alignment?
A C G T A C G T Toy exercise Compute the scores of each of the following alignments using this naïve scoring scheme • Match: +1 • Mismatch: -2 • Indel: -1 Scoring scheme: Substitution matrix Gap penalty (opening = extending) AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA-
Substitution matrices: accounting for biological context • Which best reflects the biological reality regarding nucleotide mismatch penalty? • Tr > Tv > 0 • Tv > Tr > 0 • 0 > Tr > Tv • 0 > Tv > Tr Tr = Transition Tv = Transversion
Scoring schemes: accounting for biological context • Which best reflects the biological reality regarding these mismatch penalties? • Arg->Lys > Ala->Phe • Arg->Lys > Thr->Asp • Asp->Val > Asp->Glu
PAM matrices • Family of matrices PAM 80, PAM 120, PAM 250, … • The number with a PAM matrix (the n in PAMn) represents the evolutionary distance between the sequences on which the matrix is based • The (ith,jth)cell in a PAMn matrix denotes the probability that amino-acid i will be replaced by amino-acid j in time n: Pi→j,n • Greater n numbers denote greater distances
PAM - limitations • Based on only one original dataset • Examines proteins with few differences (85% identity) • Based mainly on small globular proteins so the matrix is biased
BLOSUM matrices • Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped, manually created local alignments) • BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity • The (ith,jth)cell in a BLOSUM matrix denotes the log of odds of the observed frequency and expected frequency of amino acids i and j in the same position in the data: log(Pij/qi*qj) • Higher n numbers denote higher identity between the sequences on which the matrix is based
PAM Vs. BLOSUM PAM100 = BLOSUM90 PAM120 = BLOSUM80 PAM160 = BLOSUM60 PAM200 = BLOSUM52 PAM250 = BLOSUM45 More distant sequences • BLOSUM62 for general use • BLOSUM80 for close relations • BLOSUM45 for distant relations • PAM120 for general use • PAM60 for close relations • PAM250 for distant relations
Substitution matrices exercise • Pick the best substitution matrix (PAM and BLOSUM) for each pairwise alignment: • Human – chimp • Human - yeast • Human – fish PAM options: PAM60 PAM120 PAM250 BLOSUM options: BLOSUM45 BLOSUM62 BLOSUM80
Substitution matrices • Nucleic acids: • Transition-transversion • Amino acids: • Evolutionary (empirical data) based: (PAM, BLOSUM) • Physico-chemical properties based (Grantham, McLachlan)
Gap penalty AAGCGAAATTCGAAC A-G-GAA-CTCGAAC AAGCGAAATTCGAAC AGG---AACTCGAAC • Which alignment has a higher score? • Which alignment is more likely?
Pairwise alignment algorithm matrix representation: formulation 2 sequences: S1 and S2 and a Scoring scheme: match = 1, mismatch = -1, gap = -2 V[i,j] = value of the optimal alignment between S1[1…i] and S2[1…j] V[i,j] + S(S1[i+1],S2[j+1]) V[i+1,j+1] = max V[i+1,j] + S(gap) V[i,j+1] + S(gap)
Scoring scheme: Match = 1 Mismatch = -1 Indel (gap) = -2 Pairwise alignment algorithm matrix representation: initialization S1 S2
Scoring scheme: Match = 1 Mismatch = -1 Indel (gap) = -2 Pairwise alignment algorithm matrix representation: filling the matrix S1 S2
Pairwise alignment algorithm matrix representation: trace back S1 S2
Pairwise alignment algorithm matrix representation: trace back S1 S2 AAAC AG-C
Assessing the significance of an alignment score True AAGCTGAATTCGAA AGGCTCATTTCTGA AAGCTGAATTC-GAA AGGCTCATTTCTGA- 28.0 Random AGATCAGTAGACTA GAGTAGCTATCTCT AGATCAGTAGACTA----- ----GAGTAG-CTATCTCT 26.0 . . CGATAGATAGCATA GCATGTCATGATTC CGATAGATAGCATA--------- ---------GCATGTCATGATTC 16.0
BLAST 2 sequences (bl2Seq) at NCBI Produces the local alignment of two given sequences using BLAST (Basic Local Alignment Search Tool)engine for local alignment • Does not use an exact algorithm but a heuristic
Bl2Seq - query • blastn – nucleotide blastp – protein
Bl2seq results Dissimilarity Low complexity Similarity Gaps Match
Query type: AA or DNA? • For coding sequences, AA (protein) data are better • Selection operates most strongly at the protein level → the homology is more evident • AA – 20 char’ alphabet DNA - 4 char’ alphabet lower chance of random homology for AA ↓
BLAST – programs Query: DNA Protein Database: DNA Protein
Blast scores: • Bits score– A score for the alignment according to the number of similarities, identities, etc. It has a standard set of units and is thus independent of the scoring scheme • Expected-score (E-value) –The number of alignments with the same or higher score one can “expect” to see by chance when searching a random database with a random sequence of particular sizes. The closer the e-value is to zero, the greater the confidence that the hit is really a homolog
Multiple sequence alignment Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Seq7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD--IAVEWWSNG-- Similar to pairwise alignment BUT nsequences are aligned instead of just 2 Each row represents an individual sequence Each column represents the ‘same’ position
Why perform an MSA? MSAs are at the heart of comparative genomics studies which seek to study evolutionary histories, functional and structural aspects of sequences, and to understand phenotypic differences between species
variable conserved Multiple sequence alignment Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Seq7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD--IAVEWWSNG-- Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Seq7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD--IAVEWWSNG--
Alignment methods There is no available optimal solution for MSA – all methods are heuristics: • Progressive/hierarchical alignment (ClustalX) • Iterative alignment (MAFFT, MUSCLE)
Progressive alignment A B C D E First step: compute pairwise distances Compute the pairwise alignments for all against all (10 pairwise alignments). The similarities are converted to distances and stored in a table
A B C D E Second step: build a guide tree • Cluster the sequences to create a tree (guide tree): • represents the order in which pairs of sequences are to be aligned • similar sequences are neighbors in the tree • distant sequences are distant from each other in the tree The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
Sequence A Sequence B Sequence C Sequence D Sequence E A B C D E Third step: align sequences in a bottom up order • Align the most similar (neighboring) pairs • Align pairs of pairs • Align sequences clustered to pairs of pairs deeper in the tree
Sequence A Sequence B Sequence C Sequence D Sequence E A B C D E Main disadvantages of progressive alignments Guide-tree topology may be considerably wrong Globally aligning pairs of sequences may create errors that will propagate through to the final result
A MSA B C D E Iterative alignment A B C DE Pairwise distance table Iterate until the MSA does not change (convergence) Guide tree
MSA input: multiple sequence Fasta file >gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens] MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN ALAHKYH >gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens] MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN ALAHKYH >gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens] MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI ALAHKYH >gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens] MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS ALSSRYH >gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens] MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS ALSSRYH >gi|4885397|ref|NP_005323.1| hemoglobin, zeta [Homo sapiens] MSLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDA VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEK YR