Pairwise Sequence Alignment: Lesson 2

Before we begin… ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA || || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…


What is sequence alignment?
• Alignment: comparing two (pairwise) or more (multiple) sequences
• Searching for a series of identical or similar characters in the sequences
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
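
The match line between two aligned sequences can be generated mechanically. The following is a minimal sketch, assuming the two sequences are already aligned and of equal length; the function name `match_line` is hypothetical and only identical characters are marked (similar but non-identical residues are left blank).

```python
def match_line(top: str, bottom: str) -> str:
    """Return a string with '|' where the aligned characters are identical."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    # A gap character ('-') never counts as a match, even against another gap.
    return "".join(
        "|" if x == y and x != "-" else " "
        for x, y in zip(top, bottom)
    )


if __name__ == "__main__":
    a = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
    b = "MVHLTPEEKTAVNALWGKVNVDAVGGE"
    print(a)
    print(match_line(a, b))
    print(b)
```

Counting the `|` characters in the result gives the number of identical positions.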

Why sequence alignment?
• Predict characteristics of a protein: use the structure or function information available in databases for known proteins with similar sequences to predict the structure or function of an unknown protein
• Assumption: similar sequences produce similar proteins

Local vs. Global
• Global alignment: finds the best alignment across the whole of the two sequences.
• Local alignment: finds regions of high similarity in parts of the sequences.
Global alignment forces alignment in regions which differ:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment concentrates on regions of high similarity:
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
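
The difference between the two modes can be illustrated with the classic dynamic-programming formulations (Needleman-Wunsch for global, Smith-Waterman for local). The slides do not name these algorithms, so treat this as an illustrative sketch rather than the method used in the lesson. The scoring values (match +1, mismatch -2, gap -1) and the function names are assumptions, and only the optimal score is computed, not the alignment itself.

```python
def global_score(a: str, b: str, match=1, mismatch=-2, gap=-1) -> int:
    """Best score for aligning ALL of a against ALL of b (Needleman-Wunsch)."""
    # First row: aligning an empty prefix of a with j letters of b costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: i letters of a against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a: str, b: str, match=1, mismatch=-2, gap=-1) -> int:
    """Best score over all pairs of substrings of a and b (Smith-Waterman)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores are floored at 0: a local alignment may start anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(global_score("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"))
    print(local_score("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"))
```

The only structural differences are the initialization (gap penalties vs. zeros), the floor at zero, and where the final score is read from (last cell vs. best cell anywhere).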

Evolutionary changes in sequences
In the course of evolution, sequences diverge from the ancestral sequence through random mutations.
Three types of mutations:
• Insertion: one or more letters are inserted into the sequence. AAGA → AAGTA
• Deletion: one or more letters are deleted from the sequence. AAGA → AGA
• Substitution: one or more letters are replaced by others. AAGA → AACA
Insertion + Deletion = Indel

Sequence alignment
AAGCTGAATTCGAA
AGGCTCATTTCTGA
One possible alignment:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
This alignment includes:
• 2 mismatches
• 4 indels (gaps)
• 10 perfect matches

Choosing an alignment:
• Many different alignments are possible:
AAGCTGAATTCGAA
AGGCTCATTTCTGA
Alignment 1:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Alignment 2:
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Which alignment is better?

Scoring an alignment: example of a naïve scoring system
• Match: +1
• Mismatch: -2
• Indel: -1
Alignment 1:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)×10 + (-2)×2 + (-1)×4 = 2
Alignment 2:
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)×9 + (-2)×2 + (-1)×6 = -1
Higher score → better alignment
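
The naïve scoring system can be written as a short function. This is a minimal sketch; the function name `score_alignment` is hypothetical, and the default values are the ones from the slide (match +1, mismatch -2, indel -1).

```python
def score_alignment(top: str, bottom: str, match=1, mismatch=-2, indel=-1) -> int:
    """Score a given pairwise alignment column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += indel      # a gap in either sequence
        elif x == y:
            score += match      # identical characters
        else:
            score += mismatch   # substitution
    return score


if __name__ == "__main__":
    print(score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))    # 2
    print(score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))  # -1
```

Under this scoring system the first alignment (score 2) is preferred over the second (score -1).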

Scoring system:
• Different scoring systems can produce different optimal alignments
• A scoring system implicitly represents a particular theory of similarity/dissimilarity between sequence characters: evolution based or physico-chemical properties based
• Some mismatches are more plausible than others:
  • Transition vs. transversion
  • Lys→Arg ≠ Lys→Cys
• Gap extension vs. gap opening

Substitution Matrices
• Nucleic acids:
  • Transition-transversion
• Amino acids:
  • Evolution (empirical data) based: PAM, BLOSUM
  • Physico-chemical properties based: Grantham, McLachlan
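
For nucleic acids, a transition-transversion scheme distinguishes purine↔purine or pyrimidine↔pyrimidine changes (transitions) from purine↔pyrimidine changes (transversions). The sketch below classifies a substitution and applies a toy score; the function names and the numeric scores are illustrative assumptions, not values from the lesson.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x: str, y: str) -> str:
    """Classify a pair of DNA bases as 'match', 'transition' or 'transversion'."""
    x, y = x.upper(), y.upper()
    if x not in PURINES | PYRIMIDINES or y not in PURINES | PYRIMIDINES:
        raise ValueError("expected one of A, C, G, T")
    if x == y:
        return "match"
    same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


# Hypothetical scores: transitions are penalized less than transversions.
TOY_SCORES = {"match": 1, "transition": -1, "transversion": -2}


def toy_score(x: str, y: str) -> int:
    return TOY_SCORES[substitution_type(x, y)]


if __name__ == "__main__":
    for pair in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "G")]:
        print(pair, substitution_type(*pair), toy_score(*pair))
```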

PAM Matrices
• Family of matrices: PAM80, PAM120, PAM250
• The number attached to a PAM matrix represents evolutionary distance
• Greater numbers denote greater distances

Which PAM matrix to use?
• Low PAM numbers: strong similarities
• High PAM numbers: weak similarities
• PAM120 for general use (40% identity)
• PAM60 for close relations (60% identity)
• PAM250 for distant relations (20% identity)
• If uncertain, try several different matrices:
  • PAM40, PAM120, PAM250

PAM - limitations
• Based on only one original dataset
• Examines proteins with few differences (85% identity)
• Based mainly on small globular proteins, so the matrix is biased

BLOSUM Matrices
• Different BLOSUMn matrices are calculated independently from BLOCKS
• BLOSUMn is based on sequences that share at least n percent identity
• BLOSUM62 represents closer sequences than BLOSUM45

Example: BLOSUM62 is derived from blocks of sequences that share at least 62% identity

Which BLOSUM matrix to use?
• Low BLOSUM numbers for distant sequences
• High BLOSUM numbers for similar sequences
• BLOSUM62 for general use
• BLOSUM80 for close relations
• BLOSUM45 for distant relations

PAM vs. BLOSUM
(from top to bottom: more distant sequences)
PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45

Gap penalty
• We expect to penalize gaps
• A different score is used for gap opening and for gap extension
• Insertions and deletions are rare in evolution
• But once they occur, they are easy to extend
• Gap-extension penalty < gap-opening penalty
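
A scheme with separate opening and extension costs is often called an affine gap penalty. The sketch below shows one common convention, in which the first gap position pays the opening cost and each further position pays the extension cost; other tools charge opening plus extension for every position, so check the documentation of the program in use. The function name and the default values (10 and 1) are illustrative assumptions.

```python
def affine_gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Penalty for a single gap of `length` consecutive positions.

    Convention assumed here: the first gapped position costs `gap_open`,
    each additional position costs `gap_extend`.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    # One gap of length 3 is cheaper than three separate gaps of length 1,
    # reflecting "rare to occur, easy to extend".
    one_long_gap = affine_gap_penalty(3)
    three_short_gaps = 3 * affine_gap_penalty(1)
    print(one_long_gap, three_short_gaps)
```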

Web servers for pairwise alignment

BLAST 2 sequences (bl2seq) at NCBI
• Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine
• Does not use an exact algorithm but a heuristic

Back to NCBI

BLAST – bl2seq

Bl2seq - query
• blastn: nucleotide
• blastp: protein

Bl2seq results

Bl2seq results (annotated alignment; labels: Match, Similarity, Dissimilarity, Gaps, Low complexity)

Bl2seq results:
• Bit score: a score for the alignment based on the number of identities, similarities, etc.
• Expect value (E-value): the number of alignments with an equal or better score one can “expect” to see by chance when searching a database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is real
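
The relation between bit score and E-value can be sketched with the standard approximation E = m · n · 2^(-S'), where S' is the bit score, m the query length and n the database (or subject) length. This is a simplified sketch: BLAST itself uses effective lengths with edge-effect corrections, so reported E-values will differ somewhat. The function name is hypothetical.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # The same bit score is less significant in a larger search space.
    print(expected_hits(50.0, query_len=150, db_len=150))
    print(expected_hits(50.0, query_len=150, db_len=10**9))
```

Each additional bit halves the E-value, and doubling the database size doubles it.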

BLAST – programs
• Query: DNA or protein
• Database: DNA or protein
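
The query/database combinations correspond to the standard BLAST program names. The lookup below is an illustrative sketch; the dictionary layout and the helper name `programs_for` are assumptions, while the program names themselves are the standard ones.

```python
# program name -> (query type, database type, what is compared)
BLAST_PROGRAMS = {
    "blastn": ("DNA", "DNA", "nucleotide query vs. nucleotide database"),
    "blastp": ("protein", "protein", "protein query vs. protein database"),
    "blastx": ("DNA", "protein", "translated nucleotide query vs. protein database"),
    "tblastn": ("protein", "DNA", "protein query vs. translated nucleotide database"),
    "tblastx": ("DNA", "DNA", "translated query vs. translated nucleotide database"),
}


def programs_for(query_type: str, db_type: str) -> list:
    """Return the program names matching a query/database type combination."""
    return sorted(
        name
        for name, (q, d, _) in BLAST_PROGRAMS.items()
        if q == query_type and d == db_type
    )


if __name__ == "__main__":
    for q in ("DNA", "protein"):
        for d in ("DNA", "protein"):
            print(f"query={q:8s} database={d:8s} -> {programs_for(q, d)}")
```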

BLAST – Blastp

Blastp - results

Blastp – results (cont’)

Blastp – acquiring sequences

blastp – acquiring sequences (cont’)

Fasta format – multiple sequences
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
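
A multi-sequence FASTA file is a series of records, each with a header line starting with `>` followed by one or more sequence lines. The following is a minimal parsing sketch using only the standard library; the function name `parse_fasta` is hypothetical, and established libraries handle more edge cases.

```python
def parse_fasta(text: str) -> list:
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = """>seq1 toy example
MVHLTPEEK
TAVNALWGK
>seq2 toy example
MVHLTPEEK
"""
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```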

Searching for remote homologs
• Sometimes BLAST isn’t enough
• In a large protein family, BLAST may find only the close members, while we want the more distant members as well
• PSI-BLAST
• Profile HMMs (not discussed in this exercise)

PSI-BLAST
• Position-Specific Iterated BLAST
• Workflow: regular BLAST → construct profile from BLAST results → BLAST profile search → final results
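
The "construct profile" step can be illustrated with a toy position-specific frequency profile built from already-aligned hits. This is a deliberately simplified sketch: the real PSI-BLAST profile uses sequence weighting, pseudocounts and log-odds scores, none of which are modeled here, and the function names are hypothetical.

```python
from collections import Counter


def build_profile(aligned: list) -> list:
    """Per-column residue frequencies from equal-length aligned sequences."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != "-")  # ignore gaps
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


def profile_score(profile: list, seq: str) -> float:
    """Toy score: sum of the profile frequencies of each residue of seq."""
    if len(seq) != len(profile):
        raise ValueError("sequence length must match profile length")
    return sum(col.get(c, 0.0) for col, c in zip(profile, seq))


if __name__ == "__main__":
    hits = ["MVHLT", "MVNLT", "MGHFT"]
    prof = build_profile(hits)
    print(profile_score(prof, "MVHLT"), profile_score(prof, "AAAAA"))
```

A sequence that agrees with the conserved columns scores higher than an unrelated one, which is what allows the profile search to pick up more distant family members than the original query alone.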

PSI-BLAST
• Advantage: PSI-BLAST looks for sequences that are close to the query, and learns from them to extend the circle of friends
• Disadvantage: if a WRONG hit is included, the search drifts to unrelated sequences (contamination). This gets worse with each iteration

BLAST – PSI-Blast

PSI-Blast - results
